Bug 484019 - CPU Temperature not available for AMD Threadripper 2950x
Summary: CPU Temperature not available for AMD Threadripper 2950x
Status: CONFIRMED
Alias: None
Product: ksystemstats
Classification: Frameworks and Libraries
Component: General
Version: unspecified
Platform: Arch Linux Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Plasma Bugs List
URL:
Keywords:
Depends on: 490675
Blocks:
Reported: 2024-03-19 20:16 UTC by duncanyoyo1
Modified: 2024-08-25 07:46 UTC
CC List: 9 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
sensors-detect (10.83 KB, text/plain)
2024-07-14 19:59 UTC, bmstettin
Description duncanyoyo1 2024-03-19 20:16:15 UTC
SUMMARY


STEPS TO REPRODUCE
1. Use an AMD Threadripper 2950X CPU (this likely also affects other models, but I am not sure which; probably most Ryzen and possibly AMD FX series processors as well)
2. Open plasma-systemmonitor
3. Try to add CPU temperature to something. 

OBSERVED RESULT
Tctl/Tdie are not there, CPU Min, Max, and Average all show 0.

EXPECTED RESULT
CPU temps to be shown

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 6.0.2
(available in About System)
KDE Plasma Version: 6.0.2
KDE Frameworks Version: 6.0.0
Qt Version: 6.0.2

ADDITIONAL INFORMATION
I was told to file a new bug for my specific issue. It is probably related to id=445917 and id=452763.
Comment 1 duncanyoyo1 2024-03-19 20:41:39 UTC
Qt version is 6.6.2 not 6.0.2
Comment 2 Nate Graham 2024-04-11 03:27:44 UTC

*** This bug has been marked as a duplicate of bug 474766 ***
Comment 3 duncanyoyo1 2024-06-12 14:39:02 UTC
https://bugs.kde.org/show_bug.cgi?id=474766

Still have this issue.
Comment 4 duncanyoyo1 2024-06-12 14:49:30 UTC
I'm still not sure why k10temp is blacklisted. I need that for my CPU temps.

I'm also not sure why this issue keeps getting closed as a dupe of another (different) issue, or as resolved.

It's not, and I only posted in the other bug thread because Nate closed this issue as a dupe of that one.

Now he has locked that thread, so back to this one I suppose.

It is hopefully clear by now that this is a separate issue. The issue is that the k10temp device is blacklisted in System Monitor.

Without that I cannot see my CPU temps.
Comment 5 nic.christin@gmail.com 2024-07-12 08:45:38 UTC
I'm having a similar issue on a Threadripper 1950X. All CPU temperatures (min, avg, max, as well as all per-core temperatures) are reported as 0. If I remember correctly, it started with the update to Plasma 6. It worked correctly on Plasma 5.

KDE Plasma Version: 6.1.2
KDE Framework Version: 6.3.0
Qt Version: 6.7.2

The sensors themselves are reporting the correct temperatures, as shown by the "sensors" command:

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +67.8°C  
Tdie:         +40.8°C  
Tccd1:        +68.0°C  

k10temp-pci-00cb
Adapter: PCI adapter
Tctl:         +67.8°C  
Tdie:         +40.8°C
Comment 6 bmstettin 2024-07-14 19:59:14 UTC
Created attachment 171663
sensors-detect
Comment 7 bmstettin 2024-07-14 20:44:09 UTC
It's not a duplicate of bug 474766.
The sensors you assume to be there do not exist on a Threadripper 1920X.

The sensor mentioned in bug 474766, "Hardware Sensors/coretemp-isa-0000/Core 0", does not exist on my system.

Where do you find the mysterious sensor that should show my CPU temperature when lm-sensors can't find it?
For the last 5 years, Tctl and Tdie were the only options.


Below is the full list of all sensors in my system after running sensors-detect with yes to all. The sensors-detect output is attached.
~ # sensors
nct6779-isa-0290
Adapter: ISA adapter
Vcore:                 424.00 mV (min =  +0.00 V, max =  +1.74 V)
in1:                     1.08 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:                    3.30 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
+3.3V:                   3.30 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                     1.84 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                   912.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                     1.37 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:                    3.44 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
Vbat:                    3.25 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                     0.00 V  (min =  +0.00 V, max =  +0.00 V)
in10:                  832.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                  864.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                    1.67 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                  920.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                  872.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                  2909 RPM  (min =    0 RPM)
fan2:                   910 RPM  (min =    0 RPM)
fan3:                     0 RPM  (min =    0 RPM)
fan4:                  1070 RPM  (min =    0 RPM)
fan5:                     0 RPM  (min =    0 RPM)
SYSTIN:                 +29.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:                 +31.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                 +7.0°C    sensor = thermistor
AUXTIN1:                +35.0°C    sensor = thermistor
AUXTIN2:                +33.0°C    sensor = thermistor
AUXTIN3:                +33.0°C    sensor = thermistor
SMBUSMASTER 0:          +56.5°C  
PCH_CHIP_CPU_MAX_TEMP:   +0.0°C  
PCH_CHIP_TEMP:           +0.0°C  
PCH_CPU_TEMP:            +0.0°C  
PCH_MCH_TEMP:            +0.0°C  
PCH_DIM0_TEMP:           +0.0°C  
TSI0_TEMP:              +56.8°C  
intrusion0:            ALARM
intrusion1:            ALARM
beep_enable:           disabled

k10temp-pci-00cb
Adapter: PCI adapter
Tctl:         +56.8°C  
Tdie:         +29.8°C  

nvme-pci-4100
Adapter: PCI adapter
Composite:    +43.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A  

amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx:      680.00 mV 
fan1:         511 RPM  (min =    0 RPM, max = 3600 RPM)
edge:         +37.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +50.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +66.0°C  (crit = +108.0°C, hyst = -273.1°C)
                       (emerg = +113.0°C)
PPT:          51.00 W  (cap = 244.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +56.8°C  
Tdie:         +29.8°C  
Tccd1:        +56.5°C
Comment 8 Thomas Berger 2024-07-19 21:29:09 UTC
I tried to hunt down this issue. While creating a working debug environment is a major pain in the ass, here is what I could get:

1. ksystemstats creates a KSysGuard::SensorsFeatureSensor for each CPU core
2. During the update run over the individual CPU cores, one update call is triggered for each core
3. Here is my black box. I have no clue how to debug the call from ksystemstats to ksysguard, maybe someone can help?
4. After the call to KSysGuard::SensorsFeatureSensor::update(), a call to ::value() returns an invalid QVariant.

So either the construction of the feature sensor or KSysGuard is broken here.
Comment 9 duncanyoyo1 2024-07-19 22:00:25 UTC
I don't think it's KSysGuard in this case, because the KSysGuard program itself actually works for me. I can see the temps with KSysGuard no problem.

If it was at fault I'd expect it to error there as well, but maybe they are not functioning the same?
Comment 10 Jiri Palecek 2024-07-22 10:22:53 UTC
Hi!

(In reply to Thomas Berger from comment #8)
> 3. Here is my black box. I have no clue how to debug the call from
> ksystemstats to ksysguard, maybe someone can help?

ksystemstats doesn't call ksysguard(d), it is everything in the /plugins/ source directory. In your case, /plugins/cpu/linuxcpuplugin.cpp.
Comment 11 Thomas Berger 2024-07-22 10:49:20 UTC
(In reply to Jiri Palecek from comment #10)
> Hi!
>
> ksystemstats doesn't call ksysguard(d), it is everything in the /plugins/
> source directory. In your case, /plugins/cpu/linuxcpuplugin.cpp.

`linuxcpu.cpp`, line 34:

>     m_temperature = KSysGuard::makeSensorsFeatureSensor(QStringLiteral("temperature"), chipName, feature, this);
Comment 12 Jiri Palecek 2024-07-22 13:37:42 UTC
(In reply to Thomas Berger from comment #11)
> (In reply to Jiri Palecek from comment #10)
> > Hi!
> >
> > ksystemstats doesn't call ksysguard(d), it is everything in the /plugins/
> > source directory. In your case, /plugins/cpu/linuxcpuplugin.cpp.
> 
> `linuxcpu.cpp`, line 34:
> 
> >     m_temperature = KSysGuard::makeSensorsFeatureSensor(QStringLiteral("temperature"), chipName, feature, this);

Yeah. And? I can't see your point. Incidentally, does this line run when you try to debug ksystemstats (run with arguments "--replace --remain")?

Can you verify that

> $ kstatsviewer --remain cpu/cpu0/temperature
> cpu/cpu0/temperature 54.75
> cpu/cpu0/temperature 55.375

works?
Comment 13 Jiri Palecek 2024-07-22 13:56:24 UTC
(In reply to Thomas Berger from comment #11)
> (In reply to Jiri Palecek from comment #10)
> > Hi!
> >
> > ksystemstats doesn't call ksysguard(d), it is everything in the /plugins/
> > source directory. In your case, /plugins/cpu/linuxcpuplugin.cpp.
> 
> `linuxcpu.cpp`, line 34:
> 
> >     m_temperature = KSysGuard::makeSensorsFeatureSensor(QStringLiteral("temperature"), chipName, feature, this);

Maybe I see what you're getting at. This creates a SensorsFeatureSensor from libksysguard (that's a different codebase than ksysguard). You might want to debug this: https://github.com/KDE/libksysguard/blob/079998ece198ee210fa16e5fd1f13f49473c94b6/systemstats/SensorsFeatureSensor.cpp#L145
Comment 14 Thomas Berger 2024-07-22 15:17:07 UTC
Sorry for the confusion, of course I was talking about libksysguard. I don't know why, but somehow I dropped the "lib" prefix.

To be able to debug this, I need a debug build of both parts, but the build system makes this hard for me right now: I am unable to generate an appropriate *Targets.cmake file from the build directory of my libksysguard debug build, as these files are generated with the install paths only.

And I could not find any good documentation for such debug environments.

I compared how the sensor object is created in both plugins (lmsensors and CPU), and at first glance I could not see a major difference.

My usual debug scenario also does not work here, because I cannot run this inside a VM: these are hardware sensors, which cannot be passed through into a VM (at least not to my knowledge).
Comment 15 Jiri Palecek 2024-07-22 16:18:25 UTC
(In reply to Thomas Berger from comment #14)
> Sorry for the confusion, of course i was talking about libksysguard. I don't
> know why, but somehow i dropped the "lib" prefix.
> 
> To be able to debug this, i need a debug build of both parts, but the build
> system makes this hard for me right now, as i am unable to generate a
> appropriate *Targets.cmake file from the build directory of my libksysguard
> debug builds, as generated with the install paths only.

Oh. I never needed any of that. I can only suggest:

- what is your distribution? Maybe it already has debug symbol packages installable. (eg. debian -> libksysguardsystemstats2-dbgsym)
- do you really need to build ksystemstats with rpath? If not, you don't need to worry about any Targets.cmake files, just point to the debugging libraries with

  LD_LIBRARY_PATH=/path/to/libksysguard-src/.../bin ksystemstats ...

- maybe, you don't need to debug libksysguard. Can you 100% ensure that this point is true

> 4. After the call to KSysGuard::SensorsFeatureSensor::update() a call to ::value() returns an invalid QVariant.

specifically, that it calls update() on the correct sensor?
You could just place a breakpoint in sensors_get_value from libsensors and check the return value. Or, if you have ltrace, you could run

ltrace -e sensors_get_value@* ksystemstats ...

to see at least if there aren't any errors.
Comment 16 Thomas Berger 2024-07-22 19:17:39 UTC
I was able to create a debug prefix. I found a very strange behavior:

1. In `linuxcpu.cpp`, line 89, we call `m_temperature->update()`
2. m_temperature is a pointer to `KSysGuard::SensorProperty` created by an earlier call to `KSysGuard::makeSensorsFeatureSensor`

The assumption would be that this call ends up in the override `SensorsFeatureSensor::update()`, but instead it calls the base class implementation `SensorProperty::update()`, which is empty.

I installed all plugins in my prefix, and every other plugin using `makeSensorsFeatureSensor` gets the update call "served" by `SensorsFeatureSensor::update()`, and I can break in this function as well. My initial assumption was that this is a linker issue, so I moved from gcc-14 to clang-18.1, but the effect stays the same.

The lmsensors and the gpu plugins work fine; I can't find the difference in the code that could cause this.

Here is the call from the cpu plugin:

```
* frame #0: 0x00007f8e0cfbf6fa libKSysGuardSystemStats.so.2`KSysGuard::SensorProperty::update(this=0x0000558f0e93dbe0) at SensorProperty.h:110:5
  frame #1: 0x00007f8e07af39f7 ksystemstats_plugin_cpu.so`LinuxCpuObject::update(this=0x0000558f0e8cc6a0, system=7800, user=47299, wait=460, idle=575242) at linuxcpu.cpp:90:20
```

And here is the call from the lmsensors plugin for another sensor

```
* frame #0: 0x00007fbde133f9c0 libKSysGuardSystemStats.so.2`KSysGuard::SensorsFeatureSensor::update(this=0x0000559ed44ac480) at SensorsFeatureSensor.cpp:147:22
  frame #1: 0x00007fbddbe9ea9b ksystemstats_plugin_lmsensors.so`LmSensorsPlugin::update(this=0x0000559ed44aa570) at lmsensors.cpp:71:17
```
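
For readers following along, here is a minimal sketch of why the same `update()` call can land in two different functions: virtual dispatch follows the dynamic type of the object, not the declared type of the pointer. The classes below are simplified stand-ins that only borrow the names from the backtraces; they are not the real libksysguard declarations, and `updateAndRead` and the value 56.8 are made up for illustration.

```cpp
// Simplified stand-ins for illustration only; not the real libksysguard classes.
struct SensorProperty {
    virtual ~SensorProperty() = default;
    // The base implementation does nothing, like the empty update() in frame #0.
    virtual void update() {}
    double value = 0.0;
};

struct SensorsFeatureSensor : SensorProperty {
    // Hypothetical override: pretend a temperature was read from the hardware.
    void update() override { value = 56.8; }
};

// Calls update() through a base-class reference and returns the stored value.
double updateAndRead(SensorProperty &sensor)
{
    sensor.update();
    return sensor.value;
}
```

If the base implementation runs, the object behind the pointer was constructed as a plain `SensorProperty`, so the place to look is wherever that object was created rather than the linker.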
Comment 17 Thomas Berger 2024-07-22 22:05:25 UTC
While hunting this down, I found another issue: https://bugs.kde.org/show_bug.cgi?id=490675

I could imagine that this is related. Overriding the same property value multiple times could trigger some other issue with the implementation of properties or something similar. I am not deep enough into Qt to understand the implications.
Comment 18 Jiri Palecek 2024-07-23 13:36:49 UTC
(In reply to Thomas Berger from comment #16)
> I was able to create a debug prefix. I found a very strange behavior:
> 
> 1. In `linuxcpu.cpp, line 89, we call `m_temperature->update()`
> 2. m_temperature is a pointer to `KSysGuard::SensorProperty` created by an
> earlier call to KSysGuard::makeSensorsFeatureSensor

Yeah, but is it really? It could be overwritten here https://github.com/KDE/ksystemstats/blob/b994c553f2e5d5d235f289c0112f1509b18e4e45/plugins/cpu/linuxcpu.cpp#L57 or here https://github.com/KDE/ksystemstats/blob/b994c553f2e5d5d235f289c0112f1509b18e4e45/plugins/cpu/cpu.cpp#L77. Although it totally shouldn't and I couldn't find any recent change in the (scant) git history that could do anything with it. It could be some undefined behavior, but I couldn't find that either. Maybe it could be some linker snafu?

So to check it, if you can, please try this:

1) run ksystemstats under valgrind
2) run gdb (I see you are using lldb, but lldb is totally useless on Debian, so I'm using gdb)
3) enter commands into gdb:
> target remote |vgdb
# to connect to the valgrind-ed program and debug it
> break LinuxCpuObject::update
> cont
# to set breakpoint and continue
> print m_temperature
# to print the address of the SensorProperty, eg "(KSysGuard::SensorsFeatureSensor *) 0x8a68290"
> monitor check_memory defined 0x8a68290
# to print where the sensor was allocated. use the same memory address as returned from the previous command
# this uses valgrind's bookkeeping info
# and last
> info vtbl m_temperature
# to check the dynamic type of m_temperature

and post the output from gdb.
Comment 19 Thomas Berger 2024-07-23 19:41:07 UTC
Yeah, that proved some of my assumptions yesterday:

The object is the one allocated in
```
void LinuxCpuObject::makeSensors()
{
    BaseCpuObject::makeSensors();
    m_frequency = new KSysGuard::SensorProperty(QStringLiteral("frequency"), this);
    if (!m_temperature) {
        m_temperature = new KSysGuard::SensorProperty(QStringLiteral("temperature"), this);
    }
}
```

And the vtable clearly shows that we are using the base class.

This led me down the correct path:
- We define a sensor via `makeSensorsFeatureSensor` for each CPU on the first k10temp chip found
- For the other chips found, we overwrite the newly created sensors with null pointers

This happens because a property is added to our SensorObject (in this case the LinuxCpuObject) on sensor creation. `makeSensorsFeatureSensor` bails out if the sensor already exists on our SensorObject. And it does, we just created it.
If `makeSensorsFeatureSensor` bails out, a nullptr is returned.

After the call to addSensors, which leads down to the creation of our temperature sensors, `initialize` is called on all CPU objects, adding the "missing" sensors with default implementations.
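
The sequence described above can be reproduced in a small standalone sketch. Everything here is a simplified model under stated assumptions: `CpuObject`, `makeFeatureSensor`, `readsHardware` and `temperatureWorks` are hypothetical names, and the placeholder sensor is stored under a separate key only to keep the model short. It shows the mechanism; it is not the actual ksystemstats code.

```cpp
#include <map>
#include <memory>
#include <string>

// Simplified stand-ins; names echo the thread but this is not the real code.
struct SensorProperty {
    virtual ~SensorProperty() = default;
    virtual bool readsHardware() const { return false; }
};

struct SensorsFeatureSensor : SensorProperty {
    bool readsHardware() const override { return true; }
};

struct CpuObject {
    // Sensors registered on this object, keyed by id (owning).
    std::map<std::string, std::unique_ptr<SensorProperty>> sensors;
    SensorProperty *m_temperature = nullptr;

    // Models the described behaviour: returns nullptr if the id already exists.
    SensorProperty *makeFeatureSensor(const std::string &id)
    {
        if (sensors.count(id)) {
            return nullptr;
        }
        auto &slot = sensors[id];
        slot = std::make_unique<SensorsFeatureSensor>();
        return slot.get();
    }

    // Models makeSensors(): adds a default sensor when none is set.
    void makeSensors()
    {
        if (!m_temperature) {
            // Assumption of this sketch: the placeholder gets its own key.
            auto &slot = sensors["temperature-default"];
            slot = std::make_unique<SensorProperty>();
            m_temperature = slot.get();
        }
    }
};

// Runs the sequence for a number of k10temp chips and reports whether
// m_temperature ends up pointing at a sensor that reads hardware.
bool temperatureWorks(int chipCount)
{
    CpuObject cpu;
    for (int chip = 0; chip < chipCount; ++chip) {
        // Unguarded assignment: a second chip overwrites the pointer with nullptr.
        cpu.m_temperature = cpu.makeFeatureSensor("temperature");
    }
    cpu.makeSensors();
    return cpu.m_temperature->readsHardware();
}
```

With one chip the pointer keeps the hardware-backed sensor. With two chips, as on the Threadrippers in this report, the second pass nulls the pointer and the default placeholder takes over.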


While there are multiple ways to fix this, none of them seems like a good idea. We would have to map the temperatures to the cores on the appropriate die, or the user loses important information (imagine one die sitting near the upper limit because of a thermal/contact issue, while the first die is OK ...).

I would propose that we wait how the discussion in https://bugs.kde.org/show_bug.cgi?id=490675 plays out, before taking actions here.


Thanks btw, I learned something new today!
Comment 20 Jiri Palecek 2024-07-24 11:18:27 UTC
(In reply to Thomas Berger from comment #19)
> This led me down the correct path:
> - We define a Sensor via makeSensorsFeatureSensor for each CPU on the first
> found k10temp chip
> - for the other found chips, we override the newly created sensors with null
> ptrs

Yeah! Good catch.

> While there are multiple ways to fix this, none of them seems like a good
> idea. 

Well, for a start, guarding the call to makeSensorsFeatureSensor with an if (!m_temperature) seems warranted. Or else, we could get rid of m_temperature altogether (and make the code cleaner).
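
A rough sketch of what such a guard could look like, assuming a factory that may return a null pointer. `Sensor` and `assignOnce` are hypothetical names for illustration, not part of ksystemstats or libksysguard.

```cpp
#include <functional>

// Hypothetical stand-in for a sensor; not the real libksysguard type.
struct Sensor {
    int chipIndex;
};

// Calls `create` only while no sensor has been assigned, so a later call
// that would return nullptr can no longer clobber the first result.
void assignOnce(Sensor *&current, const std::function<Sensor *()> &create)
{
    if (!current) {
        current = create();
    }
}
```

Note that this keeps only the reading from the first chip, so the concern quoted below about losing the second die's temperature still applies.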

> We would have to map the temperatures to the cores on the appropriate
> DIE, or the user loses important information (imagine one DIE sitting near
> the upper limit because of an thermal/contact issue, but the first DIE is ok
> ....).

Yeah, that's exactly true. Also pertains to dual cpu setups. Maybe it could suffice to map the sensors to correct packages through differing NUMA nodes (but are they always different?) and then use the Tccd* sensors and die numbers from /sys/devices/system/cpu/*/topology. That would need experimentation with the actual hardware.
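
As a starting point for that experimentation, here is a sketch that collects the package and die ids per logical CPU from the kernel's `topology/physical_package_id` and `topology/die_id` files. The names `CpuLocation`, `readInt` and `readCpuTopology` are made up for this example, and whether these ids actually line up with the k10temp chips or the Tccd* sensors on a given machine is untested.

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

struct CpuLocation {
    int package = -1; // from topology/physical_package_id
    int die = -1;     // from topology/die_id (stays -1 if the file is missing)
};

// Reads a single integer from a sysfs-style file; empty if missing or unparsable.
static std::optional<int> readInt(const fs::path &file)
{
    std::ifstream in(file);
    int value = 0;
    if (in >> value) {
        return value;
    }
    return std::nullopt;
}

// Maps logical CPU number -> package/die by scanning `root`
// (normally /sys/devices/system/cpu) for cpuN directories.
std::map<int, CpuLocation> readCpuTopology(const fs::path &root)
{
    std::map<int, CpuLocation> result;
    std::error_code ec;
    for (const auto &entry : fs::directory_iterator(root, ec)) {
        const std::string name = entry.path().filename().string();
        // Accept only "cpu" followed by digits (skips cpufreq, cpuidle, ...).
        if (name.size() <= 3 || name.compare(0, 3, "cpu") != 0
            || name.find_first_not_of("0123456789", 3) != std::string::npos) {
            continue;
        }
        const fs::path topology = entry.path() / "topology";
        CpuLocation location;
        location.package = readInt(topology / "physical_package_id").value_or(-1);
        location.die = readInt(topology / "die_id").value_or(-1);
        result[std::stoi(name.substr(3))] = location;
    }
    return result;
}
```

Calling `readCpuTopology("/sys/devices/system/cpu")` on the affected hardware and comparing the result with the `sensors` output would show whether a die-based mapping is feasible. Taking the root directory as a parameter keeps the function testable against a fake directory tree.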

> I would propose that we wait how the discussion in
> https://bugs.kde.org/show_bug.cgi?id=490675 plays out, before taking actions
> here.

Yeah but that's for somebody else to decide.
 
> Thx btw, now i learned something new today!

Good to hear that.
Comment 21 duncanyoyo1 2024-07-24 14:37:38 UTC
I think the NUMA layout depends on how they have the CPU set up. 

I know on mine I have a few options, and if I set the memory to interleaved, it will only show as 1 NUMA node.
Comment 22 triffid.hunter 2024-08-25 07:46:41 UTC
I'm having the same issue with thermal monitor panel widget and an Intel i7 7700k, is this the appropriate bug or is there another one I was unable to find?

lm_sensors finds both coretemp-isa-0000 with 4 temperature readings as well as an nct6793 with a CPUTIN reading, but neither of these show up in thermal monitor panel widget.

It does pick up my disks and GPU, and it used to find coretemp in older plasma versions (currently 5.27.11, not sure which is the last version that worked but it was less than a year ago)

So my CPU temperature readout which was previously working fine now says "OFF" which is a bit strange given that my system is running just fine…