Bug 490675 - [cpu_plugin] k10temp does not work for multi-die systems
Summary: [cpu_plugin] k10temp does not work for multi-die systems
Status: CONFIRMED
Alias: None
Product: ksystemstats
Classification: Frameworks and Libraries
Component: General (other bugs)
Version First Reported In: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Plasma Bugs List
URL:
Keywords:
Depends on:
Blocks: 484019
  Show dependency treegraph
 
Reported: 2024-07-22 21:59 UTC by Thomas Berger
Modified: 2024-09-19 21:04 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Topology of 2970WX ThreadRipper (65.45 KB, image/png)
2024-07-22 21:59 UTC, Thomas Berger
Details
PCI Bus devices (12.29 KB, text/plain)
2024-07-23 18:17 UTC, Thomas Berger
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas Berger 2024-07-22 21:59:14 UTC
Created attachment 171913 [details]
Topology of 2970WX ThreadRipper

SUMMARY

The k10temp implementastion in the cpu_plugin of ksystemstats maps the value of tdie or tctl to all CPU cores.
This ignores the fact, that there are multi-die CPUs like ThreadRipper.

In `linuxcpuplugin.cpp`, `LinuxCpuPluginPrivate::addSensorsAmd()` is called once for each found k10temp sensor. Each call sets the found sensor for all cpu cores.

On ThreadRipper Systems, more then one die is present (see attachment). 

This looses important information. Also, the k10temp sensor is blacklisted in the lmsensors plugin, despite the fact that this seems to be the only logical way to provide those temperatures. 


STEPS TO REPRODUCE
1. Run a threadripper system
2. get the temperatures for all CPU cores
3. compare them 

OBSERVED RESULT
all values are identical 

EXPECTED RESULT
the values should differ, as different cores are on different dies with different temperatures

SOFTWARE/OS VERSIONS
git commit b994c553f2e5d5d235f289c0112f1509b18e4e45

ADDITIONAL INFORMATION
This may also be related to missing temperature values over all #474766
Comment 1 Thomas Berger 2024-07-22 22:03:32 UTC
(In reply to Thomas Berger from comment #0)
> This may also be related to missing temperature values over all #474766

Sorry, i used the wrong ID, here is the correct one: https://bugs.kde.org/show_bug.cgi?id=484019
Comment 2 Thomas Berger 2024-07-22 22:11:39 UTC
Output of sensors:
# sensors k10temp-pci-\*                                                                                                                                                                                                                                                                                                                                                            Di 23 Jul 2024 00:10:59 CEST
k10temp-pci-00d3
Adapter: PCI adapter
Tctl:         +59.4°C  
Tdie:         +32.4°C  

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +60.0°C  
Tdie:         +33.0°C  
Tccd1:        +59.2°C  
Tccd2:        +59.8°C  
Tccd3:        +57.8°C  

k10temp-pci-00db
Adapter: PCI adapter
Tctl:         +58.6°C  
Tdie:         +31.6°C  

k10temp-pci-00cb
Adapter: PCI adapter
Tctl:         +58.9°C  
Tdie:         +31.9°C
Comment 3 Jiri Palecek 2024-07-23 13:36:56 UTC
Yeah, that would be nice, but there would need to be a reliable way to assign the sensors to the individual cpus. Just out of curiosity, could you post the output of

grep -e . -r /sys/bus/pci/devices/0000\:00\:{18,19,1a,1b}.3/

Also, I wonder what are the Tccd* temperatures. Why are they only 3? And why are they so high?
Comment 4 Thomas Berger 2024-07-23 18:17:58 UTC
Created attachment 171934 [details]
PCI Bus devices
Comment 5 Thomas Berger 2024-07-23 18:21:16 UTC
(In reply to Jiri Palecek from comment #3)
> Yeah, that would be nice, but there would need to be a reliable way to
> assign the sensors to the individual cpus.

After further thoughts, i don't think this scales well on user side. Having 48 Temperature values for 4 DIEs, this will most likely get very confusing. I understand that there is the wish to keep the behavior identical across different systems with different CPU types, but AMD works differently, and especially on workstations with a high CPU count, having 4 clearly defined values instead of 12x4 may make more sense.

> Just out of curiosity, could you post the output of
> 
> grep -e . -r /sys/bus/pci/devices/0000\:00\:{18,19,1a,1b}.3/

See attachment.

> Also, I wonder what are the Tccd* temperatures. Why are they only 3? And why
> are they so high?

From my understanding, those are the Tctl temperature of the other 3 dies.
Comment 6 Thomas Berger 2024-07-23 19:48:54 UTC
Further investigations into https://bugs.kde.org/show_bug.cgi?id=484019 showed, that this "design issue" and the bug are indeed related. The current assumption of the code, that there will always be only one, and exactly one, k10temp sensor per AMD system (which is obviously false) leads to the error seen (missing values).
Comment 7 Thomas Berger 2024-07-23 20:32:57 UTC
I would like to propose the following fix:

1. Add a new list of SensorProperties to LinuxAllCpusObject
2. Add each Tdie temperature (one per k10temp) to this list as "DIE%id temperature"
3. If more then one k10temp sensor is found REMOVE the temperature objects from the CPU cores

This would allow users to monitor their temperatures on multi-DIE systems, while users with single-DIE CPUs don't have changes in their views.

Instead of removing the temperature values from each CPU object, we could calculate an average over all DIE temperatures and provide them there, for consistency, but this may provide a false impression to the user about the current state of his thermals.

I am willing to implement those changes and provide a patch.
Comment 8 bmstettin 2024-09-09 16:18:54 UTC
Dear  Maintainer,
can please put your search for a perfect solution on hold, and instead go with the working solution.

Your attempt to be user friendly has gotten out of bounds.
To deny users access to the CPU Temperature sensors while the search for the perfekt solution is going on, isn't exactly user friendly. 

version 6.1.4 still no working  Temp Sensors  for multi die cpu 

There was a Working solution  
https://invent.kde.org/plasma/ksystemstats/-/merge_requests/84

I was forced with the update to KDE6 to uninstall ksysguard , therefore my patience is running a bit thin now.