Bug 438539 - Don't notify about acknowledged errors in the devices error log
Summary: Don't notify about acknowledged errors in the devices error log
Status: CONFIRMED
Alias: None
Product: plasma-disks
Classification: Plasma
Component: general
Version: unspecified
Platform: Neon Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Plasma Bugs List
URL:
Keywords:
Duplicates: 438732 439018 439487 439569
Depends on:
Blocks:
 
Reported: 2021-06-13 08:19 UTC by Bernhard Scheirle
Modified: 2021-08-04 14:32 UTC
CC List: 7 users

See Also:
Latest Commit:
Version Fixed In:


Description Bernhard Scheirle 2021-06-13 08:19:18 UTC
Hi,

I have a device that has a 'good' overall smart status, but contains entries in its error log. 
In my case these entries are really old and an extended self-test reports no errors.
But it is good that plasma-disks notifies about that:
"The SMART firmware is not reporting a failure, but there are early signs of malfunction. [...] The device error log contains records of errors."

The downside is that the notification is triggered on every startup.
Once a user has done everything he wants to do (e.g. creating a backup), he is forced to ignore this notification.
This is annoying and carries a high risk of also accidentally ignoring new error entries, more severe errors, or errors on a different disk.


It would be nice to be able to acknowledge the error log entries
and then plasma-disks would only notify the user if new error entries are present.
Of course an overall change in the SMART status should still be notified.

To keep it simple, on acknowledgement plasma-disks could store only the total error count
and notify the user only if that count changed.

Thanks
Bernhard

SOFTWARE/OS VERSIONS
Operating System: KDE neon 5.22
KDE Plasma Version: 5.22.0
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.3
Kernel Version: 5.4.0-74-generic (64-bit)
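
The acknowledgement scheme proposed in the description above can be sketched roughly as follows. This is only an illustration of the idea, not plasma-disks code: the class `AckStore`, its methods, and the device names are hypothetical, and persisting the stored counts to disk is left out.

```python
from dataclasses import dataclass, field


@dataclass
class AckStore:
    """Hypothetical store: per device, the error-log count the user acknowledged."""

    acknowledged: dict = field(default_factory=dict)

    def acknowledge(self, device: str, error_count: int) -> None:
        # Remember the total error count at the moment of acknowledgement.
        self.acknowledged[device] = error_count

    def should_notify(self, device: str, error_count: int, smart_passed: bool) -> bool:
        # A change in the overall SMART status is always reported.
        if not smart_passed:
            return True
        # An empty error log gives nothing to report.
        if error_count == 0:
            return False
        # Notify only if the count differs from the acknowledged one.
        return error_count != self.acknowledged.get(device, 0)
```

With this sketch, a disk with three old log entries stops notifying after `acknowledge("/dev/sda", 3)`, and notifies again as soon as the count becomes 4 or the overall status fails.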
Comment 1 elandorr.fenydark 2021-06-14 07:27:40 UTC
Hi,

I agree with the problem, however the suggested solution is probably not a good one.

In my case, I have the same report, except the "device error log contains records of errors" part.
After digging into smartctl, I found no error log indeed, only that one time in the past the temperature went to one degree hotter than expected (WORST=44, THRESH=45), so I believe this is what KDE is reporting (a more detailed message in KDE would have been appreciated).

Given the situation, it doesn't look very dangerous, however I'm in the same case as you, if I ignore the message, I won't get notified of future more severe issues (including error logs or not).

I'm not sure how it could be designed to acknowledge some specific past errors…
Comment 2 Harald Sitter 2021-06-15 15:11:42 UTC
It's tricky to be sure. We don't actually know, or rather we don't care because it'd cost a whole lot of code maintenance, how exactly things are failing. If smartctl says there are errors in the logs we just take that information and run with it. Having the ability to acknowledge "yo, I've seen the error and this error isn't an error, go away" is exceptionally tricky to implement because we don't know what "this error" actually is in terms of raw data. I wonder if it might be sufficient to add an option to ignore these "soft" errors? They are not actually indicative of the firmware saying something is wrong but rather something may be wrong but the firmware thinks it's not a biggie, so depending on how much you trust the firmware you could ignore the "soft" errors.

In fact, when this feature was added I did ponder that we could simply disable the notification if we find it raises false positives. Going slightly over temperature limits seems like a false positive to me. That is: we could disable the notification but if you open kinfocenter manually it'd tell you that there may be signs of problems.
Comment 3 Bernhard Scheirle 2021-06-15 17:19:45 UTC
Another approach would be to allow setting a smartctl return code that gets ignored*¹ (per disk).
smartctl's return code encodes the type of error/warning via a bitmask.

This would allow basic filtering without having special knowledge in plasma-disks.

For example I could ignore the return code 0x40, which indicates errors in the error log,
but would still get a notification if additionally an attribute falls below a threshold (0x40|0x20 = 0x60).

While this is not as powerful as ignoring individual errors, compared to the implementation/maintenance overhead this seems quite nice.

*¹: No notification but still displayed in kinfocenter
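
The return-code filtering suggested in comment 3 could look roughly like the sketch below. The bit values follow the EXIT STATUS section of the smartctl man page; the function names, the per-disk `ignored_mask` parameter, and the choice to consider only the health-related bits 3 to 7 (leaving out the command and device-open failure bits 0 to 2) are assumptions made for illustration.

```python
# Exit status bits as documented in the smartctl man page.
DISK_FAILING = 0x08                  # bit 3: SMART status says "DISK FAILING"
PREFAIL_BELOW_THRESHOLD = 0x10       # bit 4: prefail attributes <= threshold
ATTRIBUTE_BELOW_IN_PAST = 0x20       # bit 5: attributes were <= threshold in the past
ERROR_LOG_HAS_ENTRIES = 0x40         # bit 6: device error log contains errors
SELF_TEST_LOG_HAS_ERRORS = 0x80      # bit 7: self-test log contains errors

# Assumption: only the health bits matter for notifications.
HEALTH_BITS = 0xF8


def remaining_bits(exit_status: int, ignored_mask: int) -> int:
    """Return the health bits that are set and not ignored for this disk."""
    return exit_status & HEALTH_BITS & ~ignored_mask


def should_notify(exit_status: int, ignored_mask: int) -> bool:
    """Notify if any health bit survives the per-disk ignore mask."""
    return remaining_bits(exit_status, ignored_mask) != 0
```

Ignoring 0x40 for a disk suppresses the notification while only the error log bit is set, but a status of 0x60 still leaves 0x20 set and therefore still notifies.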
Comment 4 Nate Graham 2021-06-15 21:38:42 UTC
Could we offer a per-error option to hide it? Like, take the error, hash it, save that somewhere, and then compare that to a list of ignored error hashes, and then offer the user the ability to ignore that particular error?

I don't think it makes sense to show "minor" errors in KInfoCenter. Regular people don't randomly open KInfoCenter and look for errors like us weirdos do. :)
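
The per-error hashing idea from comment 4 might be sketched as below. It assumes each error log entry is available as a stable text string, which comment 2 points out plasma-disks does not currently have; `error_fingerprint`, `unseen_errors`, and the serial-number parameter are hypothetical names, and SHA-256 is just one possible hash.

```python
import hashlib


def error_fingerprint(device_serial: str, raw_entry: str) -> str:
    """Hash one error entry together with the disk it belongs to."""
    data = f"{device_serial}\n{raw_entry.strip()}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def unseen_errors(device_serial: str, raw_entries: list, ignored: set) -> list:
    """Return the entries whose fingerprint is not in the ignored set."""
    return [
        entry
        for entry in raw_entries
        if error_fingerprint(device_serial, entry) not in ignored
    ]
```

Including the serial number in the hash keeps an ignored entry on one disk from hiding an identical-looking entry on another disk.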
Comment 5 Harald Sitter 2021-06-16 10:33:54 UTC
I do like Bernhard's suggestion a lot. That'd work neatly. Question is how to put that into a UI that makes sense ^^

@Nate the point to note is that these errors aren't errors. They are "something is wrong or maybe not" info. It's why we used such imprecise wording. According to the firmware the overall status is OK but as per https://bugs.kde.org/show_bug.cgi?id=429804 if your disk firmware is silly it might still say all is OK when sectors are actually failing. So the way I am looking at it, if we had no notification that'd still be very reflective of what the firmware actually does: "pretend all is well, when possibly it is not". It's also why I wasn't exactly overjoyed with adding the feature. We are really just hiding crappy firmwares.
Comment 6 Nate Graham 2021-06-16 16:39:47 UTC
*** Bug 438732 has been marked as a duplicate of this bug. ***
Comment 7 Harald Sitter 2021-07-06 04:25:15 UTC
Git commit 6dc6cbc3bc5ab7a76727b55fe697d94778e84262 by Harald Sitter.
Committed on 23/06/2021 at 14:48.
Pushed by sitter into branch 'master'.

don't notify on instabilities

this feature was added with the intent to see whether it would raise too
many false positives. unfortunately that didn't turn up during beta
testing but since release many people found this confusing. let's
disable it for 5.22+ again until we can create a better solution.

the instabilities are still shown in the UI. they are not notified on
though

M  +4    -3    src/smartnotifier.cpp

https://invent.kde.org/plasma/plasma-disks/commit/6dc6cbc3bc5ab7a76727b55fe697d94778e84262
Comment 8 Harald Sitter 2021-07-06 04:25:42 UTC
Git commit e75432da5645bc4f9de96b5c9851ca8332396181 by Harald Sitter.
Committed on 06/07/2021 at 04:25.
Pushed by sitter into branch 'Plasma/5.22'.

don't notify on instabilities

this feature was added with the intent to see whether it would raise too
many false positives. unfortunately that didn't turn up during beta
testing but since release many people found this confusing. let's
disable it for 5.22+ again until we can create a better solution.

the instabilities are still shown in the UI. they are not notified on
though


(cherry picked from commit 6dc6cbc3bc5ab7a76727b55fe697d94778e84262)

M  +4    -3    src/smartnotifier.cpp

https://invent.kde.org/plasma/plasma-disks/commit/e75432da5645bc4f9de96b5c9851ca8332396181
Comment 9 Nate Graham 2021-07-29 17:28:54 UTC
*** Bug 439018 has been marked as a duplicate of this bug. ***
Comment 10 Nate Graham 2021-07-29 17:29:32 UTC
Anything more to do here?
Comment 11 Nate Graham 2021-08-02 14:45:55 UTC
*** Bug 439487 has been marked as a duplicate of this bug. ***
Comment 12 Nate Graham 2021-08-02 14:46:16 UTC
*** Bug 439569 has been marked as a duplicate of this bug. ***
Comment 13 Harald Sitter 2021-08-04 14:32:18 UTC
Yes, a rework of the system as a whole. I started but got bogged down by it being super invasive.

There is also the question of whether we really should warn on this stuff by default. From the bug reports I've seen there is about twice as much evidence of notifications leading to false positives as of not notifying leading to false negatives. And that is also why I've seen none of our end users able to deal with the notification. Unless you know how SMART works and how disk damage **may** manifest you have zero chance of making heads or tails of the soft warnings, even if we actually dumped the raw smartctl output in the UI. This entire feature continues to irk me greatly as it runs counter to the original design goal. If we can't tell the user "yo, change the disk" with reasonable certainty we really should just tell them nothing, methinks.