Bug 380008 - System randomly freezes or crashes to the login screen, glitches until rebooted
Summary: System randomly freezes or crashes to the login screen, glitches until rebooted
Status: RESOLVED FIXED
Alias: None
Product: plasmashell
Classification: Plasma
Component: general
Version: master
Platform: Other Linux
Importance: NOR major
Target Milestone: 1.0
Assignee: David Edmundson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-19 13:52 UTC by Mircea Kitsune
Modified: 2020-11-12 00:39 UTC
CC List: 5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
Xorg.0.log.old (48.94 KB, application/x-trash), 2017-05-19 13:55 UTC, Mircea Kitsune
Xorg.0.log (106.66 KB, text/x-log), 2017-05-19 13:55 UTC, Mircea Kitsune
xsession-errors-:0 (137.64 KB, text/plain), 2017-05-19 13:55 UTC, Mircea Kitsune
journalctl (902.02 KB, text/plain), 2017-05-19 13:56 UTC, Mircea Kitsune
dmesg (209.44 KB, text/plain), 2017-05-19 13:56 UTC, Mircea Kitsune
lspci (5.81 KB, text/plain), 2017-05-19 13:56 UTC, Mircea Kitsune
Photo of the corrupt image on the screen (1.75 MB, image/jpeg), 2017-06-18 18:59 UTC, Mircea Kitsune
Screenshot of "top" (204.40 KB, image/png), 2017-07-05 13:14 UTC, Mircea Kitsune
Memtest86 screenshot (2.34 MB, image/jpeg), 2017-08-04 12:32 UTC, Mircea Kitsune
Output of "dmesg -w" (89.04 KB, text/plain), 2017-08-05 12:24 UTC, Mircea Kitsune
Output of "dmesg -w" (full) (463.67 KB, text/plain), 2017-08-07 20:52 UTC, Mircea Kitsune

Description Mircea Kitsune 2017-05-19 13:52:44 UTC
Approximately once every 1 to 3 days of uptime, the system experiences a sudden and inexplicable crash: The image completely freezes in place, although unlike similar crashes in the past I can keep moving the mouse pointer around. A few seconds afterward, I find myself in a black console... and a few seconds after that, I'm back at the login screen. If I attempt to log back in, however, the image either freezes again or desktop effects stop working, without any error message as to why. Not even forcefully restarting X11 (Control + Alt + Backspace twice) fixes the remaining glitches, and the only way to truly recover the system is to reboot.

The crashes are completely random, but always triggered by a desktop event. They are not exclusive to desktop effects, however: the freeze occurs even with compositing completely disabled, although slightly more rarely. It sometimes occurs while a desktop effect is playing (like the desktop switching cube animation), but usually when I select a new window or a panel pops up (alt-tab switching to another window can alone trigger this).

I use the free video drivers and default system packages, all latest versions of openSUSE Tumbleweed. My card is a Radeon R7 370, GCN 1.0 on RadeonSI.
Comment 1 Mircea Kitsune 2017-05-19 13:55:09 UTC
Created attachment 105634 [details]
Xorg.0.log.old
Comment 2 Mircea Kitsune 2017-05-19 13:55:23 UTC
Created attachment 105635 [details]
Xorg.0.log
Comment 3 Mircea Kitsune 2017-05-19 13:55:43 UTC
Created attachment 105636 [details]
xsession-errors-:0
Comment 4 Mircea Kitsune 2017-05-19 13:56:00 UTC
Created attachment 105637 [details]
journalctl
Comment 5 Mircea Kitsune 2017-05-19 13:56:13 UTC
Created attachment 105638 [details]
dmesg
Comment 6 Mircea Kitsune 2017-05-19 13:56:23 UTC
Created attachment 105639 [details]
lspci
Comment 7 Mircea Kitsune 2017-05-19 14:01:56 UTC
Important: This is a mirrored report of an older issue, which has been happening for nearly 3 months now. I decided to post it here as well because it seems like it might be KDE related, considering only Plasma Desktop / KWin appears to trigger this specific crash. The original reports with openSUSE and FreeDesktop can be found below; their comments contain more information about what has been happening as I've been noticing it:

https://bugzilla.opensuse.org/show_bug.cgi?id=1028575
https://bugs.freedesktop.org/show_bug.cgi?id=100306

This is a major problem and I'm trying to find a solution for it ASAP! Any feedback worth sharing is welcome, and I'd like to know of any possible fix or workaround. Please don't suggest disabling desktop compositing, however, as I've already verified that the issue takes place with it turned off.
Comment 8 Mircea Kitsune 2017-06-18 18:59:31 UTC
Created attachment 106157 [details]
Photo of the corrupt image on the screen

I have discovered some very important details today. Everyone following up on the report, please see this comment!

Recently I realized that a useful test would be to switch to a virtual console once I notice the crash, in order to see how the system behaves there. A few minutes ago another freeze took place, so I instantly hit Control + Alt + F1 to go to a console. What I noticed was pretty remarkable and sheds light on a few aspects:

I could keep typing in the console for nearly 10 seconds, but after that the exact same behavior still took place (the monitor turned itself off and on twice, then the image froze). This time, however, I was able to toggle the NumLock LED a minute after the crash, while also seeing the HDD LED still working. That means this is not (always) a total system freeze such as a kernel panic. Instead, it appears that the image output becomes corrupted and stays that way, freezing only specific components with it (I was still unable to issue a blind reboot command, for instance). To put everything into an approximate timeline, this is what happened:

00 seconds in: The crash occurs.
02 seconds in: I notice and instantly hit Control + Alt + F1.
05 seconds in: I'm taken to a console where everything works fine: I see the blinking cursor, can write my login and password, etc.
12 seconds in: Suddenly the monitor turns off and back on several times, then the image remains frozen in place.

This time however, the screen did not remain turned off or black. Instead it stayed stuck in a state showing corrupt lines and rectangles of random colors. I took a photo of my screen with my smartphone, which I attached to this issue.
Comment 9 Mircea Kitsune 2017-07-05 13:14:50 UTC
Created attachment 106444 [details]
Screenshot of "top"

Lots of important new information on this freeze, which of course still occurs with the latest openSUSE Tumbleweed system packages:

First and foremost, the problem does not happen in every session, and this is not always influenced by updates! During an interval in which I installed absolutely no relevant package changes, the following happened: The freeze occurred after just about 8 hours of uptime... after that I restarted the machine, but then I had 4 days of uptime with no freeze! This leads me to believe that certain applications or system actions prepare the system with a "time bomb", which then causes switching between windows or desktops to produce the freeze... however, I have no way yet to know what mines the system and what doesn't, as I use too many applications at once to figure out which might be responsible.

Anyway, another crash happened today. Once more I quickly hit Control + Alt + F1 to switch to a virtual console. This caused the image to become corrupted on the monitor, however the system remained responsive and didn't actually freeze. So I went to my mother's computer and logged in via SSH, which indeed still worked. I was able to issue a reboot command, which caused the image to briefly unfreeze as the monitor turned on and off a few more times... I could see a few KDE error messages about applications crashing, before the system actually went ahead and rebooted successfully! However, this is only possible if I switch to a console quickly enough when noticing the freeze start to happen; if not, the whole machine freezes and not even SSH responds from other devices!

While I was in SSH, I decided to run "top" and take a screenshot of my processes (while the computer was frozen and with corrupt image stuck on the screen). I can't tell if anything is out of the ordinary such as a memory leak, but I'm attaching a screenshot of it here.
Comment 10 Mircea Kitsune 2017-07-05 18:21:55 UTC
Thought I'd also post another detail that might be useful. I'm not sure how much it relates to the freeze, but better safe than sorry: I have the following two environment variables added to my ~/.profile file, which basically tell Mesa to post errors to a log file:

export MESA_DEBUG=1
export MESA_LOG_FILE=/home/mircea/.mesa_stderr

There's one recurring line which keeps getting printed in there. It's added periodically with no side effects, but I imagine it could still have some relation to the trigger of the freeze:

Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)
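
To check whether the frequency of this message changes in the hours before a freeze, the log could be tallied with a few lines of Python. This is only a minimal sketch: it assumes every relevant line starts with the "Mesa: User error: " prefix seen above, and the function name count_mesa_errors is hypothetical.

```python
from collections import Counter

# Assumption: each relevant log line starts with this prefix, as in the line quoted above.
PREFIX = "Mesa: User error: "

def count_mesa_errors(lines):
    """Count how often each distinct Mesa user-error message appears."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith(PREFIX):
            counts[line[len(PREFIX):]] += 1
    return counts

sample = [
    "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
    "some unrelated line",
    "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
]
for message, count in count_mesa_errors(sample).most_common():
    print(count, message)
```

With the real log, the same function could be fed an open file handle for the path set in MESA_LOG_FILE, since iterating over a file yields its lines.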
Comment 11 Mircea Kitsune 2017-08-02 13:31:06 UTC
After months of careful testing and experimentation, I have discovered what seems to be the primary trigger of this freeze at last. It's not what triggers it per se, but what "rigs" the system and causes it to crash within the course of the next hours... the actual trigger is alt-tab switching between windows, or certain desktop effects playing.

The freeze is mined into the system when you disable and re-enable KDE desktop compositing. If I hit Alt + Shift + F12 to turn off desktop effects, then hit the key combo to turn them back on... there is a great chance that within a few hours the crash occurs. If I don't toggle compositing on the run and just leave it enabled after the system has started, I seem to be fine... this only happens if I turn it off and back on during runtime. It's uncertain whether anything else mines the system, but this is almost always what seems to do it for me.

Notice: I use OpenGL 3.1 for desktop compositing. I remember selecting OpenGL 2.0 long ago, but that still caused the freeze at that time. I can't use XRender on a daily basis as many effects don't work with it. No other compositor options seem to affect the problem either.

It would be highly appreciated if at least after this information, the developers and maintainers could finally look at this issue! It has taken me months to confirm this as a cause, and I really hope this information (alongside dozens of comments and logs I have posted) can finally be put to use.
Comment 12 Mircea Kitsune 2017-08-03 13:18:50 UTC
Today I discovered that even when not toggling desktop effects at runtime, the freeze can still be mined into the system. I got a crash after 1 day of uptime, no toggling of desktop compositing required.

I find it remarkable how the cause of the crash appears to have changed immediately after I made the comment above yesterday; I tested my theory that desktop effects are the root for 2 months, yet the moment I publish my observations the behavior changes in less than a day. This further makes me concerned that someone might be deliberately programming this crash using vulnerabilities in system components, solely for how strange this coincidence is. I'm still waiting for the developers to help investigate this further whatever the case, as I cannot find any explanation at this point.
Comment 13 Mircea Kitsune 2017-08-04 12:32:34 UTC
Created attachment 107072 [details]
Memtest86 screenshot

To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD. The first was during the day for 5 hours, the second was during the night for over 10 hours: The program only registered 3 passes in total, but it did not find any errors. I'll attach a picture just in case any useful information is printed there.
Comment 14 Mircea Kitsune 2017-08-05 12:24:42 UTC
Created attachment 107086 [details]
Output of "dmesg -w"

This is perhaps the most important piece of information I managed to gather on the problem thus far. If you have a technical understanding of this data, please take a look at the log and let us know what it says!

I was able to run an SSH session on my computer from another machine. In it I left the command "dmesg -w" running. I toggled desktop effects last night to provoke a crash today, which happened as expected and allowed me to conduct the test. This is basically what dmesg is seeing in realtime as the system is crashing.

I can't make sense of the information, but it definitely looks descriptive. Although the computer seemed completely frozen locally, the output continued flowing on the other machine printing new information every few seconds. I had to wait in order to catch some of the red lines in the console.
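
If the log has to be recorded on the affected machine itself rather than over SSH, buffered output may be lost when the machine freezes. The following is a minimal sketch of one way to reduce that risk, by flushing and syncing after every line. The function names record_lines and follow_dmesg and the output path are hypothetical, and reading the kernel log may require root privileges depending on the system configuration.

```python
import os
import subprocess

def record_lines(lines, path):
    """Append each line to `path` and force it to disk before reading the next one."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()              # push Python's buffer to the operating system
            os.fsync(out.fileno())   # ask the operating system to commit it to disk
            written += 1
    return written

def follow_dmesg(path):
    """Run "dmesg -w" and record its output line by line until it exits."""
    proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
    try:
        return record_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling follow_dmesg("dmesg-live.txt") would block until interrupted, so it is meant to be left running in a spare console. Syncing after every line is slow, but the kernel log is low-volume enough that this should not matter here.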
Comment 15 Mircea Kitsune 2017-08-05 13:31:45 UTC
I briefly discussed the above log (output of "dmesg -w") on IRC with someone who seemed to have an understanding of the issue. They pointed out something important which I thought to highlight:

The problem appears to start from 'radeon_vm_bo_invalidate' and is most likely a GPU locking bug. Looking at the stack trace I can see it, alongside explicit mentions of spin lock / CPU soft lockup / stall on CPU. I've also noticed a potentially important message, which although marked as a warning seems to point to a line of source code from the radeon driver:

[58857.640890] WARNING: CPU: 3 PID: 2549 at ../drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xec/0xf0 [radeon]
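
For anyone comparing several logs, the source location in such a warning can be pulled out mechanically. This is a rough sketch that assumes the line layout shown above (timestamp, CPU, PID, file:line, function+offset/size, optional module); the regular expression and the function name parse_warning are illustrative and not part of any kernel tooling.

```python
import re

# Assumption: the layout matches the WARNING line quoted above.
WARNING_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)

def parse_warning(text):
    """Return the fields of a kernel WARNING line as a dict, or None if it does not match."""
    match = WARNING_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["time"] = float(info["time"])
    for key in ("cpu", "pid", "line"):
        info[key] = int(info[key])
    return info

example = ("[58857.640890] WARNING: CPU: 3 PID: 2549 at "
           "../drivers/gpu/drm/radeon/radeon_object.c:84 "
           "radeon_ttm_bo_destroy+0xec/0xf0 [radeon]")
print(parse_warning(example))
```

Grouping the parsed results by file, line and function would show whether every crash passes through the same spot in the radeon driver.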
Comment 16 Mircea Kitsune 2017-08-07 20:52:44 UTC
Created attachment 107130 [details]
Output of "dmesg -w" (full)

Full output of "dmesg -w", recorded by running "dmesg -w > filename.txt". The previous one was incomplete as it was subject to console line limitations, cutting off the moment when the crash actually occurs. I left the command running in a different virtual console. This time the crash didn't shut down the monitor after switching to it (Control + Alt + F1), so I was able to cleanly shut down dmesg and then issue a normal reboot. I waited there for about 5 minutes before doing so, to give dmesg time to record as much information as possible. The crash appears to start at the following lines:

[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec
[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)
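
To find these events in a long log, the two message shapes above can be matched with regular expressions. This is a small sketch that assumes the exact wording of the two lines quoted here; the name parse_radeon_events is hypothetical. Reading the difference between the two fence ids as the number of fences still outstanding on the ring is an assumption, not something the log itself states.

```python
import re

# Assumption: message wording matches the two radeon lines quoted above.
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-f]+) "
    r"last fence id (0x[0-9a-f]+) on ring (\d+)\)"
)

def parse_radeon_events(lines):
    """Return ("stall", ring, msec) and ("lockup", ring, fence_gap) tuples in log order."""
    events = []
    for line in lines:
        stall = STALL_RE.search(line)
        if stall:
            events.append(("stall", int(stall.group(1)), int(stall.group(2))))
            continue
        lockup = LOCKUP_RE.search(line)
        if lockup:
            current = int(lockup.group(1), 16)
            last = int(lockup.group(2), 16)
            events.append(("lockup", int(lockup.group(3)), last - current))
    return events

log = [
    "[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec",
    "[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id "
    "0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)",
]
print(parse_radeon_events(log))
```

For the two lines above, this reports a stall of 10024 ms on ring 4 followed by a lockup on the same ring with a fence gap of 4.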
Comment 17 Mircea Kitsune 2017-08-07 21:30:12 UTC
I randomly decided to google parts of my dmesg output. I was surprised to discover that someone else has reported a very similar issue, which looks like it might have the same root as mine!

https://bugs.freedesktop.org/show_bug.cgi?id=101325

The dmesg output they provided almost perfectly matches my last log, and they also have a RadeonSI card, which further narrows down the problem. The main difference is that they experience this with Unreal Engine 4 Editor, whereas for me the trigger is the Plasma desktop.

That report seems to contain a fair amount of logs, so hopefully bringing it and this together can help produce a solution at long last.
Comment 18 Mircea Kitsune 2017-08-31 18:17:03 UTC
I have important new information. After yet more weeks of testing, I seem to have found both of the common triggers for this issue. The crash happens a few hours after either of the following actions is performed:

1 - Desktop effects are toggled at runtime. Pressing Alt + Shift + F12 twice to turn compositing off then back on will mine the system with this crash.

2 - I insert my USB stick or external drive into a USB port, mount it and access it in Dolphin, then unmount and remove it. A few hours after I've inserted / removed my drive, the freeze occurs. I suspect this has to do with the device notifier popping up in the system tray, asking what action to perform on the device or telling me the device is safe to unplug.

I'm not sure if the themes I'm using might have any relevance. Considering this is a graphics problem, I figured I'd share this info as well so others can test them if they wish. I'm using the Plasma / KWin theme Freeze with the default Breeze icons / cursor / widget style:

https://www.opendesktop.org/p/998653/
https://www.opendesktop.org/p/1002663/
Comment 19 Mircea Kitsune 2017-08-31 18:38:20 UTC
Furthermore, I suspect I now know what the culprit component is. It's very likely that the problem lies within Mesa itself, and was introduced in the switch between 13.0 and 17.0.

This was confirmed by the bug report I linked previously, which I strongly believe is related to the issue I'm experiencing here: Another person there was able to verify that their crash happens with Mesa 17 but not 13. Looking at the dates, I realize that I started experiencing this problem precisely when openSUSE Tumbleweed upgraded from Mesa 13.0 to 17.0: Mesa 17 landed in early March 2017, it was a few days later that the issues began, which I then reported the following week (08 March 2017). See my comment in the other bug for more info on this:

https://bugs.freedesktop.org/show_bug.cgi?id=101325#c22

I also seem to have confirmed that the issue only affects RadeonSI cards but not R600: My laptop has a Mobility Radeon HD 5470 card (R600) whereas my desktop has a Radeon R7 370 card (RadeonSI). I've been away for two weeks and have been using my laptop exclusively during this time, which has the exact same OS and configuration as my desktop. I was able to perform every task I do on my desktop from my laptop, including the triggers I described above... I have never experienced this freeze with the laptop.
Comment 20 Christoph Feck 2017-09-20 00:01:07 UTC
Thanks for the investigation. I fear there is nothing Plasma developers can do to prevent this issue. QtQuick uses OpenGL extensively by default.

You could try running Plasma with forced software rendering in Mesa or in QtQuick.

This way you can be sure the bug is caused by the OpenGL driver.

See http://doc.qt.io/QtQuick2DRenderer/ and https://www.mesa3d.org/envvars.html
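
As a rough sketch of what this could look like in practice: the snippet below builds an environment with LIBGL_ALWAYS_SOFTWARE=1 (the Mesa variable for forcing a software rasterizer) and QT_QUICK_BACKEND=software (the Qt Quick software backend, available from Qt 5.8; older setups using the separate 2D renderer plugin may need a different variable). The helper names are hypothetical, and which command to launch with this environment is left as an assumption for the reader.

```python
import os
import subprocess

def software_rendering_env(base=None):
    """Return a copy of the environment with software rendering forced."""
    env = dict(os.environ if base is None else base)
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"      # Mesa: always use a software renderer
    env["QT_QUICK_BACKEND"] = "software"    # Qt Quick (Qt 5.8+): software backend
    return env

def run_with_software_rendering(command):
    """Run `command` (a list of arguments) with the environment above and return its exit code."""
    return subprocess.run(command, env=software_rendering_env()).returncode

if "LIBGL_ALWAYS_SOFTWARE" not in os.environ:
    print("LIBGL_ALWAYS_SOFTWARE is not set in the current session")
```

Setting only one of the two variables at a time would help separate a driver problem (no freeze with the Mesa variable) from something specific to the Qt Quick OpenGL path (no freeze with the Qt variable). To affect the whole session, the same two variables could be exported from ~/.profile instead.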
Comment 21 Mircea Kitsune 2017-09-20 00:39:33 UTC
(In reply to Christoph Feck from comment #20)

Thanks for the suggestion. I will give it a try, provided it doesn't require messing with system components in a dangerous manner. What environment variables for Mesa should I be modifying?

Currently I'm still in the process of further narrowing down the cause; I know the crash only occurs when alt-tab switching between windows, possibly also when the plasma panel pops up... so in some form Plasma must be the trigger. Some other action a few hours prior to the crash is also required, apparently toggling desktop effects or inserting an external drive... I don't know to what extent Plasma does that.

The core issue does appear to be in Mesa, or something in the drivers if not. But in case a KDE component is setting some weird shader that causes it to freak out, that's one link in the chain that can be cut. Knowing would also make it easier to reproduce the issue so that it can be fixed where the fault lies.
Comment 22 Mircea Kitsune 2017-11-12 15:04:30 UTC
I'm sorry for having taken so long to get back to this issue: I needed to be sure that what I'm mentioning is correct, which at this point took months of verification to be certain the issue is gone for good.

The problem has finally gone away. It has not happened once during 3 months, in which I was able to achieve well over a week of uptime! It disappeared after I performed the following 3 changes on my system:

- Modifying my system GTK theme.
- Disabling KMix at startup.
- Uninstalling IBus.

I'm convinced the culprit here was IBus... more specifically its system tray icon. That icon has caused odd glitches in the past, such as making random menus pop up or crashing. It was likely also causing a graphical glitch that introduced this infinite GPU loop. As such the ingredients you should need are:

- A GCN 1.0 RadeonSI AMD card, running on the "radeon" driver.
- A KDE (Plasma 5) Linux OS.
- The IBus input system, with the option to show the system tray icon.

If others can reproduce this, please comment on the issue and let us know! If the problem does not return, I will mostly just be watching this bug from now on; I don't plan on spending days to do more odd tests... especially after receiving nearly no support from the FreeDesktop crew for almost a year, despite giving them a ton of data and despite how major this issue was.
Comment 23 Justin Zobel 2020-11-11 23:52:32 UTC
Mircea, thanks for the extensive research on this issue.

As no others have commented on this bug report and no other reports are linked to it, I think it's safe to close this bug report down. If you are happy for me to do this, please let me know.

I'm setting status to "needsinfo" pending your response, please change back to "reported" or "resolved" when you respond, thanks.
Comment 24 Mircea Kitsune 2020-11-11 23:59:38 UTC
I can't say I've seen this particular issue in quite some time. Similar little glitches, yes, especially in Wayland, but this seems to be long gone. Thanks for asking first, and feel free to close it as far as I'm concerned: If I see something like this again I'll either reopen or start a new report.