| Summary: | Kwin hangs, stops drawing the screen and starts using 100% CPU inside nvidia-glcore after modifying compositing settings |
|---|---|
| Product: | [Plasma] kwin |
| Component: | scene-opengl |
| Status: | RESOLVED FIXED |
| Severity: | normal |
| Priority: | NOR |
| Version: | unspecified |
| Target Milestone: | --- |
| Platform: | unspecified |
| OS: | Linux |
| Reporter: | Simeon Bird <bladud> |
| Assignee: | KWin default assignee <kwin-bugs-null> |
| CC: | adloconwy+kdebug, auxsvr, bladud, sergio.callegari, simonandric5, sombragris |
| Keywords: | drkonqi |
| Flags: | thomas.luebking: NVIDIA+ |
| See Also: | https://bugs.kde.org/show_bug.cgi?id=346116, https://bugs.kde.org/show_bug.cgi?id=348753 |
| Latest Commit: | 1de1e80d5077157fc25503c4699969c57929795d |
| Version Fixed In: | 5.3 |

Attachments:
- Backtrace when kwin is stuck and screen is not drawing
- Patch to 'fix' hang on deletion by not deleting object
- stop swap control before deleting sync objects
- Another patch to fix the hang by manually triggering the xcb fence
- Updated patch to fix hang by triggering fence
Description

Simeon Bird, 2015-01-30 02:26:51 UTC
*** This bug has been marked as a duplicate of bug 343543 ***

Have you been in the "compositing" kcm (where you can change the backend etc.) or the "effects" kcm (where you can switch wobbly windows etc. on and off)?

The compositing one, where you can change the backend. In Plasma 5.1 the crash was also present, but took down kwin as well.

I spoke too soon - actually changing the settings causes kwin and kded5 to freeze and take 100% of a CPU each.

Changing what settings? There's a reported eventloop recursion for kwin in the tabbox, see bug #340294. There's also a bug report for kded5, seemingly caused by the powerdevil module, see bug #337674. Neither would be related to this crash - it's something in QML - apparently rather the "new" QML context than the closed one (as the compositing kcm doesn't use it). But I've not yet checked whether the overview really uses QML.

Just now it crashed when I checked the box labelled "skip compositing for full-screen windows". I suspect that changing the backend or any of the other compositing settings would also crash (as in Plasma 5.1). If you like I can check whether it still crashes with nouveau.

*systemsettings* crashed with *that* backtrace for altering a kwin setting??

Sorry, I was imprecise. systemsettings crashes with this backtrace when exiting the compositing kcm after not changing a setting. If I do change a setting in the compositing kcm, both kwin and kded5 hang. This may involve a crash, but it may not - it hangs and there is no backtrace window. In fact when kwin hangs there is no crash at all, at least not one which creates a coredump.

When kwin hangs with 100% CPU usage, I obtained a backtrace by attaching gdb to the hung process. The top four lines looked like:

    sched_yield  /usr/lib/libc6
    ???          /usr/lib/libnvidia-glcore.so.304.125
    ???          /usr/lib/libnvidia-glcore.so.304.125
    ???          /usr/lib/libnvidia-glcore.so.304.125

and below that was kwin, which doesn't have symbols at the moment. Note that I have enabled Triple buffering in xorg.conf as suggested in https://bugs.kde.org/show_bug.cgi?id=322060. If I export __GL_YIELD="USLEEP" instead, the top lines in the backtrace are nanosleep and usleep.

Created attachment 90827 [details]
Backtrace when kwin is stuck and screen is not drawing
Created attachment 90828 [details]
Patch to 'fix' hang on deletion by not deleting object

This patch fixes the kwin hang for me, at the cost of leaking memory. It seems that this must be a bug in the nvidia driver? I found this by googling: https://www.opengl.org/discussion_boards/showthread.php/171741-NVIDIA-bug-in-glDeleteSync

Re-opened and updated title, since I realise there are two different bugs here.

Would either of those patches do as well? (1st one preferably)

    diff --git a/scene_opengl.cpp b/scene_opengl.cpp
    index 7584dd5..486af4d 100644
    --- a/scene_opengl.cpp
    +++ b/scene_opengl.cpp
    @@ -120,6 +120,8 @@ SyncObject::SyncObject()
     SyncObject::~SyncObject()
     {
    +    if (m_state == Waiting)
    +        glFinish();
         xcb_sync_destroy_fence(connection(), m_fence);
         glDeleteSync(m_sync);

    diff --git a/scene_opengl.cpp b/scene_opengl.cpp
    index 7584dd5..e095157 100644
    --- a/scene_opengl.cpp
    +++ b/scene_opengl.cpp
    @@ -412,6 +412,7 @@ SceneOpenGL::~SceneOpenGL()
         // do cleanup after initBuffer()
         SceneOpenGL::EffectFrame::cleanup();
         if (init_ok) {
    +        glFinish();
             delete m_syncManager;
             // backend might be still needed for a different scene

I tried both patches. Unfortunately they don't make a difference. (I tried both in turn, not both at once.)

Ok, let's be more explicit on our needs ;-)

    loki:/src/KDE4/kwin/> git diff scene_opengl.cpp
    diff --git a/scene_opengl.cpp b/scene_opengl.cpp
    index 7584dd5..08256be 100644
    --- a/scene_opengl.cpp
    +++ b/scene_opengl.cpp
    @@ -120,6 +120,8 @@ SyncObject::SyncObject()
     SyncObject::~SyncObject()
     {
    +//  if (m_state == Waiting)
    +    glFinish();
         xcb_sync_destroy_fence(connection(), m_fence);
         glDeleteSync(m_sync);
    @@ -412,6 +414,7 @@ SceneOpenGL::~SceneOpenGL()
         // do cleanup after initBuffer()
         SceneOpenGL::EffectFrame::cleanup();
         if (init_ok) {
    +        m_backend->makeCurrent();
             delete m_syncManager;
             // backend might be still needed for a different scene

Hm. That didn't help either (strangely).

*grrrr* Maybe we can trick it through the swap interval.... => Does this also happen if you set the tearing prevention to "none"?

If I set tearing prevention to 'none' the problem is fixed!

Created attachment 90873 [details]
stop swap control before deleting sync objects
Ok, attached is a larger (and untested!) patch.
Ah, apologies. Turning off tearing prevention doesn't actually make a difference; I was running with the patch from comment 12 by accident. I also tried the patch from comment 21 and it didn't fix it either. Thanks for your help.

*** Bug 343773 has been marked as a duplicate of this bug. ***

Well, I filed a bug report for a different use case (bug 343773) and it has been marked as a duplicate of this bug. In my case, kwin started eating 100% CPU and not drawing anything. Killing kwin_x11 and setting the rendering engine to XRender gave me a usable desktop (after a restart). This is a regression, since the bug was not present in the latest stable kwin from Plasma 4. I wish to add that I am also using the nVidia 304.125 proprietary legacy driver.

The backtrace shows that the driver is busy-waiting for something in glDeleteSync(). What do you suppose that function could be waiting for?

(In reply to Fredrik Höglund from comment #26)
> What do you suppose that function could be waiting for?

You believe a nice "xcb_flush(connection());" could do?

Sorry, that didn't work either [xcb_flush(connection()); just before the glDeleteSync]. I also tried it in conjunction with the glFinish patch above.

(In reply to Thomas Lübking from comment #27)
> (In reply to Fredrik Höglund from comment #26)
> > What do you suppose that function could be waiting for?
>
> You believe a nice "xcb_flush(connection());" could do?

You didn't answer my question.

(In reply to Fredrik Höglund from comment #29)
> You didn't answer my question.

I thought I did ;-) Since we already ruled out waiting for the retrace and it's apparently not the fence, I could only imagine it's waiting to get the context active - but I've no favorite supposition (or had). The driver is not supposed to block here, so it could be any kind of (incl. internal) mutex.

Created attachment 91085 [details]
Another patch to fix the hang by manually triggering the xcb fence.
The hang occurs when the sync is in the Ready or Resetting state. It seems that nvidia doesn't like it if the gl sync is deleted or waited on before the xcb fence has been triggered.
This patch fixes it - there is no theory behind this, just trial and error. It also seems to me that if wait() is called sufficiently quickly after trigger() there will also be a hang.
(In reply to Simeon Bird from comment #31)
> Created attachment 91085 [details]
> Another patch to fix the hang by manually triggering the xcb fence.
>
> The hang occurs when the sync is in the Ready or Resetting state. It seems
> that nvidia doesn't like it if the gl sync is deleted or waited on before
> the xcb fence has been triggered.
>
> This patch fixes it - there is no theory behind this, just trial and error.
> It also seems to me that if wait() is called sufficiently quickly after
> trigger() there will also be a hang.

Your patch is absolutely correct, but some of the comments in the code are not.

What glDeleteSync() is clearly waiting for is for the fence to become signaled, and that is never going to happen unless kwin tells the X server to trigger it. So it's not really correct to say that we need to manually trigger the fence; it's not something that can happen automatically. It would be a very serious bug if it ever did.

The comment above xcb_flush() is also not exactly correct. If the xcb_flush() call is left out, glDeleteSync() will wait for the fence to become signaled, but the trigger request will be stuck in the output buffer and never sent to the X server. So glDeleteSync() ends up waiting indefinitely.

There is no need to call wait() before deleting the fence. The purpose of wait() is to prevent the GPU from executing future draw commands before the fence is signaled, and that's not relevant here. There may be a similar hazard between calling wait() and glDeleteSync() without a glFlush() in-between as with calling trigger() and glDeleteSync() without an xcb_flush() in-between. If you want to make sure that the fence is signaled before you call glDeleteSync(), you should call finish() instead of wait().
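[Editor's note: for readers unfamiliar with GL_ARB_sync, the wait()/finish() distinction drawn in the previous comment maps onto the two GL wait calls roughly as below. This is an illustrative sketch, not kwin's actual helpers; the function names mirror those used in the discussion, and the epoxy include is an assumption.]

```cpp
#include <epoxy/gl.h>  // assumed GL loader; any header exposing GL_ARB_sync works

// wait(): queue a server-side wait. The GPU will not execute subsequent
// commands until the fence signals, but this returns immediately on the
// CPU - so it proves nothing about the fence state at glDeleteSync() time.
void insertWait(GLsync sync)
{
    glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);
    glFlush(); // submit the wait command itself (the glFlush() hazard above)
}

// finish(): block the calling thread until the fence is actually signaled
// (or the timeout expires), so it *is* safe to delete the sync afterwards.
bool finish(GLsync sync)
{
    const GLenum result = glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT,
                                           1000000000ull /* 1 s in ns */);
    return result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED;
}
```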
(In reply to Fredrik Höglund from comment #32)
> Your patch is absolutely correct, but some of the comments in the code are
> not.

Ok, I'll update the comments and post a new version. Incidentally, is this actually an nvidia bug? i.e., does the standard call for glDeleteSync not to block? If the answer is yes, should the patch be made conditional on the nvidia driver somehow?

> There is no need to call wait() before deleting the fence. The purpose of
> wait() is to prevent the GPU from executing future draw commands before
> the fence is signaled, and that's not relevant here.

What I was worried about was some other case - if trigger() is called and then insertWait() is called immediately afterwards as part of the normal draw routines. This would be a classic race condition and would lead to an occasional unrepeatable hang. But maybe it isn't possible for this to happen without something equivalent to xcb_flush?

> There may be a similar hazard between calling wait() and glDeleteSync()
> without a glFlush() in-between as with calling trigger() and glDeleteSync()
> without an xcb_flush() in-between. If you want to make sure that the fence
> is signaled before you call glDeleteSync(), you should call finish()
> instead of wait().

That's actually fine (I checked when debugging).

(In reply to Simeon Bird from comment #33)
> Ok, I'll update the comments and post a new version. Incidentally, is this
> actually an nvidia bug? i.e., does the standard call for glDeleteSync not
> to block? If the answer is yes, should the patch be made conditional on
> the nvidia driver somehow?

I would say that the OpenGL specification strongly implies that glDeleteSync should not block, but it doesn't explicitly say that it's not allowed to. My guess is that there's some limitation that prevents the NVIDIA driver from knowing when it's safe to delete the sync object without blocking on the fence. Triggering the fence before deleting it is not a big deal though, so I wouldn't bother with making it conditional on the NVIDIA driver. It's the only driver that implements the GL_EXT_x11_sync_object extension anyway.

> What I was worried about was some other case - if trigger() is called and
> then insertWait() is called immediately afterwards as part of the normal
> draw routines. This would be a classic race condition and would lead to an
> occasional unrepeatable hang. But maybe it isn't possible for this to
> happen without something equivalent to xcb_flush?

That's a good question. It shouldn't matter if the command buffer that signals the fence is submitted after the command buffer that waits for it, as long as both command buffers are able to execute concurrently. This is of course hardware dependent, but all current NVIDIA GPUs should have multiple hardware contexts. The best way to test this is probably to call glWaitSync() and glFlush(), and then tell the X server to trigger the fence. If that results in a GPU hang, we need to make sure that the X server has processed the trigger request before we call glWaitSync(). It might be a good idea to do that anyway for the sake of robustness.

(In reply to Fredrik Höglund from comment #32)
> What glDeleteSync() is clearly waiting for is for the fence to become
> signaled

(excuse my stupidity) Is this any "clearly" beyond hindsight? From out of all options, I considered this to be the least reasonable one. (Would that mean the driver is in a pre-emptive waiting condition - and what would be the runtime implications for windows that never trigger the fence?)

(In reply to Thomas Lübking from comment #35)
> Is this any "clearly" beyond hindsight?
> From out of all options, I considered this to be the least reasonable one

A fence is a synchronization primitive that is inserted in the command stream so that you can wait for it and know that all prior commands have completed. So when the function that deletes the associated sync object waits indefinitely for something, my first thought is that it is waiting for the fence. Especially when you consider that at least some of these sync objects are in an unsignaled state, and no fence command has been set in the command stream that will signal them. That Simeon's patch fixes the problem proves the theory.

> (would that mean the driver is in pre-emptive waiting condition - and what
> would be the runtime implications for windows that never trigger the fence?)

Windows don't trigger fences. The fences are triggered from Compositor::performCompositing() immediately after fetching and resetting the damage region, so we can know that the damage has landed in the window textures before we render them. When we are about to render the first damaged window, we insert a command to wait for the fence. If there are no damaged windows, we don't trigger or wait for any fences.
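[Editor's note: to make that fence lifecycle concrete, here is a hedged sketch of the per-frame protocol just described. The helpers fetchAndResetDamage() and renderDamagedWindows() are hypothetical stand-ins for kwin internals; only the ordering reflects the comment above.]

```cpp
#include <QRegion>

// Illustrative paraphrase of Compositor::performCompositing(), not kwin code.
void Compositor::performCompositing()
{
    // Fetch and reset the accumulated damage region (hypothetical helper).
    const QRegion damage = fetchAndResetDamage();

    if (!damage.isEmpty()) {
        // Trigger the fence immediately afterwards: once the X server
        // signals it, the damage is known to have landed in the window
        // textures.
        m_syncObject->trigger();     // xcb_sync_trigger_fence + xcb_flush

        // Just before rendering the first damaged window, make the GPU
        // wait on the fence so no draw command samples a stale texture.
        m_syncObject->insertWait();  // glWaitSync, see the sketch above

        renderDamagedWindows(damage);
    }
    // No damaged windows: neither trigger() nor insertWait() is called.
}
```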
Created attachment 91353 [details]
Updated patch to fix hang by triggering fence
Ok, here is a patch with updated comments. How does this get into kwin? Should I open a Review Board request, or do you just take it?
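[Editor's note: in outline, the fix described in the last few comments gives the destructor roughly this shape. This is a sketch based on the discussion, not the attached patch verbatim; it assumes trigger() amounts to an xcb_sync_trigger_fence request, and reuses the member names from the diffs quoted earlier.]

```cpp
#include <xcb/sync.h>   // xcb_sync_trigger_fence, xcb_sync_destroy_fence
#include <epoxy/gl.h>   // glDeleteSync (assumed GL loader)

SyncObject::~SyncObject()
{
    // If the XCB fence was never triggered (comment 31: the hang occurs
    // with the sync in the Ready or Resetting state), glDeleteSync()
    // below busy-waits forever in the NVIDIA driver for a signal that
    // will never come.
    if (m_state == Ready || m_state == Resetting) {
        // Ask the X server to signal the fence ...
        xcb_sync_trigger_fence(connection(), m_fence);
        // ... and flush, so the trigger request actually leaves xcb's
        // output buffer instead of leaving glDeleteSync() waiting
        // indefinitely (comment 32).
        xcb_flush(connection());
    }
    xcb_sync_destroy_fence(connection(), m_fence);
    glDeleteSync(m_sync);
}
```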
Under similar conditions I get the following backtrace while kwin_x11 uses 100% CPU:

    #0  0x00007f5a77d62a17 in sched_yield () at /lib64/libc.so.6
    #1  0x00007f5a671d5e4e in () at /usr/lib64/libnvidia-glcore.so.304.125
    #2  0x00007f5a671d68f6 in () at /usr/lib64/libnvidia-glcore.so.304.125
    #3  0x00007f5a66fb5c2f in () at /usr/lib64/libnvidia-glcore.so.304.125
    #4  0x00007f5a7795d25e in KWin::SyncObject::~SyncObject() (this=0xbd7ef8, __in_chrg=<optimized out>) at /usr/src/debug/kwin-5.2.2/scene_opengl.cpp:124
    #5  0x00007f5a779612cc in KWin::SceneOpenGL::~SceneOpenGL() (this=0xbd7ee0, __in_chrg=<optimized out>) at /usr/include/c++/4.8/array:81
    #6  0x00007f5a779612cc in KWin::SceneOpenGL::~SceneOpenGL() (this=0xbd7ee0, __in_chrg=<optimized out>) at /usr/src/debug/kwin-5.2.2/scene_opengl.cpp:242
    #7  0x00007f5a779612cc in KWin::SceneOpenGL::~SceneOpenGL() (this=0xba53a0, __in_chrg=<optimized out>) at /usr/src/debug/kwin-5.2.2/scene_opengl.cpp:415
    #8  0x00007f5a77961359 in KWin::SceneOpenGL2::~SceneOpenGL2() (this=0xba53a0, __in_chrg=<optimized out>) at /usr/src/debug/kwin-5.2.2/scene_opengl.cpp:966
    #9  0x00007f5a77948657 in KWin::Compositor::finish() (this=this@entry=0x8e8490) at /usr/src/debug/kwin-5.2.2/composite.cpp:337
    #10 0x00007f5a77948c04 in KWin::Compositor::suspend(KWin::Compositor::SuspendReason) (this=0x8e8490, reason=<optimized out>) at /usr/src/debug/kwin-5.2.2/composite.cpp:508
    #11 0x00007f5a75e4503f in QMetaObject::activate(QObject*, int, int, void**) (a=0x7fff200490b0, r=0x8a58c0, this=0xcf0ea0) at ../../src/corelib/kernel/qobject_impl.h:124
    #12 0x00007f5a75e4503f in QMetaObject::activate(QObject*, int, int, void**) (sender=0xc338a0, signalOffset=<optimized out>, local_signal_index=<optimized out>, argv=0x7fff200490b0) at kernel/qobject.cpp:3702
    #13 0x00007f5a76acc662 in QAction::triggered(bool) () at /usr/lib64/libQt5Widgets.so.5
    #14 0x00007f5a76aceb48 in QAction::activate(QAction::ActionEvent) () at /usr/lib64/libQt5Widgets.so.5

Should I file a new report?

(In reply to auxsvr from comment #38)
> Under similar conditions I get the following backtrace while kwin_x11 uses
> 100% CPU:

With 5.3? Otherwise it's most likely this bug and should be fixed/worked around in 5.3.

I'm sorry, I didn't see that 5.3 fixes this. I'm on 5.2.2.

Can someone clarify whether this is expected to be fixed in 5.3? I have just upgraded my system to Kubuntu 15.04, which uses Plasma 5 and lets either 5.2 (default) or 5.3 (via a dedicated repository) be installed. Unfortunately, with neither of them do I succeed in using kwin_x11 with OpenGL/GLX on a system with an nvidia GeForce 7025 / nForce 630, for which the nvidia 304 legacy driver reports OpenGL 2.1.

It is fixed for me on 5.3 - same driver and GLX.

> with neither of them I succeed in using kwin_x11 with opengl glx
Are you sure it's for this particular bug?
This one's caused by a hanging fence sync and apparently triggered by invoking the config module.
It should be work-a-roundable by
export KWIN_EXPLICIT_SYNC=0; kwin_x11 --replace &
Tried... it seems to work with the workaround on 5.3. So I guess it is not fixed in 5.3, or at least not in Kubuntu's 5.3...

Do you get 100% CPU load instead? If so, can you gdb into kwin and check where it hangs? If not, the sync fences may cause an "unrelated" problem for you.

After the utopic->vivid upgrade, I get the machine in over 60% iowait with zero CPU load, but I do not think this is related. Possibly it is another (and more serious) issue with Ubuntu vivid. When kwin is hung, continuously switching to a virtual console and back to the X11 screen with Alt+Fn lets one do operations in steps... The machine is using the KDE 5.3 PPA right now. Setting the environment variable makes it almost usable (apart from the other issues mentioned above, like the 60% iowait and IO-related processes getting stuck). In any case, the machine is now off and I am reinstalling it as trusty with KDE 4, or Mint with Cinnamon, soon, because I cannot afford keeping it off; so unfortunately I will not be able to do more tests or provide much further information. It is a pity Kubuntu decided to make so many changes at the same time in this upgrade, because I really do not have enough time right now to try decoupling the problems.

This bug still seems to exist in Plasma 5.3 on Arch Linux.

Let's track remaining issues with this feature and the nvidia legacy driver in bug #348753. It would be great if somebody encountering this could gdb into kwin_x11 and check where it's hanging. Usual suspects would be the glDeleteSync calls in libkwineffects/kwinglutils.cpp.
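[Editor's note: for anyone still able to reproduce on 5.3, the same technique used for the backtraces above should work: attach a debugger to the hung process, e.g. `gdb -p $(pidof kwin_x11)`, then run `thread apply all backtrace`. With kwin debug symbols installed, the frames should show whether the busy loop again sits underneath a glDeleteSync() call.]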