Summary: | in high traffic situations, KSharedDataCache corrupts its cache file and crashes | ||
---|---|---|---|
Product: | [Unmaintained] kdelibs | Reporter: | Stefan Majewsky <majewsky> |
Component: | kshareddatacache | Assignee: | Michael Pyne <mpyne> |
Status: | RESOLVED WORKSFORME | ||
Severity: | crash | CC: | cpigat242, majewsky, mpyne, whynot |
Priority: | NOR | ||
Version: | 4.5 | ||
Target Milestone: | --- | ||
Platform: | openSUSE | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
New crash information added by DrKonqi
Avoid a poorly-handled case in KSharedDataCache when there is between 100-150% of requested item size available, and handling out-of-room
Valgrind output for the crash mentioned in comment 3
New crash information added by DrKonqi |
Description
Stefan Majewsky
2010-07-04 13:00:28 UTC
DrKonqi put this under Kolf, but it's a KSharedDataCache bug.
Well, I will say that debugging crashes in the cache while I was developing it mostly turned into an exercise in "thought debugging", since the corruption very often happens long before the crash. Valgrind includes the helgrind and drd tools for debugging broken threading; would it be possible to use one of the two and see if you can narrow the issue down further? I ask because I did use some test applications that "stress test" the cache by loading all possible icons as fast as possible, but they were single-process only. With the whole desktop using KSharedDataCache, it's possible there's a race in conjunction with high load that would eventually lead to corruption.
Created attachment 50563 [details]
New crash information added by DrKonqi
kgvtestbed (0) on KDE Platform 4.5.00 (KDE 4.5.0) using Qt 4.6.3
Here is another backtrace, which I obtained from a testbed for the new kgamevisuals library, of which KGameRenderer is a fundamental part. This testing application uses KGameRenderer to render the scene background of KDiamond. The crash occurred during a window resizing operation, so again a situation with high load. I can send you the source code if you wish.
-- Backtrace (Reduced):
#7 0xb774eec7 in SharedMemory::defragment (this=0xb27e9000) at /usr/src/debug/kdelibs-4.5.0/kdecore/util/kshareddatacache.cpp:547
#8 0xb774fb8b in SharedMemory::removeUsedPages (this=0xb27e9000, numberNeeded=372) at /usr/src/debug/kdelibs-4.5.0/kdecore/util/kshareddatacache.cpp:719
#9 0xb7721ba6 in KSharedDataCache::insert (this=0x813fd40, key=..., data=...) at /usr/src/debug/kdelibs-4.5.0/kdecore/util/kshareddatacache.cpp:1289
#10 0xb6cd0e3e in KImageCache::insertImage (this=0x813fd40, key=..., image=...) at /usr/src/debug/kdelibs-4.5.0/kdeui/util/kimagecache.cpp:80
#11 0xb75362f6 in KgvRendererPrivate::jobFinished (this=0x8139ff0, job=0x82697f0, isSynchronous=false) at /home/stefan/Code/kde/libkgame/visuals/rendering/kgvrenderer.cpp:502
This backtrace does look more useful; thanks very much for continuing to look into this. I would appreciate you sending the source code, and if you could CC michael.pyne@gmail.com, that would greatly help me debug it, since I have reduced Internet connectivity.
OK, the testcase is very useful. I can also reproduce the output in question, although not the crash itself. I don't have time to debug further tonight, and I will be at work until Saturday morning, so hopefully I can track down the underlying issue over the weekend.
Created attachment 50839 [details]
Avoid a poorly-handled case in KSharedDataCache when there is between 100-150% of requested item size available, and handling out-of-room
Stefan, what I've found so far is that there are a couple of logic errors in the removeUsedPages handling in KSharedDataCache:
1. Sometimes more pages are requested to be removed than is possible. removeUsedPages already checks for this, but it emits an internal error warning, so this is fixed with a qMin macro to ensure a sane request.
2. The code path in question is reached when there are not enough free pages available, or when there are enough free pages but they are too heavily fragmented to handle the request.
First, we check whether simply defragmenting could feasibly free up the space, by checking for 150% of the required free space.
BUT, in the else branch, when making the removeUsedPages call, we assume that the space available was less than 100%. When it is actually between 100% and 150%, the result is a negative number of pages requested from removeUsedPages. This gets cast to uint, which results in an enormous number of pages to free. I'm not sure how this would actually cause damage, but it seems possible.
I haven't been able to get the testcase to crash yet so this may not be the end of the story, but if you can test with this patch applied to kdelibs it may help get closer to the cause.
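To illustrate the negative page request described above, here is a minimal C++ sketch. It is not the kdelibs code: the function names and parameters are hypothetical, fixed-width integers are assumed in place of int/uint, and std::min stands in for the qMin call mentioned in point 1.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of the buggy request: the shortfall is computed as a
// signed value and then converted to an unsigned page count. If more pages
// are free than needed, the negative shortfall wraps to a huge number.
inline std::uint32_t naivePagesToRemove(std::int32_t pagesNeeded,
                                        std::int32_t pagesFree)
{
    return static_cast<std::uint32_t>(pagesNeeded - pagesFree);
}

// Hypothetical guarded variant: never ask for a negative number of pages,
// and never ask for more pages than the page table holds.
inline std::uint32_t clampedPagesToRemove(std::int32_t pagesNeeded,
                                          std::int32_t pagesFree,
                                          std::uint32_t pageTableSize)
{
    if (pagesFree >= pagesNeeded) {
        return 0; // enough room already; at most a defragment is required
    }
    const std::uint32_t shortfall =
        static_cast<std::uint32_t>(pagesNeeded - pagesFree);
    return std::min(shortfall, pageTableSize);
}
```

With 100 pages needed and 120 free, the naive version requests 4294967276 pages, while the clamped version requests none.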
I'll try to test the patch, though it is not likely to happen too soon. I'm usually working with a packaged kdelibs, so I'll have to set up the build first.
SVN commit 1167620 by mpyne:
Attempt to fix a couple of KSharedDataCache bugs.
1. Stefan Majewsky has found a crasher under heavy cache load (bug 243573). The testcase he provided did not crash for me, but did reveal that it was possible to attempt to free a negative amount of pages, since we'd end up in the "not enough room" case and then try to free extra room. It was possible that the way the extra space was added caused more pages to be requested freed than were allocated. AFAICS this should not by itself cause a crash, as removeUsedPages notes that condition and fails. But it's still a bug. Now the number of pages requested is capped to page table size before figuring out the number of pages that are in use.
2. Parker Coates found a crash bug involving a cache sized just large enough to hold exactly 1 page (the page size was == cache size). The number of entries the cache supports is page table size / 2. 1 / 2 in integer math is 0, therefore the entry table size was 0, and the cache eventually crashed when trying to divide by the possible number of entries. This is fixed by ensuring that the cache is sized to support at least a minimum number of different pages. This means that the expectedItemSize parameter becomes more important in determining overall minimum cache size. With default settings (4KiB page size, 256 pages) the minimum cache size is 1MiB.
I also added some debugging error messages to try and more easily diagnose these kinds of logic errors in the future.
This commit is for KDE Platform 4.6. I intend to backport to 4.5.1 as well.
CCBUG:243573
M +22 -2 kshareddatacache.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=1167620
SVN commit 1167621 by mpyne:
Backport two KSharedDataCache bugfixes to KDE Platform 4.5.1.
1. Do not attempt to free more pages than are actually allocated. This might fix bug 243573, but I cannot get the testcase to crash. (I even disabled desktop effect for maximum resizing speed ;)
2. Force the cache to have a certain minimum number of pages (currently 256) to avoid crashes if the cache contains only a single page.
The commit log for the trunk commit, r1167620, has the detailed reasoning.
CCBUG:243573
M +22 -2 kshareddatacache.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=1167621
I just installed the 4.5.1 update, yet the crash in SharedMemory::defragment from comment 3 is still reproducible here.
Created attachment 51150 [details]
Valgrind output for the crash mentioned in comment 3
Here is the Valgrind output for my testing application which produces the crash. I have stripped some unrelated memleaks in QTransform to make it more readable for you.
SVN commit 1180814 by mpyne:
Do not index past the very last page in the cache while defragmenting. Noted during a code review while bored in class (mental note, don't print kshareddatacache.cpp single-sided in the future).
Also, while testing an upcoming patch by Alberto Villa I had the opportunity to test the version-change-detection code. Worked fine going from 1 to >1, but when reverting back to 1 the cacheSize came back as 0 somehow, which caused a hang. So check for the version being wrong but not 0 now.
This has been tested with Stefan Majewsky's KGameRenderer testbed, and might even fix a crash in SharedMemory::defragment. I don't see how this error would cause the Valgrind output Stefan noted, but it might indirectly cause a later operation to fail I suppose.
This commit applies to KDE Platform 4.6. I will also backport to 4.5.2.
CCBUG:243573
M +11 -5 kshareddatacache.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=1180814
SVN commit 1180816 by mpyne:
Backport two KSharedDataCache fixes to KDE Platform 4.5.2. The initial commit was r1180814, and fixes the following:
* Fix an error in defragmentation that could cause corruption of the index table if the very last page in the cache was in use during defragmentation. This very possibly fixes bug 243573.
* A more reliable (well, in theory) check is performed to tell if the cache version has unexpectedly changed for a cache that was actually in use.
CCBUG:243573
M +11 -5 kshareddatacache.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=1180816
Stefan, when you get a chance could you see if you're still able to reproduce this crash? Otherwise I'm going to assume I found it for real this time. ;)
Created attachment 52327 [details]
New crash information added by DrKonqi
tagarotestbed (0) on KDE Platform 4.5.2 (KDE 4.5.2) using Qt 4.7.0
I just updated to 4.5.2. I did not encounter the defragment() crash in my first test, yet the original crash is reproducible in the testbed application (again by continued resizing). Because the backtrace is slightly different now, I attach it again.
-- Backtrace (Reduced):
#7 operator int (this=0x81c0798, key=..., destination=0xbf9a7dec) at /usr/include/QtCore/qbasicatomic.h:85
#8 cachePageSize (this=0x81c0798, key=..., destination=0xbf9a7dec) at /usr/src/debug/kdelibs-4.5.2/kdecore/util/kshareddatacache.cpp:274
#9 pageTableSize (this=0x81c0798, key=..., destination=0xbf9a7dec) at /usr/src/debug/kdelibs-4.5.2/kdecore/util/kshareddatacache.cpp:431
#10 indexTableSize (this=0x81c0798, key=..., destination=0xbf9a7dec) at /usr/src/debug/kdelibs-4.5.2/kdecore/util/kshareddatacache.cpp:438
#11 findNamedEntry (this=0x81c0798, key=..., destination=0xbf9a7dec) at /usr/src/debug/kdelibs-4.5.2/kdecore/util/kshareddatacache.cpp:593
Therefore changing this back from NEEDSINFO to NEW.
After some more trials, I think that the defragment() crash is gone. On a nearly unrelated note, it's a bit irritating that kDebug() attributes KSharedDataCache's debug messages to KIconLoader.
Thanks for the update, although at this point I'm despairing of finding an issue in KSharedDataCache itself. I'm hoping it's not an underlying Qt bug, or some kind of improper usage of QAtomicInt. But I don't see any way for the cache's pageSize attribute to be 0 by this point in execution. It gets set at only one spot. I think what I can do is to verify the sanity of the various cache operands when an existing cache is mapped but even that's not foolproof if some logic error is causing that particular variable to get corrupted.
Git commit 561e6494bdd9a02cc8feef649f7dbbd40a1456c3 by Michael Pyne.
Committed on 20/05/2012 at 00:13.
Pushed by mpyne into branch 'KDE/4.8'.
kshareddatacache: Validate cache page size.
This commit ensures that the cache page size is actually a power-of-2 and within the band of possible sizes that could possibly have been set. If this is not the case the cache is assumed corrupted and reset. This should help with any cache-corruption bugs caused by a wrong cache page size (although these don't exactly make themselves obvious). More fixes to follow...
This one /should/ fix 274252 outright and may be of interest to several others.
Related: bug 274252, bug 249362, bug 253665, bug 281217, bug 297815, bug 293954, bug 293447, bug 270915, bug 255233
FIXED-IN:4.8.4
M +26 -1 kdecore/util/kshareddatacache.cpp
http://commits.kde.org/kdelibs/561e6494bdd9a02cc8feef649f7dbbd40a1456c3
Git commit ca2a6a59784232857a35b313adc9599efb87bd5e by Michael Pyne.
Committed on 21/05/2012 at 01:19.
Pushed by mpyne into branch 'KDE/4.8'.
kshareddatacache: Adopt KSDCCorrupted for exceptional errors.
This involves converting many present assertions (which crash no matter what) and error-code return values (which have to be checked everywhere the return value is used at) into using the KSDCCorrupted exception. The nice thing about using the exception is that it can be trapped and handled so that it does not cause an application crash. There's still a bit more to do -- the end goal is that all accesses to shm, no matter how minor, are vetted beforehand to ensure it won't cause a page fault or bus violation.
Related: bug 249362, bug 253665, bug 281217, bug 297815, bug 293954, bug 293447, bug 270915, bug 255233
M +49 -34 kdecore/util/kshareddatacache.cpp
http://commits.kde.org/kdelibs/ca2a6a59784232857a35b313adc9599efb87bd5e
Thank you for the crash report. As it has been a while since this was reported, can you please test and confirm if this issue is still occurring or if this bug report can be marked as resolved. I have set the bug status to "needsinfo" pending your response, please change back to "reported" or "resolved/worksforme" when you respond, thank you.
Dear Bug Submitter, This bug has been in NEEDSINFO status with no change for at least 15 days. Please provide the requested information as soon as possible and set the bug status as REPORTED. Due to regular bug tracker maintenance, if the bug is still in NEEDSINFO status with no change in 30 days the bug will be closed as RESOLVED > WORKSFORME due to lack of needed information. For more information about our bug triaging procedures please read the wiki located here: https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging If you have already provided the requested information, please mark the bug as REPORTED so that the KDE team knows that the bug is ready to be confirmed. Thank you for helping us make KDE software even better for everyone!
This bug has been in NEEDSINFO status with no change for at least 30 days. The bug is now closed as RESOLVED > WORKSFORME due to lack of needed information. For more information about our bug triaging procedures please read the wiki located here: https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging Thank you for helping us make KDE software even better for everyone! |
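For readers following the "Validate cache page size" commit (561e6494) in this thread, the check it describes can be sketched roughly as below. This is an illustration under stated assumptions, not the code from kshareddatacache.cpp: the bounds (512 bytes to 256 KiB), the CacheHeader struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bounds for illustration only; the real band of possible page
// sizes is defined in kshareddatacache.cpp.
constexpr std::uint32_t kMinPageSize = 512;
constexpr std::uint32_t kMaxPageSize = 256 * 1024;

// A page size is plausible if it is a non-zero power of two inside the band.
constexpr bool isPlausiblePageSize(std::uint32_t pageSize)
{
    return pageSize != 0
        && (pageSize & (pageSize - 1)) == 0
        && pageSize >= kMinPageSize
        && pageSize <= kMaxPageSize;
}

// Hypothetical stand-in for the shared-memory header; only the field
// relevant to this check is modelled.
struct CacheHeader {
    std::uint32_t pageSize;
};

// Treats an implausible page size as corruption and resets it.
// Returns true if a reset was performed.
inline bool resetIfCorrupted(CacheHeader &header, std::uint32_t defaultPageSize)
{
    assert(isPlausiblePageSize(defaultPageSize));
    if (isPlausiblePageSize(header.pageSize)) {
        return false;
    }
    header.pageSize = defaultPageSize;
    return true;
}
```

A page size of 0, as seen in the cachePageSize backtrace above, fails the power-of-two test, so a later division by the page size would be avoided by resetting first.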