Bug 305023 - Repeatable crash in KSharedDataCache::Private::lock when first KDE app starts
Summary: Repeatable crash in KSharedDataCache::Private::lock when first KDE app starts
Status: RESOLVED FIXED
Alias: None
Product: kdelibs
Classification: Unclassified
Component: kshareddatacache (show other bugs)
Version: 4.9
Platform: OpenBSD Packages OpenBSD
: NOR crash (vote)
Target Milestone: ---
Assignee: kdelibs bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-12 14:09 UTC by Vadim Zhukov
Modified: 2012-09-03 22:30 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In: 4.9.2


Attachments
File inter-process lockingin KSharedDataCache (14.63 KB, patch)
2012-08-17 08:18 UTC, Vadim Zhukov
Details
Konqueror starting output (43.18 KB, text/plain)
2012-08-17 08:20 UTC, Vadim Zhukov
Details
File inter-process lockingin KSharedDataCache, fixed comments (15.48 KB, patch)
2012-08-17 09:05 UTC, Vadim Zhukov
Details
Simple spinlock-based fallback lock (3.25 KB, patch)
2012-08-18 20:08 UTC, Michael Pyne
Details
KShareDataCache locking overhaul (35.28 KB, patch)
2012-08-22 19:44 UTC, Vadim Zhukov
Details
KSharedDataCache locking overhaul v1.1 (35.75 KB, patch)
2012-08-22 19:57 UTC, Vadim Zhukov
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Vadim Zhukov 2012-08-12 14:09:14 UTC
When first time starting a KDE app (or after killing all of them, including kdeinit processes), KSharedDataCache crashes in sem_timedwait(3). After restarting an app it and others load fine.

Backtrace:
#0  _atomic_lock (lock=0x7c0f7340) at /usr/src/lib/librthread/arch/i386/_atomic_lock.c:20
#1  0x0bd0966f in _sem_wait (sem=0x7c0f7340, tryonly=0, abstime=0xcfbdcb1c, delayed_cancel=0x2bd05c78) at /usr/src/lib/librthread/rthread_sem.c:36
#2  0x0bd097ff in sem_timedwait (semp=0x8be68008, abstime=0xcfbdcb1c) at /usr/src/lib/librthread/rthread_sem.c:197
#3  0x09b92980 in semaphoreTimedLock::lock (this=0x841fe9f0) at kshareddatacache_p.h:255
#4  0x09b92cb7 in KSharedDataCache::Private::lock (this=0x7f7d0a40) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdecore/util/kshareddatacache.cpp:1233
#5  0x09b9494e in KSharedDataCache::Private::CacheLocker::cautiousLock (this=0xcfbdcc58) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdecore/util/kshareddatacache.cpp:1256
#6  0x09b949a7 in CacheLocker (this=0xcfbdcc58, _d=0x7f7d0a40) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdecore/util/kshareddatacache.cpp:1279
#7  0x09b9095b in KSharedDataCache::find (this=0x841fe350, key=@0xcfbdce04, destination=0xcfbdccd8) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdecore/util/kshareddatacache.cpp:1631
#8  0x0c9a68bd in KIconLoaderPrivate::findCachedPixmapWithPath (this=0x81adf500, key=@0xcfbdce04, data=@0xcfbdcdd4, path=@0xcfbdcd7c)
    at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/icons/kiconloader.cpp:861
#9  0x0c9ab2f0 in KIconLoader::loadIcon (this=0x841fe3c0, _name=@0xcfbdced8, group=KIconLoader::Small, size=16, state=0, overlays=@0xcfbdced4, path_store=0x0, canReturnNull=false)
    at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/icons/kiconloader.cpp:1225
#10 0x0c9adc7f in SmallIcon (name=@0xcfbdced8, force_size=0, state=0, overlays=@0xcfbdced4) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/icons/kiconloader.cpp:1632
#11 0x0cb0e351 in KLineEdit::updateClearButtonIcon (this=0x88198b60, text=@0xcfbdcf0c) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/widgets/klineedit.cpp:352
#12 0x0cb0edf9 in KLineEdit::setClearButtonShown (this=0x88198b60, show=true) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/widgets/klineedit.cpp:298
#13 0x0cadca21 in KComboBox::setEditable (this=0x89095700, editable=true) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/widgets/kcombobox.cpp:393
#14 0x0cade04e in KComboBox (this=0x89095700, rw=Variable "rw" is not available.
) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/widgets/kcombobox.cpp:67
#15 0x0cafee74 in KHistoryComboBox (this=0x89095700, parent=0x0) at /usr/ports/pobj/kdelibs-4.9.0/kdelibs-4.9.0/kdeui/widgets/khistorycombobox.cpp:67
#16 0x0edeec26 in KonqCombo (this=0x89095700, parent=0x0) at /usr/ports/pobj/kde-baseapps-4.9.0/kde-baseapps-4.9.0/konqueror/src/konqcombo.cpp:121
#17 0x0edfee82 in KonqMainWindow::initCombo (this=0x7eab0e00) at /usr/ports/pobj/kde-baseapps-4.9.0/kde-baseapps-4.9.0/konqueror/src/konqmainwindow.cpp:2927
#18 0x0ee19bb4 in KonqMainWindow (this=0x7eab0e00, initialURL=@0xcfbdd160, xmluiFile=@0xcfbdd184) at /usr/ports/pobj/kde-baseapps-4.9.0/kde-baseapps-4.9.0/konqueror/src/konqmainwindow.cpp:260
#19 0x0edca155 in KonqMisc::createBrowserWindowFromProfile (_path=@0xcfbdd3c0, _filename=@0xcfbdd394, url=@0xcfbdd358, req=@0xcfbdd2dc, openUrl=true)
    at /usr/ports/pobj/kde-baseapps-4.9.0/kde-baseapps-4.9.0/konqueror/src/konqmisc.cpp:149
#20 0x0ee54e39 in kdemain (argc=2, argv=0xcfbdd518) at /usr/ports/pobj/kde-baseapps-4.9.0/kde-baseapps-4.9.0/konqueror/src/konqmain.cpp:167
#21 0x1c000a92 in main (argc=538189740, argv=0x0) at /usr/ports/pobj/kde-baseapps-4.9.0/build-i386-debug/konqueror/src/konqueror_dummy.cpp:3

Looks like use-after-free bug, but I'm not an expert in KDE internals. Thanks in advance for any clues; I'm slowly moving through code on my own now.
Comment 1 Vadim Zhukov 2012-08-12 14:19:42 UTC
And here is console output, if that matters:

$ konqueror --nocrashhandler 
konqueror(18062)/KSharedDataCache ensureFileAllocated: This system misses support for posix_fallocate() -- ensure this partition has room for at least 10526816 bytes. 
konqueror(18062)/kdecore (KSycoca) KSycocaPrivate::checkDatabase: We have no database.... launching kdeinit
konqueror(18062)/kdecore (kdelibs) KToolInvocation::klauncher: klauncher not running... launching kdeinit
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_klauncher.so
kdeinit4: Launched KLauncher, pid = 17080, result = 0
Connecting to deprecated signal QDBusConnectionInterface::serviceOwnerChanged(QString,QString,QString)
klauncher(17080)/kio (KIOConnection) KIO::ConnectionServer::listenForRemote: Listening on  "local:/tmp/ksocket-pers/klauncherQ17080.slave-socket"
kdeinit4: opened connection to :0
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kded4.so
kdeinit4: Launched KDED, pid = 26715 result = 0
QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
kded(4356)/KSharedDataCache ensureFileAllocated: This system misses support for posix_fallocate() -- ensure this partition has room for at least 10526816 bytes. 
kdeinit4: Got EXT_EXEC '/usr/local/bin/kbuildsycoca4' from launcher.
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kbuildsycoca4.so
klauncher(17080)/kio (KLauncher) KLauncher::processRequestReturn: "/usr/local/bin/kbuildsycoca4" (pid 30192) up and running.
kbuildsycoca4 running...
kbuildsycoca4(30192)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
kbuildsycoca4(30192) KBuildSycoca::checkTimestamps: checking file timestamps
kbuildsycoca4(30192) KBuildSycoca::checkTimestamps: timestamps check ok
kbuildsycoca4(30192) kdemain: Emitting notifyDatabaseChanged ()
kdeinit4: PID 30192 terminated.
kded(4356)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
klauncher(17080)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
kdeinit4: Got EXEC_NEW 'kconf_update' from launcher.
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kconf_update.so
klauncher(17080)/kio (KLauncher) KLauncher::processRequestReturn: "kconf_update" (pid 5629) up and running.
kdeinit4: PID 5629 terminated.
kdeinit4: PID 26715 terminated.
konqueror(18062)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
Segmentation fault (core dumped)
Comment 2 Michael Pyne 2012-08-13 00:10:46 UTC
Vadim,

Thanks for the bug report. Based on what you have described, I would assume that the semaphoreLock::initialize() method in kshareddatacache_p.h doesn't make it to calling the sem_init() calls.

This should only be possible if the sysconf(_SC_SEMAPHORES) call earlier returns a value indicating that semaphores are not supported. I suppose it's possible the exact check being used is incorrect (it's hard to tell which return values for success should be expected at runtime from the POSIX man pages).

What you can do is is remove that check (or alter it to always pass) and see if you get further. Unfortunately there isn't a provision for downgrading a lock to a more generic type once the autodetection in findBestSharedLock (also in kshareddatacache_p.h) has been done.
Comment 3 Vadim Zhukov 2012-08-13 14:03:05 UTC
Thank you for quick answer.

I've checked sem_init() calls, they succeed. Here is a new console output dump, with some introspection kDebug() calls added:

konqueror(14578)/KSharedDataCache ensureFileAllocated: This system misses support for posix_fallocate() -- ensure this partition has room for at least 10526816 bytes. 
konqueror(14578)/kdeui (Wallet) semaphoreLock::initialize: sem_init(&m_semaphore, 0, 1) succeeded
konqueror(14578)/kdecore (KSycoca) KSycocaPrivate::checkDatabase: We have no database.... launching kdeinit
konqueror(14578)/kdecore (kdelibs) KToolInvocation::klauncher: klauncher not running... launching kdeinit
konqueror(14578)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "kdeinit4" , pstr,  QFlags() ) called
unnamed app(19283)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "klauncher" , pstr,  QFlags() ) called
unnamed app(19283)/kdecore (kdelibs) KStandardDirs::findExe: findExe(): returning libexec item  "/usr/local/libexec/klauncher"
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_klauncher.so
kdeinit4: Launched KLauncher, pid = 1456, result = 0
Connecting to deprecated signal QDBusConnectionInterface::serviceOwnerChanged(QString,QString,QString)
klauncher(1456)/kio (KIOConnection) KIO::ConnectionServer::listenForRemote: Listening on  "local:/tmp/ksocket-pers/klauncherlA1456.slave-socket"
kdeinit4: opened connection to :0
unnamed app(19283)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "kded4" , pstr,  QFlags() ) called
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kded4.so
kdeinit4: Launched KDED, pid = 29327 result = 0
QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
kded(27330)/KSharedDataCache ensureFileAllocated: This system misses support for posix_fallocate() -- ensure this partition has room for at least 10526816 bytes. 
kded(27330)/kdeui (Wallet) semaphoreLock::initialize: sem_init(&m_semaphore, 0, 1) succeeded
kded(27330)/KSharedDataCache KSharedDataCache::insert: Overwriting existing cached entry due to collision.
kded(27330)/KSharedDataCache KSharedDataCache::insert: Overwriting existing cached entry due to collision.
kded(27330)/KSharedDataCache KSharedDataCache::insert: Overwriting existing cached entry due to collision.
kded(27330)/KSharedDataCache KSharedDataCache::insert: Overwriting existing cached entry due to collision.
kded(27330)/KSharedDataCache KSharedDataCache::insert: Overwriting existing cached entry due to collision.
kded(27330)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "drkonqi" , pstr,  QFlags() ) called
kded(27330)/kdecore (kdelibs) KStandardDirs::findExe: findExe(): returning libexec item  "/usr/local/libexec/drkonqi"
kded(27330)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "kbuildsycoca4" , pstr,  QFlags() ) called
kdeinit4: Got EXT_EXEC '/usr/local/bin/kbuildsycoca4' from launcher.
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kbuildsycoca4.so
klauncher(1456)/kio (KLauncher) KLauncher::processRequestReturn: "/usr/local/bin/kbuildsycoca4" (pid 5165) up and running.
kbuildsycoca4 running...
kbuildsycoca4(5165)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
kbuildsycoca4(5165) KBuildSycoca::checkTimestamps: checking file timestamps
kbuildsycoca4(5165) KBuildSycoca::checkTimestamps: timestamps check ok
kbuildsycoca4(5165) kdemain: Emitting notifyDatabaseChanged ()
kdeinit4: PID 5165 terminated.
kded(27330)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
klauncher(1456)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
kdeinit4: Got EXEC_NEW 'kconf_update' from launcher.
unnamed app(19283)/kdecore (kdelibs) KStandardDirs::findExe: findExe( "kconf_update" , pstr,  QFlags() ) called
unnamed app(19283)/kdecore (kdelibs) KStandardDirs::findExe: findExe(): returning libexec item  "/usr/local/libexec/kconf_update"
kdeinit4: preparing to launch /usr/local/lib/libkdeinit4_kconf_update.so
klauncher(1456)/kio (KLauncher) KLauncher::processRequestReturn: "kconf_update" (pid 22647) up and running.
kdeinit4: PID 22647 terminated.
kdeinit4: PID 29327 terminated.
konqueror(14578)/kdecore (KSycoca) KSycocaPrivate::openDatabase: Trying to open ksycoca from "/var/tmp/kdecache-pers/ksycoca4"
Segmentation fault (core dumped) 

BTW, OpenBSD doesn't support shared semaphores; sem_init() returns -1 with EPERM when trying to do this (documented in it's man page too).

Now I want to add trace kDebug() to every method involved, to see what's happening. Could you tell, please, is there any simple method to print backtrace to text console?
Comment 4 Michael Pyne 2012-08-13 22:35:27 UTC
> BTW, OpenBSD doesn't support shared semaphores; sem_init() returns -1 with EPERM when trying to do this (documented in it's man page too).

That's unfortunate, although the code already prefers shared mutexes instead (although I will assume those are also not supported).

> Now I want to add trace kDebug() to every method involved, to see what's happening. Could you tell, please, is there any simple method to print backtrace to text console?

There is a debugging function kBacktrace(), which might work as long as it's supported on OpenBSD. It is used like:

kDebug(ksdcArea()) << kBacktrace();

One other interesting thing to try would be to alter the findBestSharedLock function I mentioned to immediately return the value LOCKTYPE_SPINLOCK and omit any further processing (to force the generic lock to be used).

If you delete the existing .kcache files (should be located in $(kde4-config --path cache) ) and then re-run the new cache files should be created using the generic CPU-wasteful (but process shared) spinlocks. If that works let me know and I can make it the default for OpenBSD while we investigate superior solutions.
Comment 5 Vadim Zhukov 2012-08-14 08:21:42 UTC
> > BTW, OpenBSD doesn't support shared semaphores; sem_init() returns -1 with EPERM when trying to do this (documented in it's man page too).
> 
> That's unfortunate, although the code already prefers shared mutexes instead (although I will assume those are also not supported).

Shared mutexes - not, but shared read-write locks exist; as far as I understand this should be even better. I'll patch the code then.

> There is a debugging function kBacktrace(), which might work as long as it's supported on OpenBSD. It is used like:
> 
> kDebug(ksdcArea()) << kBacktrace();

Thanks a lot!

> One other interesting thing to try would be to alter the findBestSharedLock function I mentioned to immediately return the value LOCKTYPE_SPINLOCK and omit any further processing (to force the generic lock to be used).
> 
> If you delete the existing .kcache files (should be located in $(kde4-config --path cache) ) and then re-run the new cache files should be created using the generic CPU-wasteful (but process shared) spinlocks. If that works let me know and I can make it the default for OpenBSD while we investigate superior solutions.

AFAIK, this is how it worked before switching from userland threads to kernel ones; looks like this is the reason it didn't triggered the problem before and also the reason of kded's appetite (25-30% CPU).

I'll try to go with shared rwlocks and see how it works. Thanks again for your input!
Comment 6 Vadim Zhukov 2012-08-17 08:18:34 UTC
Created attachment 73250 [details]
File inter-process lockingin KSharedDataCache

At first, I should apologize: OpenBSD does not support process-shared rwlocks, I didn't read manual page thoughly. Sorry.

After some thoughts I found another way to implement process-shared locking, using file locking mechanisms. As you probably know, there are 3 more or less generic interfaces: fcntl(), flock() and lockf(). Unfortunately, they may behave differently on different systems. I found this blog post - http://apenwarr.ca/log/?m=201012#13 - quiet interesting. And now there is a new lock type - LOCKTYPE_FILE, implemented either via flock() or (if previouse was not found) lockf(). Actual locking is two-step process: first, we acquire simple thread lock (pthreads mutex), then call flock()/lockf(). Unlock is done in backwards. This way we: 1) avoid situation when single process constantly locks the cache; 2) play safe. :)

But enough words, here is a patch. Both flock() and lockf() paths tested on OpenBSD-CURRENT.
Comment 7 Vadim Zhukov 2012-08-17 08:20:59 UTC
Created attachment 73251 [details]
Konqueror starting output

And here is a sample KDE app output - no crashes as you can see.
Comment 8 Vadim Zhukov 2012-08-17 09:05:33 UTC
Created attachment 73252 [details]
File inter-process lockingin KSharedDataCache, fixed comments

Oops, fixing one comment and one debug printout.
Comment 9 Michael Pyne 2012-08-18 20:08:57 UTC
Created attachment 73282 [details]
Simple spinlock-based fallback lock

Vadim I have to apologize. When I has asked whether forcing the fallback spinlock to enable had helped on OpenBSD, I had assumed I had actually pushed that code out to the KDE repository!

Could you please test and see if this helps? It was actually written for Mac OS X, not OpenBSD, and so might have to go in anyways, and avoids the concern with requiring thread mutexes in addition to the cache lock (which I'm impressed you noticed on the first patch!)

If performance is unacceptable with the spinlock we can look further at using lockf/flock, but I would estimate that most of the time the spinlock would result in less time and maybe even less power usage, since when there's no or little contention we don't have to jump across the user/kernel boundary multiple times to enforce locking. And the article you reference notes issues with seemingly every possible combination of file locking so I'd rather not go that direction if feasible (great article btw)
Comment 10 Vadim Zhukov 2012-08-18 21:36:50 UTC
Thank you for review, Michael.

Yes, spin locks look like a lot nicer. Unfortunately your implementation
makes the code hardware dependent, as those intristincs are supported on
x86 mostly, if I recall correctly. See this bug report for example:
https://bugs.php.net/bug.php?id=55874 . I'll try your patch on OpenBSD
anyway and report about results, it's definitely better than going to file
locking. Maybe there are other better options for spin locks than used in
your patch?
Comment 11 Michael Pyne 2012-08-18 23:01:51 UTC
Hmm, I was assuming it made it GCC (and compatibles) dependent, which is not ideal but better than x86-only.

The alternative I tried at first was to use Qt's own atomic intrinsics (QBasicAtomicInt), but since that type is not considered "Plain Old Data" by C++03 (and therefore needs a constructor), it cannot be made a part of the union type SharedLock.

It could probably be made to work by using a char array of sizeof(QBasicAtomicInt) length and some reinterpret_cast<> hackery but I was trying to avoid that too since I'm not sure I'll avoid undefined behavior.
Comment 12 Vadim Zhukov 2012-08-22 11:04:20 UTC
Michael, I've finally managed to test your diff, sorry for delay.

Some thoughts:

1. Your locks are not actually spinlocks because you do (nano)sleep. But this could lead to situation when other processes lock and unlock cache while you're sleeping, and you never get locked. Probably recreating cache, say, every hour is not a big deal, though. Also, playing with numbers (more but shorter intervals) and/or using pthread_yield() could make it feel better.

2. My locks could fail do to the following reasons:

2a. flock: KDE app forks itself and both processes continue to work simultaneously (without exec'ing any of them into something other). I'm not sure this is supported by KDE Platform at all due to DBus and such.

2b. lockf: KDE app using cache tries to open (and close) cache file on behalf of KSharedDataCache too. Definitely this is not a supported behaviour too.

I have combined both your and mine diffs, adding necessary CMake checks, etc. After some thoughts I came to the following lock preference:

1. Process-shared mutexes or semaphores with timeout support
2. Spin locks
3. File locks with (semi-)timeout support
4. Blocking process-shared mutexes or semaphores
5. Blocking file locks

But before even sending the patch I want to ask about revision f791e74673278325e1d0ba010bf6f1624cdf0edd (http://quickgit.kde.org/index.php?p=kdelibs.git&a=commit&h=f791e74673278325e1d0ba010bf6f1624cdf0edd : why it was ever needed to support semaphores in parallel to mutexes? I cannot imagine an OS with process-shared semaphores but without process-shared mutexes, was this an actual case for FreeBSD? Is it still actual for now? I've added Alberto Villa to CC to resolve this issue, too.

Also, what about moving all those checks to CMake? This will remove a lot of extra code, which is always good. :)
Comment 13 Alberto Villa 2012-08-22 11:39:37 UTC
(In reply to comment #12)
> But before even sending the patch I want to ask about revision
> f791e74673278325e1d0ba010bf6f1624cdf0edd
> (http://quickgit.kde.org/index.php?p=kdelibs.
> git&a=commit&h=f791e74673278325e1d0ba010bf6f1624cdf0edd : why it was ever
> needed to support semaphores in parallel to mutexes? I cannot imagine an OS
> with process-shared semaphores but without process-shared mutexes, was this
> an actual case for FreeBSD? Is it still actual for now? I've added Alberto
> Villa to CC to resolve this issue, too.

Yes, this is an actual case for FreeBSD. If I remember correctly, shared mutexes were added in HEAD, but other supported branches don't have them for sure, so we still rely on semaphores.
Comment 14 Vadim Zhukov 2012-08-22 12:03:40 UTC
Thank you, Alberto, for clarification. Do you know, is it possible to check for process-shared semaphores support other than manually crafting the apporiate test?

More thoughts about KSharedDataCache itself (continuing from my previous comment):

3. Possibly KDE already relies on shared memory support in mmap (in other KDE modules)? I don't know. We can skip some checks in kshareddatacache.cpp then.

4. And in case of file locks we do not need shared memory at all, just simple mapping, which is good - shared memory, AFAIK, is expensive in many *nix OS, including BSD.

5. KSDC_THREAD_PROCESS_SHARED_SUPPORTED definitely needs renaming. :)
Comment 15 Vadim Zhukov 2012-08-22 13:02:16 UTC
> Thank you, Alberto, for clarification. Do you know, is it possible to check for process-shared semaphores support other than manually crafting the apporiate test?

Sorry, I forgot that sem_init() call actually contains "shared" argument and immediatly fails if process-shared semaphores are not supported. So nevermind. :)
Comment 16 Vadim Zhukov 2012-08-22 19:44:49 UTC
Created attachment 73397 [details]
KShareDataCache locking overhaul

This is a single patch combining all my experiments on the KSharedDataCache code till now. Quick list of changes:

- Move most checks to compile time. Makes things both more portable (as some implementations may get sysconf definitions wrong) and more strict (e.g., avoids the problem raised this thread initially - invalid treatment of semaphores as always process-shared ones).

- Remove support for thread-shared locks: no point in using them as there are always a few other KDE processes around anyway; at least, kdeinit one. It's better to not have cache at all than to have apps crashing.

- Fix bug in posix_fallocate() usage (probably should go in maintained stable branches as well), see the last chunk in the diff for kdecore/util/kshareddatacache_p.h.

- Implement spin locks, slightly modified version of Michael's patch.

- Implement file locks, with optional timeouts support for thread-shared part.

- Minor style nits (comments after "#endif" etc.).

If this patch will be considered OK, I can split it in parts and send individually.
Thanks in advance for review.
Comment 17 Vadim Zhukov 2012-08-22 19:57:47 UTC
Created attachment 73398 [details]
KSharedDataCache locking overhaul v1.1

Oops, wrong patch sent, sorry.
Comment 18 Michael Pyne 2012-08-25 21:18:17 UTC
Vadim,

As you've surely noticed I haven't had time to start on the review yet. I will be looking at that tonight however, just wanted to let you know it hasn't been dropped.
Comment 19 Michael Pyne 2012-08-26 04:46:48 UTC
Having had a chance to look through a bit I will ask that it be split up, and I have the following comments.

First off, I love the work done getting KSharedDataCache to use CMake properly instead of using POSIX feature detection. I should have done the same a long time ago. I would ask that #cmakedefine01 be used instead of #cmakedefine as that seems to be considered best practice now. As far as what to look for, the first priority is a process-shared primitive which can be kept in SHM, and secondly that there be some capability for timeouts (I agree wholeheartedly that crashing is bad, but freezing the entire desktop because of a corrupt cache is just as annoying)

Assuming we fix the issue about GCC intrinsics and can therefore use the spinlock I agree with removing the code handling non-process-shared primitives (I will work on a variant using Qt's atomic primitives tomorrow). As before I would still prefer having a very generic spinlock but if that is not feasible with OpenBSD's system compiler then I will test the file lock waters for OpenBSD only.

I also agree with making an enum for the wait timeout, and using it consistently.

>      {
> -        // Spin a few times attempting to gain the lock, as upper-level code won't
> -        // attempt again without assuming the cache is corrupt.
> -        for (unsigned i = 50; i > 0; --i) {
> +#ifdef HAVE_MONOTONIC_CLOCK
> +        struct timespec start, t;
> +
> +        if (::clock_gettime(CLOCK_MONOTONIC, &start) == -1) {
> +            kError(ksdcArea()) << "clock_gettime(CLOCK_MONOTONIC, ...) failed, errno:" << errno;
> +            return false;
> +        }
> +        do {
>              if (__sync_lock_test_and_set(&m_spinlock, 1) == 0) {
>                  return true;
>              }
>  
> -            struct timespec wait_time = { 0 /* sec */, 300 + 100 * i * i /* ns */ };
> -            ::nanosleep(&wait_time, static_cast<struct timespec*>(0));
> -        }
> +            if (::clock_gettime(CLOCK_MONOTONIC, &t) == -1) {
> +                kError(ksdcArea()) << "clock_gettime(CLOCK_MONOTONIC, ...) failed, errno:" << errno;
> +                return false;
> +            }
> +	} while (t.tv_sec <= start.tv_sec - KSDC_WAIT_TIMEOUT && t.tv_nsec < start.tv_nsec);
> +#else
> +        // Monotonic clock is not supported, KSharedDataCache could misbehave
> +        // on time change. Not critical, though, so no warning here.
> +        time_t start = ::time(NULL);
> +        int i = 1;
>  
> +	do {
> +            if (__sync_lock_test_and_set(&m_spinlock, 1) == 0) {
> +                return true;
> +            }
> +        } while (::time(NULL) < start + KSDC_WAIT_TIMEOUT);
> +#endif
> +        kDebug(ksdcArea()) << "spinlock timed out";
>          return false;
>      }

This code changes from a loop using nanosleep(2) with a TCP-style exponential fallback (and eventually giving up) to a loop that tries to grab the lock as frequently as possible up to a certain time limit. While you are right to use a monotonic clock if this is how you choose to implement the timeout, the choice of nanosleep(2) was deliberate for how it is defined to interact with signals, and with how it handles the time parameter (it is interval-based and this makes it easier to use increasing intervals). The concern is that the process trying to gain the lock may spin so fast that it starves the process that must run before it can release the lock. We certainly can play around with how the backoff is determined (I've not been able to do much contention testing obviously) but I'd rather use this method unless perhaps OpenBSD does not support it. One thing I see after looking at the man page is that we could restart an interrupted call, but even would is done now would still work.

> -        return pthread_mutex_timedlock(&m_mutex, &timeout) >= 0;
> +        rc = pthread_mutex_timedlock(&m_mutex, &timeout);
> +        if (rc == -1)
> +            kError(ksdcArea()) << "pthread_mutex_timedlock failed, errno:" << rc;
> +            return false;
> +        }
> +        return true;

The man pages I have checked indicate that pthread_mutex_timedlock(3) does not follow the standard POSIX convention of using errno, instead errors are returned directly. Of course, that is not very apparently from my current code where I'm assuming return values >0 are also not errors, something which is not guaranteed. So that would be a bug I must fix.

On the other hand, sem_*(3) does follow the POSIX convention so your changes to add error messages are fine (except that I would prefer that strerror(errno) be passed to debug and not just the error code).

> -        char unused[64];
> +        char unused[64 - sizeof(int)];

Double-check this... it should be unnecessary to adjust the size of 'unused' as being in a union should make the entire anonymous union 64 bytes (your change would make the union 60 bytes on most systems). Either way this would be a binary-incompatible change and so should be elided for now.

> +    // posix_fallocate is used to ensure that the file used for the cache is
> +    // actually fully committed to disk before attempting to use the file.
>      int result;
> -    while ((result = ::posix_fallocate(fd, 0, fileSize)) == EINTR) {
> -        ;
> -    }
> +    do {
> +        result = ::posix_fallocate(fd, 0, fileSize);
> +    } while (result == -1 && errno == EINTR);

For similar reasons to above, this change is actually incorrect as posix_fallocate(3) returns error values directly, it does not use errno.
Comment 20 Vadim Zhukov 2012-08-26 11:55:50 UTC
2012/8/26 Michael Pyne <mpyne@kde.org>:
> Assuming we fix the issue about GCC intrinsics and can therefore use the
> spinlock I agree with removing the code handling non-process-shared primitives
> (I will work on a variant using Qt's atomic primitives tomorrow). As before I
> would still prefer having a very generic spinlock but if that is not feasible
> with OpenBSD's system compiler then I will test the file lock waters for
> OpenBSD only.

OpenBSD uses GCC 4.2.1 FWIW and will not move to newer version due to
famous license change. CLang path is being investigated; also, Bitrig
fork was created recently that has migration to LLVM as one of main
goals. FreeBSD is moving to CLang too...

>>      {
>> -        // Spin a few times attempting to gain the lock, as upper-level code won't
>> -        // attempt again without assuming the cache is corrupt.
>> -        for (unsigned i = 50; i > 0; --i) {
>> +#ifdef HAVE_MONOTONIC_CLOCK
>> +        struct timespec start, t;
>> +
>> +        if (::clock_gettime(CLOCK_MONOTONIC, &start) == -1) {
>> +            kError(ksdcArea()) << "clock_gettime(CLOCK_MONOTONIC, ...) failed, errno:" << errno;
>> +            return false;
>> +        }
>> +        do {
>>              if (__sync_lock_test_and_set(&m_spinlock, 1) == 0) {
>>                  return true;
>>              }
>>
>> -            struct timespec wait_time = { 0 /* sec */, 300 + 100 * i * i /* ns */ };
>> -            ::nanosleep(&wait_time, static_cast<struct timespec*>(0));
>> -        }
>> +            if (::clock_gettime(CLOCK_MONOTONIC, &t) == -1) {
>> +                kError(ksdcArea()) << "clock_gettime(CLOCK_MONOTONIC, ...) failed, errno:" << errno;
>> +                return false;
>> +            }
>> +     } while (t.tv_sec <= start.tv_sec - KSDC_WAIT_TIMEOUT && t.tv_nsec < start.tv_nsec);
>> +#else
>> +        // Monotonic clock is not supported, KSharedDataCache could misbehave
>> +        // on time change. Not critical, though, so no warning here.
>> +        time_t start = ::time(NULL);
>> +        int i = 1;
>>
>> +     do {
>> +            if (__sync_lock_test_and_set(&m_spinlock, 1) == 0) {
>> +                return true;
>> +            }
>> +        } while (::time(NULL) < start + KSDC_WAIT_TIMEOUT);
>> +#endif
>> +        kDebug(ksdcArea()) << "spinlock timed out";
>>          return false;
>>      }
>
> This code changes from a loop using nanosleep(2) with a TCP-style exponential
> fallback (and eventually giving up) to a loop that tries to grab the lock as
> frequently as possible up to a certain time limit. While you are right to use a
> monotonic clock if this is how you choose to implement the timeout, the choice
> of nanosleep(2) was deliberate for how it is defined to interact with signals,
> and with how it handles the time parameter (it is interval-based and this makes
> it easier to use increasing intervals). The concern is that the process trying
> to gain the lock may spin so fast that it starves the process that must run
> before it can release the lock.

Well, this could be true for OS with cooperative scheduling. But all
modern generic usage OSes do preemptive scheduling, so other processes
will have a lot of time to work with cache. But words are cheap...
I'll try to run some stability tests with and without sleeping and
report here.

> We certainly can play around with how the
> backoff is determined (I've not been able to do much contention testing
> obviously) but I'd rather use this method unless perhaps OpenBSD does not
> support it. One thing I see after looking at the man page is that we could
> restart an interrupted call, but even would is done now would still work.
>
>> -        return pthread_mutex_timedlock(&m_mutex, &timeout) >= 0;
>> +        rc = pthread_mutex_timedlock(&m_mutex, &timeout);
>> +        if (rc == -1)
>> +            kError(ksdcArea()) << "pthread_mutex_timedlock failed, errno:" << rc;
>> +            return false;
>> +        }
>> +        return true;
>
> The man pages I have checked indicate that pthread_mutex_timedlock(3) does not
> follow the standard POSIX convention of using errno, instead errors are
> returned directly. Of course, that is not very apparently from my current code
> where I'm assuming return values >0 are also not errors, something which is not
> guaranteed. So that would be a bug I must fix.

Yep, I've already detected this bug in my code. I'm splitting the
patch into logical chunks, and hope to publish them on review board
soon.

> On the other hand, sem_*(3) does follow the POSIX convention so your changes to
> add error messages are fine (except that I would prefer that strerror(errno) be
> passed to debug and not just the error code).
>
>> -        char unused[64];
>> +        char unused[64 - sizeof(int)];
>
> Double-check this... it should be unnecessary to adjust the size of 'unused' as
> being in a union should make the entire anonymous union 64 bytes (your change
> would make the union 60 bytes on most systems). Either way this would be a
> binary-incompatible change and so should be elided for now.

Yes, I've got it wrong.

>> +    // posix_fallocate is used to ensure that the file used for the cache is
>> +    // actually fully committed to disk before attempting to use the file.
>>      int result;
>> -    while ((result = ::posix_fallocate(fd, 0, fileSize)) == EINTR) {
>> -        ;
>> -    }
>> +    do {
>> +        result = ::posix_fallocate(fd, 0, fileSize);
>> +    } while (result == -1 && errno == EINTR);
>
> For similar reasons to above, this change is actually incorrect as
> posix_fallocate(3) returns error values directly, it does not use errno.

Oops. Somehow I read
http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
incorrectly. Sorry for that part.

Thank you a lot for review, Michael!
Comment 21 Vadim Zhukov 2012-08-26 19:10:27 UTC
I've added three review requests: https://git.reviewboard.kde.org/r/106174/ and https://git.reviewboard.kde.org/r/106221/ and https://git.reviewboard.kde.org/r/106224/ . No file locking patch there, sorry; gonna come up with it a bit later.
Comment 22 Michael Pyne 2012-08-26 19:34:51 UTC
Git commit d7b0699df144b377afa50de2fabc11f02b2831e7 by Michael Pyne.
Committed on 26/08/2012 at 21:26.
Pushed by mpyne into branch 'KDE/4.9'.

kshareddatacache: Use proper return value check for timedlock.

I discovered that the return value was being checked as if
pthread_mutex_timedlock used errno for return values while doing a code
review for bug 305023. In reality the only non-error return is 0,
anything else is an error code which is returned directly.

I'm making the fix now so it can join KDE Platform 4.9.1.

M  +1    -1    kdecore/util/kshareddatacache_p.h

http://commits.kde.org/kdelibs/d7b0699df144b377afa50de2fabc11f02b2831e7
Comment 23 Michael Pyne 2012-09-03 22:30:14 UTC
This *particular* issue should be fixed by commit 692d5bb79649cbfb5290089e585708d5dd1b30e2, although I marked the wrong bug as fixed by mistake.