Bug 434926 - Crash in Baloo::IdFilenameDB::get() after Baloo::DocumentUrlDB::get
Summary: Crash in Baloo::IdFilenameDB::get() after Baloo::DocumentUrlDB::get
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.80.0
Platform: Arch Linux (Linux)
Importance: NOR crash
Target Milestone: ---
Assignee: Stefan Brüns
Duplicates: 433980 450722 461096
Reported: 2021-03-25 11:53 UTC by a
Modified: 2022-10-28 11:30 UTC
CC: 9 users

Description a 2021-03-25 11:53:41 UTC
SUMMARY

Crash at boot

STEPS TO REPRODUCE
1. Launch KDE

OBSERVED RESULT

```
systemd-coredump[2267]: Process 964 (baloo_file) of user 1000 dumped core.

Stack trace of thread 1339:
#0  0x00007f72a74dd396 n/a (liblmdb.so + 0x4396)
#1  0x00007f72a74dfefe n/a (liblmdb.so + 0x6efe)
#2  0x00007f72a74e0644 n/a (liblmdb.so + 0x7644)
#3  0x00007f72a74e0c50 mdb_get (liblmdb.so + 0x7c50)
#4  0x00007f72a89bc355 _ZN5Baloo12IdFilenameDB3getEy (libKF5BalooEngine.so.5 + 0x153>
#5  0x00007f72a89b63ac _ZNK5Baloo13DocumentUrlDB3getEy (libKF5BalooEngine.so.5 + 0xf>
#6  0x00007f72a89c7135 _ZNK5Baloo11Transaction11documentUrlEy (libKF5BalooEngine.so.>
#7  0x00005612b8c95e44 n/a (baloo_file + 0x1de44)
#8  0x00007f72a89ce59f _ZN5Baloo16WriteTransaction17removeRecursivelyEyRKSt8function>
#9  0x00007f72a89ce5f7 _ZN5Baloo16WriteTransaction17removeRecursivelyEyRKSt8function>
[frames #10-#63 are identical recursive calls to Baloo::WriteTransaction::removeRecursively at 0x00007f72a89ce5f7]

Stack trace of thread 964:
#0  0x00007f72a805537f __poll (libc.so.6 + 0xf437f)
#1  0x00007f72a6ce69d8 n/a (libglib-2.0.so.0 + 0xa79d8)
#2  0x00007f72a6c906f1 g_main_context_iteration (libglib-2.0.so.0 + 0x516f1)
#3  0x00007f72a875e691 _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoo>
#4  0x00007f72a87043ac _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt5>
#5  0x00007f72a870c844 _ZN16QCoreApplication4execEv (libQt5Core.so.5 + 0x2bc844)
#6  0x00005612b8c852b4 n/a (baloo_file + 0xd2b4)
#7  0x00007f72a7f88b25 __libc_start_main (libc.so.6 + 0x27b25)
#8  0x00005612b8c854be n/a (baloo_file + 0xd4be)

Stack trace of thread 975:
#0  0x00007f72a805537f __poll (libc.so.6 + 0xf437f)
#1  0x00007f72a6ce69d8 n/a (libglib-2.0.so.0 + 0xa79d8)
#2  0x00007f72a6c906f1 g_main_context_iteration (libglib-2.0.so.0 + 0x516f1)
#3  0x00007f72a875e691 _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoo>
#4  0x00007f72a87043ac _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt5>
#5  0x00007f72a851cd12 _ZN7QThread4execEv (libQt5Core.so.5 + 0xccd12)
#6  0x00007f72a89f2098 n/a (libQt5DBus.so.5 + 0x17098)
#7  0x00007f72a851deff n/a (libQt5Core.so.5 + 0xcdeff)
#8  0x00007f72a754f299 start_thread (libpthread.so.0 + 0x9299)
#9  0x00007f72a8060053 __clone (libc.so.6 + 0xff053)

Stack trace of thread 978:
#0  0x00007f72a8057bf3 pselect (libc.so.6 + 0xf6bf3)
#1  0x00007f72a6720524 n/a (libusbmuxd-2.0.so.6 + 0x2524)
#2  0x00007f72a67218a9 n/a (libusbmuxd-2.0.so.6 + 0x38a9)
#3  0x00007f72a754f299 start_thread (libpthread.so.0 + 0x9299)
#4  0x00007f72a8060053 __clone (libc.so.6 + 0xff053)

```

EXPECTED RESULT

No crash; file results appear in Dolphin searches.

SOFTWARE/OS VERSIONS
Arch Linux, all packages up to date

ADDITIONAL INFORMATION
Comment 1 Nate Graham 2021-03-25 18:29:53 UTC
See also Bug 372880 which has a similar but slightly different backtrace.
Comment 2 Bernie Innocenti 2021-12-04 20:21:35 UTC
I can reproduce this crash with baloo built from git.
It happens every time I launch baloo_file:

[New Thread 0x7ffff16c5640 (LWP 1852)]

Thread 4 "Thread (pooled)" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffff16c5640 (LWP 1852)]
0x00007ffff75db86a in ?? () from /usr/lib/liblmdb.so
(gdb) bt
#0  0x00007ffff75db86a in ?? () from /usr/lib/liblmdb.so
#1  0x00007ffff75dec40 in ?? () from /usr/lib/liblmdb.so
#2  0x00007ffff75df644 in ?? () from /usr/lib/liblmdb.so
#3  0x00007ffff75dfc50 in mdb_get () from /usr/lib/liblmdb.so
#4  0x00007ffff7cdbe41 in Baloo::IdFilenameDB::get (this=0x7ffff16c46d0, docId=562640780800)
    at /home/bernie/kde/src/baloo/src/engine/idfilenamedb.cpp:83
#5  0x00007ffff7cd10e6 in Baloo::DocumentUrlDB::get (this=0x7ffff16c4770, docId=562640780800)
    at /home/bernie/kde/src/baloo/src/engine/documenturldb.cpp:172
#6  0x00007ffff7ceb572 in Baloo::Transaction::documentUrl (this=0x7ffff16c4a40, id=562640780800)
    at /home/bernie/kde/src/baloo/src/engine/transaction.cpp:102
#7  0x00005555555819d5 in operator() (__closure=0x7fbfe8004eb0, id=562640780800)
    at /home/bernie/kde/src/baloo/src/file/indexcleaner.cpp:40
#8  0x0000555555582372 in std::__invoke_impl<bool, Baloo::IndexCleaner::run()::<lambda(quint64)>&, long long unsigned int>(std::__invoke_other, struct {...} &) (__f=@0x7fbfe8004eb0: {__tr = @0x7ffff16c4a40, __this = 0x5555556b1c50, __mimeDb = @0x7ffff16c49c0})
    at /usr/include/c++/11.1.0/bits/invoke.h:61
#9  0x0000555555582254 in std::__invoke_r<bool, Baloo::IndexCleaner::run()::<lambda(quint64)>&, long long unsigned int>(struct {...} &) (
    __fn=@0x7fbfe8004eb0: {__tr = @0x7ffff16c4a40, __this = 0x5555556b1c50, __mimeDb = @0x7ffff16c49c0})
    at /usr/include/c++/11.1.0/bits/invoke.h:114
#10 0x000055555558211e in std::_Function_handler<bool(long long unsigned int), Baloo::IndexCleaner::run()::<lambda(quint64)> >::_M_invoke(const std::_Any_data &, unsigned long long &&) (__functor=
      @0x7ffff16c4a60: {_M_unused = {_M_object = 0x7fbfe8004eb0, _M_const_object = 0x7fbfe8004eb0, _M_function_pointer = 0x7fbfe8004eb0, _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x7fbfe8004eb0, this adjustment 15}, _M_pod_data = "\260N\000\350\277\177\000\000\017\000\000\000\000\000\000"}, __args#0=@0x7ffff16c48b0: 562640780800)
    at /usr/include/c++/11.1.0/bits/std_function.h:291
#11 0x00007ffff7cf4ebb in std::function<bool (unsigned long long)>::operator()(unsigned long long) const (this=0x7ffff16c4a60, 
    __args#0=562640780800) at /usr/include/c++/11.1.0/bits/std_function.h:560
#12 0x00007ffff7cf316e in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&)
    (this=0x7fbfe8004e10, parentId=562640780800, shouldDelete=
      @0x7ffff16c4a60: {<std::_Maybe_unary_or_binary_function<bool, unsigned long long>> = {<std::unary_function<unsigned long long, bool>> = {<No data fields>}, <No data fields>}, <std::_Function_base> = {static _M_max_size = 16, static _M_max_align = 8, _M_functor = {_M_unused = {_M_object = 0x7fbfe8004eb0, _M_const_object = 0x7fbfe8004eb0, _M_function_pointer = 0x7fbfe8004eb0, _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x7fbfe8004eb0, this adjustment 15}, _M_pod_data = "\260N\000\350\277\177\000\000\017\000\000\000\000\000\000"}, _M_manager = 0x555555582124 <std::_Function_handler<bool(long long unsigned int), Baloo::IndexCleaner::run()::<lambda(quint64)> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation)>}, _M_invoker = 0x5555555820e7 <std::_Function_handler<bool(long long unsigned int), Baloo::IndexCleaner::run()::<lambda(quint64)> >::_M_invoke(const std::_Any_data &, unsigned long long &&)>}) at /home/bernie/kde/src/baloo/src/engine/writetransaction.cpp:160
#13 0x0000555555582517 in Baloo::Transaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)>) (
    this=0x7ffff16c4a40, parentId=562640780800, shouldDelete=
      {<std::_Maybe_unary_or_binary_function<bool, unsigned long long>> = {<std::unary_function<unsigned long long, bool>> = {<No data fields>}, <No data fields>}, <std::_Function_base> = {static _M_max_size = 16, static _M_max_align = 8, _M_functor = {_M_unused = {_M_object = 0x7fbfe8004eb0, _M_const_object = 0x7fbfe8004eb0, _M_function_pointer = 0x7fbfe8004eb0, _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x7fbfe8004eb0, this adjustment 15}, _M_pod_data = "\260N\000\350\277\177\000\000\017\000\000\000\000\000\000"}, _M_manager = 0x555555582124 <std::_Function_handler<bool(long long unsigned int), Baloo::IndexCleaner::run()::<lambda(quint64)> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation)>}, _M_invoker = 0x5555555820e7 <std::_Function_handler<bool(long long unsigned int), Baloo::IndexCleaner::run()::<lambda(quint64)> >::_M_invoke(const std::_Any_data &, unsigned long long &&)>})
    at /home/bernie/kde/src/baloo/src/engine/transaction.h:101
#14 0x0000555555581e2b in Baloo::IndexCleaner::run (this=0x5555556b1c50) at /home/bernie/kde/src/baloo/src/file/indexcleaner.cpp:66
#15 0x00007ffff76c1332 in ?? () from /usr/lib/libQt5Core.so.5
#16 0x00007ffff76be02f in ?? () from /usr/lib/libQt5Core.so.5
#17 0x00007ffff601a259 in start_thread () from /usr/lib/libpthread.so.0
#18 0x00007ffff71935e3 in clone () from /usr/lib/libc.so.6
(gdb) 

Trying to figure out how to print QStrings from gdb to find out which file triggers the crash...
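
For reference, one way that can work with Qt 5's internal layout (assuming debug symbols; `str` stands for any QString variable here, and the offsets are Qt-internal, so they may differ between versions):

(gdb) # Qt 5 stores the UTF-16 data at (char*)str.d + str.d->offset
(gdb) p *(char16_t *)((char *)str.d + str.d->offset)@str.d->size

On a live, non-optimized build, an inferior call such as `p str.toUtf8().constData()` may also work.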
Comment 3 Bernie Innocenti 2021-12-04 21:13:21 UTC
In case it helps, I was able to build liblmdb with debug symbols and find the exact point where the crash occurs:

#0  0x00007ffff75db86a in mdb_node_search (mc=mc@entry=0x7ffff16c4240, key=key@entry=0x7ffff16c4620, exactp=exactp@entry=0x7ffff16c415c) at mdb.c:5341
 5338                    while (low <= high) {
 5339                            i = (low + high) >> 1;
 5340
 5341   HERE ->           node = NODEPTR(mp, i);
 5342                            nodekey.mv_size = NODEKSZ(node);
 5343                            nodekey.mv_data = NODEKEY(node);
 5344
 5345                            rc = cmp(key, &nodekey);

(gdb) p i
$6 = 0
(gdb) p *mp
$7 = {
  mp_p = {
    p_pgno = 42014499,
    p_next = 0x2811723
  },
  mp_pad = 5922,
  mp_flags = 641,
  mp_pb = {
    pb = {
      pb_lower = 0,
      pb_upper = 0
    },
    pb_pages = 0
  },
  mp_ptrs = {5921}
}

This seems suspicious:

(gdb) p low
$8 = 1
(gdb) p high
$9 = 2147483639

How can we get i=0 from "i = (low + high) >> 1" at line 5339?
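
The oversized `high` is explainable from the corrupted page header dumped above. Here is a minimal standalone sketch (not LMDB code) of the arithmetic, using the NUMKEYS() expansion quoted in comment 9 below, with LMDB's usual PAGEHDRSZ = 16 and PAGEBASE = 0:

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t pb_lower = 0;  /* the corrupted value from "p *mp" above */
    /* NUMKEYS(mp) expands to ((mp->mp_pb.pb.pb_lower - (PAGEHDRSZ-PAGEBASE)) >> 1) */
    uint32_t nkeys = ((uint32_t)pb_lower - (16u - 0u)) >> 1;
    uint32_t high  = nkeys - 1;  /* mdb_node_search() sets high = nkeys - 1 */
    printf("nkeys = %u, high = %u\n", nkeys, high);
    /* prints: nkeys = 2147483640, high = 2147483639 */
    return 0;
}
```

That reproduces the suspicious high = 2147483639 exactly (and low = 1 is what mdb_node_search() uses for branch pages); the printed i = 0 may simply be a stale or optimized-out value rather than the result of that expression.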
Comment 4 Nate Graham 2021-12-05 04:13:15 UTC
Do you think it's an lmdb issue?
Comment 5 Bernie Innocenti 2021-12-06 01:27:34 UTC
(In reply to Nate Graham from comment #4)
> Do you think it's an lmdb issue?

I couldn't exclude it, but lmdb hasn't been updated recently on Arch, and the problem started when I refreshed my kdesrc-build install.

Could be a race condition, since the crash happens shortly after thread 4 starts. At the time of the crash, the other threads are not running in lmdb, but they might have caused corruption earlier.

I tried rebuilding lmdb with DPRINTF enabled, but it didn't print anything. Any clues from people who are more familiar with this library?
Comment 6 Bernie Innocenti 2021-12-06 01:33:52 UTC
Perhaps relevant (but only if Baloo uses LMDB from multiple threads):

  Threads and Processes
  LMDB uses POSIX locks on files, and these locks have issues if one process opens a file multiple times. Because of this, do not mdb_env_open() a file multiple times from a single process. Instead, share the LMDB environment that has opened the file across all threads. Otherwise, if a single process opens the same environment multiple times, closing it once will remove all the locks held on it, and the other instances will be vulnerable to corruption from other processes.
  Also note that a transaction is tied to one thread by default using Thread Local Storage. If you want to pass read-only transactions across threads, you can use the MDB_NOTLS option on the environment.

Source: http://www.lmdb.tech/doc/starting.html
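
A minimal sketch of the pattern those docs prescribe, using the public LMDB C API (the path handling and flags here are illustrative, not necessarily what Baloo actually does):

```
#include <lmdb.h>

/* One environment per process, opened once and shared by all threads. */
static MDB_env *g_env = NULL;

int open_index_once(const char *path)
{
    int rc = mdb_env_create(&g_env);
    if (rc != MDB_SUCCESS)
        return rc;
    mdb_env_set_maxdbs(g_env, 12);  /* Baloo defines 12 named databases */
    rc = mdb_env_open(g_env, path, MDB_NOSUBDIR, 0644);
    if (rc != MDB_SUCCESS) {
        mdb_env_close(g_env);  /* never mdb_env_open() the same file again in this process */
        g_env = NULL;
    }
    return rc;
}
```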
Comment 7 Nate Graham 2022-02-23 20:54:11 UTC
*** Bug 450722 has been marked as a duplicate of this bug. ***
Comment 8 Nate Graham 2022-02-23 20:54:16 UTC
*** Bug 433980 has been marked as a duplicate of this bug. ***
Comment 9 nyanpasu64 2022-06-23 00:33:02 UTC
I've been getting constant baloo crashes myself too, but within the last few weeks it's started happening more often (every time I search in the application launcher or similar).

To debug, I ran baloo_file under rr, and traced the resulting crash using Pernosco. (Sorry, I don't feel comfortable sharing the URL since the trace contains filesystem paths.)

Oddly, baloo_file's main thread spawns a worker thread and a child process (which itself spawns a worker thread). Then the parent process's worker thread crashes (taking the main thread with it), while the child process continues running in the background like a daemon (I'm not sure exactly what happens; it may itself die at a later time). I don't see any thread-unsafety related to this crash.

The crash backtrace is:

```
(pernosco) bt 
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=7, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f48bc1253d3 in __pthread_kill_internal (signo=7, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f48bc0d5838 in __GI_raise (sig=7) at ../sysdeps/posix/raise.c:26
#3  0x00007f48bcbfe384 in KCrash::defaultCrashHandler(int) () from /sysroot/usr/lib/libKF5Crash.so.5
#4  <signal handler called>
#5  0x00007f48bb641884 in mdb_node_search (mc=mc@entry=0x7f48b5ecd380, key=key@entry=0x7f48b5ecd760, exactp=exactp@entry=0x7f48b5ecd37c) at mdb.c:5341
#6  0x00007f48bb64560f in mdb_cursor_set (mc=mc@entry=0x7f48b5ecd380, key=key@entry=0x7f48b5ecd760, data=data@entry=0x7f48b5ecd750, op=op@entry=MDB_SET, exactp=exactp@entry=0x7f48b5ecd37c) at mdb.c:6157
#7  0x00007f48bb645bcf in mdb_get (txn=<optimized out>, dbi=<optimized out>, key=0x7f48b5ecd760, data=0x7f48b5ecd750) at mdb.c:5812
#8  0x00007f48bcaf22fc in Baloo::DocumentTimeDB::get (this=<optimized out>, docId=<optimized out>) at /usr/src/debug/baloo-5.95.0/src/engine/documenttimedb.cpp:76
#9  0x00007f48bcb01aff in Baloo::Transaction::documentTimeInfo (this=<optimized out>, id=id@entry=72147491998400538) at /usr/src/debug/baloo-5.95.0/src/engine/transaction.cpp:133
#10 0x000056133285052c in Baloo::UnIndexedFileIterator::shouldIndex (filePath=..., this=0x7f48b5ecd8f0) at /usr/src/debug/baloo-5.95.0/src/file/unindexedfileiterator.cpp:83
#11 Baloo::UnIndexedFileIterator::next (this=<optimized out>) at /usr/src/debug/baloo-5.95.0/src/file/unindexedfileiterator.cpp:64
#12 Baloo::UnindexedFileIndexer::run (this=0x5613341a59a0) at /usr/src/debug/baloo-5.95.0/src/file/unindexedfileindexer.cpp:36
#13 0x00007f48bc6a9291 in QThreadPoolThread::run (this=0x5613345491e0) at thread/qthreadpool.cpp:100
#14 0x00007f48bc6a538a in QThreadPrivate::start (arg=0x5613345491e0) at thread/qthread_unix.cpp:331
#15 0x00007f48bc12354d in start_thread (arg=<optimized out>) at pthread_create.c:442
#16 0x00007f48bc1a8874 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
```

The causality of the bug is:

- fd = mdb_fopen("/home/nyanpasu64/.local/share/baloo/index")
- ...env->me_map = mmap(addr, env->me_mapsize, prot, mmap_flags, env->me_fd (=fd), 0);
- After many successful mdb_node_search() calls comes a failed call. mdb_node_search() calls nkeys = NUMKEYS(mp), which expands to ((mp->mp_pb.pb.pb_lower - (PAGEHDRSZ-PAGEBASE)) >> 1). mp->mp_pb.pb.pb_lower is (uint16_t)0. It should not be 0 since it's subtracted from. PAGEHDRSZ and PAGEBASE are unsigned (uint32_t), so the result is computed as uint32_t (close to 2^32), then right-shifted by 1 (close to 2^31). This value is invalid and causes LMDB mdb_node_search() to crash (I haven't traced exactly how).
    - According to Pernosco, mp points within the above mmap() call.
    - https://stackoverflow.com/q/2089167 says "SIGBUS can happen in Linux for quite a few reasons other than memory alignment faults - for example, if you attempt to access an mmap region beyond the end of the mapped file."

If Pernosco is correct, my guess is that this is a symptom of a corrupt Baloo index holding invalid data, which LMDB memory-maps without properly checking for corrupted data inside. And my assumption is that the various different Baloo crashes are caused by databases corrupted in different ways (but both Bernie Innocenti's crash and mine boil down to mdb_node_search() in the end), all with inadequate error checking.

- The immediate workaround is to delete (or rename or trash) ~/.local/share/baloo/index. I don't know *how* the Baloo index got corrupted in the first place though.
- Should LMDB perform more thorough error-checking in mdb_node_search() and possibly other functions, and return a "corrupted database" error rather than SIGBUS? (A sketch of such a check follows below.)
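
On that last question, a hypothetical sketch of the kind of validation mdb_node_search() could perform before trusting a page header (field and macro names follow mdb.c; the helper itself and its error plumbing are illustrative, not an actual LMDB patch):

```
/* Inside mdb.c: reject a page whose bounds cannot be valid, instead of
 * letting NUMKEYS() wrap around to ~2^31 and walking off the map. */
static int mdb_page_sane(const MDB_page *mp)
{
    if (mp->mp_lower < PAGEHDRSZ ||    /* pb_lower == 0 is the case seen here */
        mp->mp_upper < mp->mp_lower)
        return MDB_CORRUPTED;
    return MDB_SUCCESS;
}
```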
Comment 10 tagwerk19 2022-06-23 08:36:44 UTC
(In reply to nyanpasu64 from comment #9)
> ... every time I searched in the application launcher or similar
I'll admit I have a very naive knowledge of "the internals"...

My understanding is that when something is searching, it opens the index "read-only" (and many things can open the index for reading concurrently). When updating the index, baloo_file opens it "read-write" (and only one thing should open the index "read-write" at a time). If baloo_file sees the index opened for reading, it appends its writes to the index.

Does baloo_file crash when it wants to update the index and sees it is (or has been) opened read-only?
Comment 11 nyanpasu64 2022-06-24 15:58:56 UTC
(In reply to tagwerk19 from comment #10)
> My understanding is when something is searching it opens the index
> "read-only" (and many things can open the index for reading concurrently).
> When updating the index, baloo_file opens "read-write" (and just the one
> thing should open the index "read write" at one time). If baloo_file sees
> the index opened for reading, it appends its writes to the index.

That's plausible.

> Baloo_file crashes when it is wanting to update the index and sees it is (or
> has been) opened read-only?

I think baloo_file crashes when it asks LMDB to read existing data from disk that turns out to be malformed: LMDB underflows an unsigned int and (I think) performs an out-of-bounds pointer access. The pointer actually points within the 256 gigabytes that baloo_file/LMDB mmaps, but past the end of the underlying file, so the thread (and process) dies with SIGBUS.

## Do other processes write to the database? (unknown)

I don't know who wrote the corrupted file. Is baloo_file the only process to write to the database? I don't know, and I'll probably find out from reading compiler errors, once I try ripping LMDB out of Baloo and replacing it with SQLite to evaluate the performance differences. It's probably going to be more reliable; SQLite is known for robustness, whereas LMDB and other mmap-based databases (and Baloo's usage of them) are known to corrupt easily.

## Does baloo_file open LMDB in an unsafe mode? (no)

It appears that for writable databases, Baloo calls mdb_env_open() with MDB_NOSUBDIR | MDB_NOMEMINIT (https://invent.kde.org/frameworks/baloo/-/blob/master/src/engine/database.cpp#L123-131). Reading https://lmdb.readthedocs.io/en/release/#lmdb.Environment and http://www.lmdb.tech/doc/group__mdb.html#ga32a193c6bf4d7d5c5d579e71f22e9340, MDB_NOSYNC is known to cause lost or corrupted data but we do not pass it, and MDB_WRITEMAP also has some tradeoffs but we don't pass it either. So baloo_file is not operating LMDB in a mode where it's known to corrupt data. (I don't know if any other processes *are*.)

## Does baloo_file make threading errors? (unknown)

Reading http://www.lmdb.tech/doc/group__mdb.html#ga32a193c6bf4d7d5c5d579e71f22e9340, if you don't pass MDB_NOTLS, each thread is only allowed to create one read transaction at a time, and you're not allowed to pass transactions between threads. Reading https://lmdb.readthedocs.io/en/release/#threads, even if you do pass it, a write transaction cannot be passed between threads, and Cursor is scoped within a transaction or something.
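
A sketch of the default (no MDB_NOTLS) discipline those docs describe: each thread begins and ends its own read-only transaction and copies values out before ending it, because mdb_get() returns pointers into the shared memory map (the helper and its names are illustrative):

```
#include <lmdb.h>
#include <string.h>

int read_one_value(MDB_env *env, MDB_dbi dbi, MDB_val *key,
                   void *buf, size_t buflen)
{
    MDB_txn *txn;
    MDB_val data;
    int rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);  /* this thread only */
    if (rc != MDB_SUCCESS)
        return rc;
    rc = mdb_get(txn, dbi, key, &data);
    if (rc == MDB_SUCCESS) {
        /* data.mv_data points into the mmap and is valid only while the
         * transaction is open, so copy before ending it. */
        size_t n = data.mv_size < buflen ? data.mv_size : buflen;
        memcpy(buf, data.mv_data, n);
    }
    mdb_txn_abort(txn);  /* read-only transactions end with abort */
    return rc;
}
```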

baloo_file's main thread initializes a lazy-init static Baloo::Database through Baloo::globalDatabaseInstance() (which never gets freed, so no use-after-free), then opens the database (main -> Baloo::Database::open -> mdb_env_open()), and calls ...qt -> FileIndexScheduler::scheduleIndexing(), which creates UnindexedFileIndexer and schedules UnindexedFileIndexer::run() on a Qt worker thread.

The crash happens in UnindexedFileIndexer::run(). I don't know if there's thread-unsafety going on, whether anything on the main thread invoked by Qt's event loop accesses Baloo::Database or MDB_env concurrently. Perhaps I could try wrapping UnindexedFileIndexer or Baloo::Database in a mutex, and see who accesses it.
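
A minimal sketch of that instrumentation idea in C (hypothetical names; Baloo itself is C++, but the shape is the same): funnel every database access through one mutex and log contention, to see whether the event loop and the indexer ever touch the environment concurrently.

```
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t db_mu = PTHREAD_MUTEX_INITIALIZER;

void with_db(const char *who, void (*op)(void *), void *arg)
{
    if (pthread_mutex_trylock(&db_mu) != 0) {
        fprintf(stderr, "contention: %s found the database lock busy\n", who);
        pthread_mutex_lock(&db_mu);  /* wait for the other accessor */
    }
    op(arg);  /* the wrapped Baloo::Database / MDB_env access */
    pthread_mutex_unlock(&db_mu);
}
```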
Comment 12 tagwerk19 2022-06-25 07:37:08 UTC
(In reply to nyanpasu64 from comment #11)
> ## Do other processes write to the database? (unknown)
My feeling, no. It's just baloo_file and locking is absolutely critical.

I'm not so sure how/when baloo_file recognises that the index is being "read" and therefore has to append instead of update; however, it's clear that this is happening if you look at Bug 437754 (where you see that a "balooctl status", which seems to enumerate files to be indexed, means that updates are "appends" and the index grows dramatically).

> ... I don't know who wrote the corrupted file
I know there was a flood of "corruption" reports (Bug 389848). This issue was found, but the fix left existing indexes corrupt, and it became normal to recommend purging and rebuilding the index (Bug 431664). Yes, still quite a while ago, and the number of these reports is dropping away, but it did resurface when people upgraded from Debian 10 to 11 (which was only the end of last year).

Would writing the "bad pointer" be noticeable in the code? Might it be possible to add asserts?

> The crash happens in UnindexedFileIndexer::run()
Interesting, in that baloo "batches up" its content indexing work (where it analyses 40 files at a time and writes the results to the index); however, it deals with the initial scan of files it needs to index in a single tranche: give it a hundred thousand files to index and it will collect the information for all of them and write the results to the index in one go. This can be pretty horrible (see Bug 394750).

No reason that this is a cause but it is a behaviour that might raise the stakes...

> ... evaluate the performance differences
One of the joys of baloo is its amazing speed: you can type a search string and see the results refine themselves on screen.
Comment 13 nyanpasu64 2022-06-26 03:41:04 UTC
After renaming my corrupted database to data.mdb (and keeping a backup copy), I decided to try checking if the corruption occurred in Baloo's memory or if the database was already corrupt on-disk. It's corrupt on-disk.

 > mdb_dump -s documenttimedb .|pv>/dev/null
    mdb.c:5856: Assertion 'IS_BRANCH(mc->mc_pg[mc->mc_top])' failed in mdb_cursor_sibling()
    10.1MiB 0:00:00
    fish: Process 127289, 'mdb_dump' from job 1, 'mdb_dump -s documenttimedb .|pv…' terminated by signal SIGABRT (Abort)
 > mdb_dump -a .|pv>/dev/null
    mdb.c:5856: Assertion 'IS_BRANCH(mc->mc_pg[mc->mc_top])' failed in mdb_cursor_sibling()
    68.7MiB 0:00:00

Running gdb on both `mdb_dump -s documenttimedb . -f /dev/null` and `mdb_dump -a . -f /dev/null`, I found that the bad page that crashes (a sibling of other midway pages, but holding block-like data) occurs at a *different* file/mmap offset (0x03CAB000) than my initial Baloo crash (0x5925000)! Are they similar or not? (somewhat:)

- 0x03CAB000 comes after data containing strings like "Fpathpercent" and "Fpathetique", which I believe was created by TermGenerator::indexFileNameText() inserting "F" + (words appearing in filenames) into LMDB. The page starting at 0x03CAB000 itself has a weak 10-byte periodicity. The 32-bit integer 0x00003CAB (page address >> 12) appears 10 times in the database file.

- 0x5925000 exists in a region with a strong 10-byte periodicity both before and after the page starts. The 32-bit integer 0x00005925 appears a whopping 307 times in the file!

Seeking Audacity to offset 93474816, I see data with a periodicity of 10. Valid page headers show 2-byte periodicities. This pointer doesn't point to page metadata! Either the page contents were overwritten or never written, or the page pointer was written incorrectly.
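
A standalone sketch of the occurrence-counting used above (a hypothetical little tool, assuming a little-endian host, which matches how LMDB stores page numbers on x86); invoke it as e.g. `./scan32 data.mdb 5925`:

```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counts byte-aligned little-endian occurrences of a 32-bit value
 * (e.g. a suspect page number) anywhere in a file. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <hex-value>\n", argv[0]);
        return 2;
    }
    uint32_t needle = (uint32_t)strtoul(argv[2], NULL, 16);
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    uint8_t *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) return 1;
    long count = 0;
    for (long i = 0; i + 4 <= size; i++)
        if (memcmp(buf + i, &needle, 4) == 0)
            count++;
    printf("%ld occurrence(s)\n", count);
    free(buf);
    fclose(f);
    return 0;
}
```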

I haven't tried modifying LMDB to scan the *entire* database, continuing on errors, and logging *all* data inconsistencies. I think that would help gather more data to understand what kind of corruption is happening.

(In reply to tagwerk19 from comment #12)
> I'm not so sure how/when baloo_file recognises when the index is being
> "read" and therefore has to append instead of update however it's clear that
> this is happening is you look at Bug 437754 (where you see that a "balooctl
> status", which seems to enumerate files to be indexed, means that updates
> are "appends" and the index grows dramatically).
https://schd.ws/hosted_files/buildstuff14/96/20141120-BuildStuff-Lightning.pdf describes page reclamation. Of note:
> LMDB maintains a free list tracking the IDs of unused pages
> Old pages are reused as soon as possible, so data volumes don't grow without bound
And if you get this code wrong, it's a fast fast path to data corruption.

If I understand correctly, write transactions never erase no-longer-used pages, but only pages abandoned by an *earlier* write transaction if no active readers predate that transaction committing. So an active read transaction, which I assume snapshots the root page and relies on writers to not overwrite the tree it references, prevents writers from reusing pages freed by *all* writes which commit after the read transaction started.

So yeah, long-running read transactions cause written unused data to pile up. And since the PDF says "No compaction or garbage collection phase is ever needed", I suspect Baloo's index file size will *never* decrease, even if data gets freed (eg. by closing a long-running read transaction, excluding folders from indexing, deleting files, or turning off content indexing). This is... suboptimal.
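
Tangentially: one related failure mode LMDB does expose an API for is reader-table slots left behind by crashed processes, which pin old snapshots the same way a long-running reader does. A hedged sketch (real API, illustrative usage; the mdb_stat utility's -r flag dumps the same reader table):

```
#include <lmdb.h>
#include <stdio.h>

/* Clears reader-table slots owned by processes that no longer exist;
 * stale slots pin old page snapshots and keep the map growing. */
void clear_stale_readers(MDB_env *env)
{
    int dead = 0;
    if (mdb_reader_check(env, &dead) == MDB_SUCCESS && dead > 0)
        fprintf(stderr, "cleared %d stale reader slot(s)\n", dead);
}
```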
> > ... I don't know who wrote the corrupted file
> I know there was a flood of "corruption" reports (Bug 389848). This issue
> was found but the fix left the index corrupt and it became normal to
> recommend purging and rebuilding the index (Bug 431664). Yes, still quite a
> while ago and the number of these reports is dropping away but it did
> resurface when people upgraded from Debian 10 to 11 (which was only the end
> of last year)
Reading https://www.openldap.org/lists/openldap-devel/201710/msg00019.html, I'm scared of yet another category of corruption: corrupting in-memory data queued in a write transaction, *before being committed to disk*!

Does baloo_file have any memory corruption bugs overwriting data with a 10-byte stride? I don't know!
> Interesting in that baloo "batches up" its content indexing work (where it
> analyses 40 files at a time and writes the results to the index) however it
> deals with the initial scan of files it needs to index in a single tranche;
> give it a hundred thousand files it needs to index, it will collect the
> information for all of them and write the results to the index in one go.
> This can be pretty horrible (see Bug 394750)
> 
> No reason that this is a cause but it is a behaviour that might raise the
> stakes...
This could be fixed separately I assume.
> > ... evaluate the performance differences
> One of the joys of baloo is it's amazing speed, that you can type a search
> string and see the results refine themselves on screen.
https://github.com/LumoSQL/LumoSQL claims LMDB is still somewhat faster than SQLite's standard engine (though SQLite is catching up). I trust LMDB less to avoid corrupting data though.
Comment 14 nyanpasu64 2022-06-26 19:53:52 UTC
# Identifying the bad table

I did find something interesting. Baloo creates 12 databases (lmdb's name for tables):

- postingdb
- positiondb
- docterms
- docfilenameterms
- docxatrrterms
- idtree
- idfilename
- documenttimedb
- documentdatadb
- indexingleveldb
- failediddb
- mtimedb

As mentioned, I copied my bad Baloo index to the name "data.mdb" so mdb_dump would find it.

I then created a modified build of mdb_dump that skips over corrupted subtrees (logging them) instead of aborting the program (though it fails if the *first* leaf page in the entire database/table is corrupted, which I've not encountered). I've pushed the code to https://codeberg.org/nyanpasu64/lmdb-debug.

- To build this repo, run `make -j(N)` to produce a statically linked `mdb_dump` binary which ignores the system LMDB.
- To scan a single table, log all structural errors found (midway nodes with no children, or leaf nodes with children), and measure the size of the text dump, run:
> .../lmdb/libraries/liblmdb/mdb_dump . -s (NAME) |pv -b>/dev/null
- To scan all tables together, and log each table name and all of its errors, run: 
> .../lmdb/libraries/liblmdb/mdb_dump . -a -f /dev/null
All tables aside from mtimedb have only around 0-5 errors each. There is no clear pattern among the contents of the bad pages; some have 16-byte headers followed by 2-byte-periodic pointers then content (like a normal LMDB page), while one of them (0x5925000) has 10-byte-periodic data.

The last corrupted table ("mtimedb") uses `MDB_INTEGERKEY | MDB_DUPSORT | MDB_DUPFIXED | MDB_INTEGERDUP`, where MDB_DUPSORT triggers a radically different internal codepath in lmdb... and this table has *182 distinct errors*.

Oddly it appears that the mtimedb table has been *entirely* replaced by a corrupted (older?) version of another table (docterms)! I don't know if mtimedb or something else caused the corruption though.

# Inspecting the bad table (mtimedb)

To dump the contents of mtimedb, I ran:

.../lmdb/libraries/liblmdb/mdb_dump . -s mtimedb -f mtimedb

The non-corrupted entries in mtimedb alternate between 8-byte blobs (keys?) and "M" followed by a mimetype interspersed with null bytes (values?). When I run the same code on my *good* Baloo index, mdb_dump's output contains (as expected) 4-byte keys and 8-byte values. On the corrupted index file, `mdb_dump -s mtimedb` almost entirely matches `mdb_dump -s docterms`, except the initial metadata is different (duplicates, dupsort, etc.) and their endings are different. Regular `mdb_dump -s mtimedb` aborts before reaching the end, but after writing 33.0 out of 33.1 MiB (including a bit of different data).

Baloo's source code matches the good index; it treats the "mtimedb" table as mapping from quint32 to 1 or more quint64, with not an "M" or mimetype in sight. The docterms table appears to map from Document::id() to a list of null-separated type-prefixed tags. The exact format of values lives in Baloo's DocTermsCodec, and it *might* be better off normalized and replaced with SQL foreign keys, unless that's too slow to read from.
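
For reference, a sketch of reading that intended layout back through the public cursor API (real LMDB calls; the dbi handle and the quint32 -> list-of-quint64 layout are as described above):

```
#include <lmdb.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Walks all duplicate values (document ids) stored under one mtime key. */
void dump_ids_for_mtime(MDB_txn *txn, MDB_dbi mtimedb, uint32_t mtime)
{
    MDB_cursor *cur;
    MDB_val key = { sizeof mtime, &mtime }, val;
    if (mdb_cursor_open(txn, mtimedb, &cur) != MDB_SUCCESS)
        return;
    if (mdb_cursor_get(cur, &key, &val, MDB_SET_KEY) == MDB_SUCCESS) {
        do {
            uint64_t docId;
            memcpy(&docId, val.mv_data, sizeof docId);  /* copy to avoid alignment traps */
            printf("docId %llu\n", (unsigned long long)docId);
        } while (mdb_cursor_get(cur, &key, &val, MDB_NEXT_DUP) == MDB_SUCCESS);
    }
    mdb_cursor_close(cur);
}
```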

# Now what?

I think a corruption bug (either on-disk, or in baloo_file or baloo_file_extractor or balooctl, possibly caused by misidentifying pages referenced by the *currently written* database tree as free) caused mtimedb's root pointer to point to docterms. It's possible this was caused by simultaneous transactions on a single thread, or passing transactions across threads, though I haven't looked into it. It's also possible that items were somehow added to the free list (mentioned in https://schd.ws/hosted_files/buildstuff14/96/20141120-BuildStuff-Lightning.pdf) despite still being referenced, and were overwritten by new contents.

We know mtimedb is pointing to the wrong table. Why does reading it report corrupted data at the end? I think it's not a result of misinterpreting the docterms database (non-dupsort) as dupsort, because 99% of the database is read properly and is identical to docterms.

- Maybe it was incorrectly pointed to an *old* copy of the docterms database, which was not itself overwritten, but 182 pages near the end have been overwritten by pages of a different format.
- Or it's correctly pointing to what *used* to be the mtimedb root page, but the root page was incorrectly overwritten by a docterms root page.
- Or maybe after the mtimedb pointer was corrupted to treat a non-dupsort database as a dupsort one, mutation operations corrupted the mtimedb tree heavily (and other databases randomly). (Note that docterms itself is uncorrupted; would this have corrupted it too or marked it as freed?) Semi-related: https://github.com/PowerDNS/pdns/issues/8873

Is LMDB designed to reuse the same page from multiple reachable paths (aka currently active parents), forming a DAG rather than a tree? If not, is there a program to verify that isn't happening?

Is it possible to check the current database and see if there are currently any pages reachable from the root, but also present in the "free list"? If that happened, what caused it? Multiple pages owning one page and one parent being freed? Or one parent owning a page, freeing it, but referencing it afterwards?

(At this point, is it worth asking LMDB's author Howard Chu for help?)

## Request for data

Is anyone willing to share their own corrupted database files for me to analyze, so I have more samples of how a database gets corrupted? Note that it will contain possibly-sensitive file paths (and perhaps even contents).

Alternatively, can you build and run my custom lmdb, run `.../lmdb/libraries/liblmdb/mdb_dump . -a -f /dev/null`, and report the errors detected (this does not leak personal information)?
Comment 15 tagwerk19 2022-06-26 22:14:19 UTC
> mdb_dump -a .
The rationale is that being able to dump the database gives you a necessary (although not sufficient) test that the index is OK? Nice ;-)

> I haven't tried modifying LMDB to scan the *entire* database, continuing on
> errors, and logging *all* data inconsistencies. I think that would help gather
> more data to understand what kind of corruption is happening.
There was an effort to write a consistency checker (a "baloodb" tool?). I remember it came with *many* *warnings*. I think it has dropped out of the current distributions, but it rather sounds like it needs a revisit :-/

I see there are bugs resurfacing mentioning MDB_BAD_TXN (Bug 406868), I wonder if these are related...

> So yeah, long-running read transactions cause written unused data to pile up.
> And since the PDF says "No compaction or garbage collection phase is ever
> needed", I suspect Baloo's index file size will *never* decrease, even if data
> gets freed (eg. by closing a long-running read transaction, excluding folders
> from indexing, deleting files, or turning off content indexing). This is...
> suboptimal.
I see the behaviour of baloo grabbing space and not releasing it; the index gradually increases in size over time. I'm not so worried about the disk usage, but "rather sparse" data being pulled into memory is not so good.

There is the option to copy/compress the database:
    mdb_copy -n -c index index.new
Sometimes this does well, sometimes just so-so...

> Reading https://www.openldap.org/lists/openldap-devel/201710/msg00019.html
It *may* be that this is/was the upstream responsible for Bug 389848 as
    https://bugs.openldap.org/show_bug.cgi?id=8756
is referenced.

> Seeking Audacity to offset 93474816, I see data with a periodicity of 10...
You are going too deep for me and I doubt that I'll be able to help much. Let me try the "mdb_dump -a -n index" trick to see if I get any catches though.

It might be worth confirming you hit trouble with the database on an ext4 filesystem (and not BTRFS where I'd want to know that COW is disabled on the directory).
Comment 16 tagwerk19 2022-06-27 08:36:40 UTC
(In reply to tagwerk19 from comment #15)
> There was an effort to write a consistency checker (a "baloodb"? tool). I
> remember it came with *many* *warnings*. I think it has dropped out of the
> current distributions but it rather sounds like it needs a revisit :-/
See
    https://invent.kde.org/frameworks/baloo/-/tree/master/src/tools/experimental/baloodb

(In reply to nyanpasu64 from comment #14)
> # Identifying the bad table
> The last corrupted table ("mtimedb") ... and this table has *182 distinct errors*.
Could easily be from just one transaction. As mentioned above, when baloo starts (or after a balooctl check) it looks for changed files and does a single commit.
Comment 17 nyanpasu64 2022-06-27 10:47:30 UTC
(In reply to tagwerk19 from comment #15)
> > mdb_dump -a .
> The rationale is that being able to dump the database gives you a necessary
> (although not sufficient) test that the index is OK? Nice ;-)
It's fairly effective at detecting structural errors, though it won't detect blocks marked as free on-disk while referenced, or (*if* the database is designed to be a tree) blocks referenced twice. It also doesn't check logical consistency of Baloo's database (though the issue here is corrupted/swapped pages/blocks, not logical inconsistency of data).
> I see there are bugs resurfacing mentioning MDB_BAD_TXN (Bug 406868), I
> wonder if these are related...
I do suspect so.
> I see the behaviour of baloo grabbing space and not releasing it; the index
> gradually increases in size with time. I'm not so worried about the disc
> usage but that "rather sparse" data might be pulled into memory is not so
> good.
> 
> There is the option to copy/compress the database:
>     mdb_copy -n -c index index.new
> Sometime this does well, sometimes just so-so...
Noted.
> It might be worth confirming you hit trouble with the database on an ext4
> filesystem (and not BTRFS where I'd want to know that COW is disabled on the
> directory).
I'm on btrfs, lsattr ~/.local/share/ shows baloo as C (nocow), and lsattr ~/.local/share/baloo shows both index and index.lock as C (nocow).
Comment 18 tagwerk19 2022-06-28 07:11:59 UTC
(In reply to nyanpasu64 from comment #17)
> (In reply to tagwerk19 from comment #15)
> I'm on btrfs ...
It might be worth checking whether you are mounting your "/home" under different device numbers. This can happen with BTRFS with multiple subvols - such as on OpenSUSE. I've not seen it happen with Fedora.

The test was described:
    https://bugs.kde.org/show_bug.cgi?id=402154#c12
and there's a broader overview here:
    https://bugs.kde.org/show_bug.cgi?id=400704#c31

As above - no reason that this is a cause, but it adds stress and unhappiness...
Comment 19 Nicolas Fella 2022-10-28 11:30:13 UTC
*** Bug 461096 has been marked as a duplicate of this bug. ***