Application: baloo_file_extractor (5.111.0)

Qt Version: 5.15.11
Frameworks Version: 5.111.0
Operating System: Linux 6.2.0-35-generic x86_64
Windowing System: X11
Distribution: KDE neon 5.27
DrKonqi: 5.27.8 [KCrashBackend]

-- Information about the crash:
Baloo crashes with Segmentation fault while indexing files. After the process is automatically restarted, it crashes again.

The crash can be reproduced every time.

-- Backtrace:
Application: Wydobywanie z plików dla Baloo (baloo_file_extractor), signal: Segmentation fault

[KCrash Handler]
#4 0x00007f441228e368 in QArrayData::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#5 QTypedArrayData<QTextHtmlParserNode>::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:208
#6 QTypedArrayData<QTextHtmlParserNode>::begin (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:211
#7 QVector<QTextHtmlParserNode>::realloc (this=this@entry=0x7ffd754dec10, aalloc=6242685, options=...) at ../../include/QtCore/../../src/corelib/tools/qvector.h:710
#8 0x00007f441228e8e0 in QVector<QTextHtmlParserNode>::resize (this=0x7ffd754dec10, asize=6242685) at ../../include/QtCore/../../src/corelib/tools/qvector.h:431
#9 0x00007f4412286399 in QTextHtmlParser::newNode (this=this@entry=0x7ffd754dec10, parent=6242682) at text/qtexthtmlparser.cpp:566
#10 0x00007f4412286624 in QTextHtmlParser::parseCloseTag (this=0x7ffd754dec10) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#11 0x00007f441228d5f0 in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10) at text/qtexthtmlparser.cpp:640
#12 0x00007f441228d6ad in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10, text=..., _resourceProvider=0x7f441228d5f0 <QTextHtmlParser::parse()+496>, _resourceProvider@entry=0x7ffd754ded30) at text/qtexthtmlparser.cpp:583
#13 0x00007f44122b7cd3 in QTextHtmlImporter::QTextHtmlImporter (this=this@entry=0x7ffd754dec10, _doc=_doc@entry=0x7ffd754ded30, _html=..., mode=mode@entry=QTextHtmlImporter::ImportToDocument, resourceProvider=resourceProvider@entry=0x0) at text/qtextdocumentfragment.cpp:445
#14 0x00007f441226b3b9 in QTextDocument::setHtml (this=this@entry=0x7ffd754ded30, html=...) at text/qtextdocument.cpp:1273
#15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96
#16 0x000055fd799db0f0 in Baloo::App::index (this=this@entry=0x7ffd754df600, tr=0x7f44080103b0, url=..., id=id@entry=93550975361922770) at ./src/file/extractor/app.cpp:170
#17 0x000055fd799dd0ce in Baloo::App::processNextFile (this=0x7ffd754df600) at ./src/file/extractor/app.cpp:102
#18 0x00007f4411cf8456 in QtPrivate::QSlotObjectBase::call (a=0x7ffd754df190, r=<optimized out>, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#19 QSingleShotTimer::timerEvent (this=0x55fd7b8d1000) at kernel/qtimer.cpp:322
#20 0x00007f4411ce9cff in QObject::event (this=0x55fd7b8d1000, e=0x7ffd754df2d0) at kernel/qobject.cpp:1369
#21 0x00007f4411cbc88a in QCoreApplication::notifyInternal2 (receiver=0x55fd7b8d1000, event=0x7ffd754df2d0) at kernel/qcoreapplication.cpp:1064
#22 0x00007f4411d150ab in QTimerInfoList::activateTimers (this=0x55fd7b88a900) at kernel/qtimerinfo_unix.cpp:643
#23 0x00007f4411d159ac in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:183
#24 0x00007f4410920d3b in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#25 0x00007f4410976258 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#26 0x00007f441091e3e3 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#27 0x00007f4411d15d78 in QEventDispatcherGlib::processEvents (this=0x55fd7b882490, flags=...) at kernel/qeventdispatcher_glib.cpp:423
#28 0x00007f4411cbb1ab in QEventLoop::exec (this=this@entry=0x7ffd754df510, flags=..., flags@entry=...) at ../../include/QtCore/../../src/corelib/global/qflags.h:69
#29 0x00007f4411cc3754 in QCoreApplication::exec () at ../../include/QtCore/../../src/corelib/global/qflags.h:121
#30 0x00007f4412136d50 in QGuiApplication::exec () at kernel/qguiapplication.cpp:1863
#31 0x000055fd799d2f83 in main (argc=<optimized out>, argv=<optimized out>) at ./src/file/extractor/main.cpp:43
[Inferior 1 (process 18590) detached]

Reported using DrKonqi
(In reply to Piotr Keplicz from comment #0)
> #15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96

See Bug 475730...

If you can identify the .mobi file, maybe exclude it from indexing (or convert it to ePub?). I have not found a validator for .mobi files.
Might be worth checking that .mobi indexing is working "in general". I've just tested the samples on https://filesamples.com/formats/mobi on Neon and Fedora 38. Seems OK.
The offending file is a Polish dictionary: https://xn--zabaaganionemiejsce-8fd.pl/cc-sjp/SJP2-202302181714.mobi
(In reply to Piotr Keplicz from comment #3)
> The offending file is a Polish dictionary:

That's likely to be a challenge :-)

Baloo will try to set up a record in the index for every word in the dictionary, with a link to its location in the file. That will be a BIG transaction.

I see baloo_file_extractor start indexing the file but slow to a crawl, due to the "MemoryHigh=512MB" cap on RAM usage in the kde-baloo.service unit file. Swap usage goes up quickly, presumably dirty pages waiting to be committed. A personal view, I don't think baloo should use swap...

If I change the systemd limits to:

    MemoryHigh=50%
    MemorySwapMax=0B

and give my test VM 16GB to work in, I see the baloo_file_extractor crash. Tested on Fedora38 and Neon User.

Confirming.

If I try on Neon Unstable, it looks as if baloo skips the content indexing of the file. A

    $ balooshow -x SJP2-202302181714.mobi

just gives...

    141b30ed0da2dd 3977093853 1317680 SJP2-202302181714.mobi [/home/test/Testdir/SJP2-202302181714.mobi]
    Mtime: 1698129246 2023-10-24T08:34:06
    Ctime: 1698129246 2023-10-24T08:34:06

    Internal Info
    File Name Terms: F202302181714 Fmobi Fsjp2
    XAttr Terms:
    Plain Text Terms:
    Property Terms: Mapplication Mebook Mmobipocket Mx T5
It seems possible to convert the SJP2-202302181714.mobi to an epub with Calibre:

    $ ebook-convert SJP2-202302181714.mobi SJP2-202302181714.epub --dont-split-on-page-breaks
    Conversion options changed from defaults:
      dont_split_on_page_breaks: True
    1% Converting input to HTML...
    InputFormatPlugin: MOBI Input running
    on /home/test/Downloads/SJP2-202302181714.mobi
    Malformed markup, parsing using html5-parser
    Parsing all content...
    HTML 5 parsing failed, falling back to older parsers
    Forcing index.html into XHTML namespace
    Generating default TOC from spine...
    34% Running transforms on e-book...
    Merging user specified metadata...
    Detecting structure...
    Auto generated TOC with 0 entries.
    Flattening CSS and remapping font sizes...
    Source base font size is 12.00000pt
    Removing fake margins...
    Cleaning up manifest...
    Trimming unused files from manifest...
    Trimming 'images/00003.jpg' from manifest
    Trimming 'images/00001.jpg' from manifest
    Creating EPUB Output...
    67% Running EPUB Output plug-in
    Splitting markup on page breaks and flow limits, if any...
    Looking for large trees in index.html...
    Found large tree #0
    Split into 17 parts
    This EPUB file has no Table of Contents. Creating a default TOC
    The cover image has an id != "cover". Renaming to work around bug in Nook Color
    EPUB output written to /home/test/Downloads/SJP2-202302181714.epub
    Output saved to /home/test/Downloads/SJP2-202302181714.epub

Without the "dont-split-on-page-breaks" option, the process stuck at the 67% mark.

There's a comment about "Malformed markup" and then a "HTML5 parsing failed" but whether these are enough to break Baloo's mobiextractor code...

Nevertheless Baloo can index the epub, maybe this is good enough.
Thanks for help :)
I have compared the raw HTML output from QMobipocket with that of https://github.com/iscc/mobi, and the output from QMobipocket is apparently fairly screwed up.

One likely cause is the missing treatment of trailing data in each PDB section, which has existed since Mobipocket version 5.

Unfortunately, QMobipocket has been essentially unmaintained for 10 years now ...
There is no product for https://invent.kde.org/graphics/kdegraphics-mobipocket, thus assigning to "kde".
https://wiki.mobileread.com/wiki/MOBI#Trailing_entries

This is implemented in the last GPL-licensed FBReader sources (FBReader is now closed source), Calibre, python-mobi and various others.
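
For illustration, here is a minimal Python sketch of the trailing-entry handling described on the MobileRead page above, loosely following the approach Calibre takes. It is not QMobipocket code and has not been run against the dictionary from this report. The function names are made up for this example, and `extra_flags` is assumed to have been read already from the "extra record data flags" field of the MOBI header.

```python
def trailing_entry_size(record: bytes, end: int) -> int:
    """Decode the backward-encoded size whose last byte is record[end - 1].

    The size is stored 7 bits per byte and read from the end of the record
    towards the front; the byte with the high bit set terminates it.
    """
    size, shift = 0, 0
    while end > 0:
        byte = record[end - 1]
        size |= (byte & 0x7F) << shift
        shift += 7
        end -= 1
        if byte & 0x80 or shift >= 28:
            break
    return size


def trailing_data_size(record: bytes, extra_flags: int) -> int:
    """Return how many bytes at the end of a text record are not text."""
    total = 0
    # Bits 1 and up each announce one trailing entry that carries its own size.
    flags = extra_flags >> 1
    while flags:
        if flags & 1:
            total += trailing_entry_size(record, len(record) - total)
        flags >>= 1
    # Bit 0 announces multibyte-overlap bytes: the low two bits of the
    # last remaining byte give the count, plus one for that byte itself.
    if extra_flags & 1:
        pos = len(record) - total - 1
        if pos >= 0:
            total += (record[pos] & 0x03) + 1
    return total


def strip_trailing_data(record: bytes, extra_flags: int) -> bytes:
    """Return the record without its trailing entries.

    A size larger than the record (malformed input) yields an empty result.
    """
    size = trailing_data_size(record, extra_flags)
    return record[:max(len(record) - size, 0)]
```

An extractor would apply `strip_trailing_data` to each text record before decompressing it and concatenating the results, so that the trailing bytes do not end up in the HTML passed to the parser.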
*** Bug 475730 has been marked as a duplicate of this bug. ***
Created a Bugzilla product for it; moving there.
*** Bug 488587 has been marked as a duplicate of this bug. ***
*** Bug 487481 has been marked as a duplicate of this bug. ***
*** Bug 486853 has been marked as a duplicate of this bug. ***
*** Bug 489612 has been marked as a duplicate of this bug. ***
*** Bug 490210 has been marked as a duplicate of this bug. ***
*** Bug 490446 has been marked as a duplicate of this bug. ***