Bug 475975

Summary: QMobipocket fails to decompress Mobipocket version >= 5
Product: [Frameworks and Libraries] kdegraphics-mobipocket
Reporter: Piotr Keplicz <keplicz>
Component: general
Assignee: Unassigned bugs mailing-list <unassigned-bugs>
Status: CONFIRMED
Severity: crash
CC: goran.grbic, harveyrasp, hoyanmok, kde, lukas, mreich1978, nate, rhodry47, stefan.bruens, tagwerk19
Priority: HI
Keywords: drkonqi
Version: unspecified   
Target Milestone: ---   
Platform: Neon   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=475730
https://bugs.kde.org/show_bug.cgi?id=489275
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Piotr Keplicz 2023-10-22 17:27:13 UTC
Application: baloo_file_extractor (5.111.0)

Qt Version: 5.15.11
Frameworks Version: 5.111.0
Operating System: Linux 6.2.0-35-generic x86_64
Windowing System: X11
Distribution: KDE neon 5.27
DrKonqi: 5.27.8 [KCrashBackend]

-- Information about the crash:
Baloo crashes with a segmentation fault while indexing files. After the process is automatically restarted, it crashes again.

The crash can be reproduced every time.

-- Backtrace:
Application: Baloo File Extractor (baloo_file_extractor), signal: Segmentation fault

[KCrash Handler]
#4  0x00007f441228e368 in QArrayData::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#5  QTypedArrayData<QTextHtmlParserNode>::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:208
#6  QTypedArrayData<QTextHtmlParserNode>::begin (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:211
#7  QVector<QTextHtmlParserNode>::realloc (this=this@entry=0x7ffd754dec10, aalloc=6242685, options=...) at ../../include/QtCore/../../src/corelib/tools/qvector.h:710
#8  0x00007f441228e8e0 in QVector<QTextHtmlParserNode>::resize (this=0x7ffd754dec10, asize=6242685) at ../../include/QtCore/../../src/corelib/tools/qvector.h:431
#9  0x00007f4412286399 in QTextHtmlParser::newNode (this=this@entry=0x7ffd754dec10, parent=6242682) at text/qtexthtmlparser.cpp:566
#10 0x00007f4412286624 in QTextHtmlParser::parseCloseTag (this=0x7ffd754dec10) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#11 0x00007f441228d5f0 in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10) at text/qtexthtmlparser.cpp:640
#12 0x00007f441228d6ad in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10, text=..., _resourceProvider=0x7f441228d5f0 <QTextHtmlParser::parse()+496>, _resourceProvider@entry=0x7ffd754ded30) at text/qtexthtmlparser.cpp:583
#13 0x00007f44122b7cd3 in QTextHtmlImporter::QTextHtmlImporter (this=this@entry=0x7ffd754dec10, _doc=_doc@entry=0x7ffd754ded30, _html=..., mode=mode@entry=QTextHtmlImporter::ImportToDocument, resourceProvider=resourceProvider@entry=0x0) at text/qtextdocumentfragment.cpp:445
#14 0x00007f441226b3b9 in QTextDocument::setHtml (this=this@entry=0x7ffd754ded30, html=...) at text/qtextdocument.cpp:1273
#15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96
#16 0x000055fd799db0f0 in Baloo::App::index (this=this@entry=0x7ffd754df600, tr=0x7f44080103b0, url=..., id=id@entry=93550975361922770) at ./src/file/extractor/app.cpp:170
#17 0x000055fd799dd0ce in Baloo::App::processNextFile (this=0x7ffd754df600) at ./src/file/extractor/app.cpp:102
#18 0x00007f4411cf8456 in QtPrivate::QSlotObjectBase::call (a=0x7ffd754df190, r=<optimized out>, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#19 QSingleShotTimer::timerEvent (this=0x55fd7b8d1000) at kernel/qtimer.cpp:322
#20 0x00007f4411ce9cff in QObject::event (this=0x55fd7b8d1000, e=0x7ffd754df2d0) at kernel/qobject.cpp:1369
#21 0x00007f4411cbc88a in QCoreApplication::notifyInternal2 (receiver=0x55fd7b8d1000, event=0x7ffd754df2d0) at kernel/qcoreapplication.cpp:1064
#22 0x00007f4411d150ab in QTimerInfoList::activateTimers (this=0x55fd7b88a900) at kernel/qtimerinfo_unix.cpp:643
#23 0x00007f4411d159ac in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:183
#24 0x00007f4410920d3b in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#25 0x00007f4410976258 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#26 0x00007f441091e3e3 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#27 0x00007f4411d15d78 in QEventDispatcherGlib::processEvents (this=0x55fd7b882490, flags=...) at kernel/qeventdispatcher_glib.cpp:423
#28 0x00007f4411cbb1ab in QEventLoop::exec (this=this@entry=0x7ffd754df510, flags=..., flags@entry=...) at ../../include/QtCore/../../src/corelib/global/qflags.h:69
#29 0x00007f4411cc3754 in QCoreApplication::exec () at ../../include/QtCore/../../src/corelib/global/qflags.h:121
#30 0x00007f4412136d50 in QGuiApplication::exec () at kernel/qguiapplication.cpp:1863
#31 0x000055fd799d2f83 in main (argc=<optimized out>, argv=<optimized out>) at ./src/file/extractor/main.cpp:43
[Inferior 1 (process 18590) detached]

Reported using DrKonqi
Comment 1 tagwerk19 2023-10-22 17:55:12 UTC
(In reply to Piotr Keplicz from comment #0)
> #15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96
See Bug 475730...

If you can identify the .mobi file, you could exclude it from indexing (or convert it to ePub?). I have not found a validator for .mobi files.
Comment 2 tagwerk19 2023-10-22 18:15:31 UTC
It might be worth checking that .mobi indexing is working "in general".

I've just tested the samples at
    https://filesamples.com/formats/mobi
on Neon and Fedora 38. They seem OK.
Comment 3 Piotr Keplicz 2023-10-23 13:45:19 UTC
The offending file is a Polish dictionary: https://xn--zabaaganionemiejsce-8fd.pl/cc-sjp/SJP2-202302181714.mobi
Comment 4 tagwerk19 2023-10-24 06:49:30 UTC
(In reply to Piotr Keplicz from comment #3)
> The offending file is a Polish dictionary:
That's likely to be a challenge :-)

Baloo will try to set up a record in the index for every word in the dictionary, with a link to its location in the file. That will be a BIG transaction. I see baloo_file_extractor start indexing the file but then slow to a crawl because of the "MemoryHigh=512MB" cap on RAM usage in the kde-baloo.service unit file. Swap usage goes up quickly, presumably from dirty pages waiting to be committed.

A personal view: I don't think Baloo should use swap...

If I change the systemd limits to:

    MemoryHigh=50%
    MemorySwapMax=0B

and give my test VM 16 GB to work with, I see baloo_file_extractor crash.

Tested on Fedora 38 and Neon User. Confirming.

If I try on Neon Unstable, it looks as if Baloo skips content indexing of the file. Running

    $ balooshow -x SJP2-202302181714.mobi

just gives...

    141b30ed0da2dd 3977093853 1317680 SJP2-202302181714.mobi [/home/test/Testdir/SJP2-202302181714.mobi]
            Mtime: 1698129246 2023-10-24T08:34:06
            Ctime: 1698129246 2023-10-24T08:34:06

    Internal Info
    File Name Terms: F202302181714 Fmobi Fsjp2
    XAttr Terms:
    Plain Text Terms:
    Property Terms: Mapplication Mebook Mmobipocket Mx T5
Comment 5 tagwerk19 2023-10-25 22:27:39 UTC
It seems possible to convert the SJP2-202302181714.mobi to an epub with Calibre:

    $ ebook-convert SJP2-202302181714.mobi SJP2-202302181714.epub --dont-split-on-page-breaks

    Conversion options changed from defaults:
      dont_split_on_page_breaks: True
    1% Converting input to HTML...
    InputFormatPlugin: MOBI Input running
    on /home/test/Downloads/SJP2-202302181714.mobi
    Malformed markup, parsing using html5-parser
    Parsing all content...
    HTML 5 parsing failed, falling back to older parsers
    Forcing index.html into XHTML namespace
    Generating default TOC from spine...
    34% Running transforms on e-book...
    Merging user specified metadata...
    Detecting structure...
    Auto generated TOC with 0 entries.
    Flattening CSS and remapping font sizes...
    Source base font size is 12.00000pt
    Removing fake margins...
    Cleaning up manifest...
    Trimming unused files from manifest...
    Trimming 'images/00003.jpg' from manifest
    Trimming 'images/00001.jpg' from manifest
    Creating EPUB Output...
    67% Running EPUB Output plug-in
    Splitting markup on page breaks and flow limits, if any...
            Looking for large trees in index.html...
            Found large tree #0
            Split into 17 parts
    This EPUB file has no Table of Contents. Creating a default TOC
    The cover image has an id != "cover". Renaming to work around bug in Nook Color
    EPUB output written to /home/test/Downloads/SJP2-202302181714.epub
    Output saved to   /home/test/Downloads/SJP2-202302181714.epub

Without the "--dont-split-on-page-breaks" option, the process got stuck at the 67% mark.

There's a message about "Malformed markup" and then "HTML 5 parsing failed", but whether these are enough to break Baloo's mobiextractor code...

Nevertheless, Baloo can index the epub; maybe this is good enough.
Comment 6 Piotr Keplicz 2023-10-26 12:19:48 UTC
Thanks for the help :)
Comment 7 Stefan Brüns 2023-11-10 03:47:15 UTC
I have compared the raw HTML output from QMobipocket with that from https://github.com/iscc/mobi, and apparently the output from QMobipocket is fairly screwed up.

One likely source for this screwup is the missing handling of trailing data at the end of each PDB section, which has existed since Mobipocket version 5.
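
For illustration, here is a minimal Python sketch of how a reader might detect whether such trailing data is present. It assumes the header layout described on the MobileRead wiki: the MOBI header starts at offset 16 of PDB record 0, with the header length at offset 20, the file version at offset 36, and the low 16 bits of the "extra record data flags" at offset 0xF2. The function name `read_extra_flags` is hypothetical and this is not QMobipocket code.

```python
import struct

MOBI_MAGIC_OFFSET = 16     # the MOBI header follows the 16-byte PalmDOC header
HEADER_LENGTH_OFFSET = 20  # 4 bytes, big-endian
VERSION_OFFSET = 36        # 4 bytes, big-endian
EXTRA_FLAGS_OFFSET = 0xF2  # 2 bytes, big-endian (assumed layout, see lead-in)


def read_extra_flags(record0: bytes) -> int:
    """Return the extra record data flags, or 0 if the header has none."""
    if len(record0) < VERSION_OFFSET + 4:
        return 0
    if record0[MOBI_MAGIC_OFFSET:MOBI_MAGIC_OFFSET + 4] != b"MOBI":
        return 0
    (header_length,) = struct.unpack_from(">L", record0, HEADER_LENGTH_OFFSET)
    (version,) = struct.unpack_from(">L", record0, VERSION_OFFSET)
    # Older files (version < 5 or a short header) carry no trailing entries.
    if version < 5 or header_length < 0xE4:
        return 0
    if len(record0) < EXTRA_FLAGS_OFFSET + 2:
        return 0
    (flags,) = struct.unpack_from(">H", record0, EXTRA_FLAGS_OFFSET)
    return flags
```

A non-zero result means every text record ends with extra bytes that have to be removed before decompression.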

Unfortunately, QMobipocket has been essentially unmaintained for 10 years now ...
Comment 8 Stefan Brüns 2023-11-10 03:58:04 UTC
There is no product for https://invent.kde.org/graphics/kdegraphics-mobipocket, thus assigning to "kde".
Comment 9 Stefan Brüns 2023-11-10 14:13:14 UTC
https://wiki.mobileread.com/wiki/MOBI#Trailing_entries

This is implemented in the last GPL-licensed FBReader sources (FBReader is now closed source), Calibre, python-mobi and various others.
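
As a rough sketch of what that handling looks like, the Python below strips trailing entries from one text record before it would be passed to the decompressor. It follows the scheme described on the wiki page as I understand it: each set bit above bit 0 of the 16-bit extra data flags marks one trailing entry whose total size is stored as a backward-readable variable-width integer at the end of the record, and bit 0 marks multibyte overlap bytes whose count sits in the low two bits of the last remaining byte. The function names and the `extra_flags` parameter are illustrative assumptions, not taken from any of the projects named above.

```python
def trailing_entry_size(data: bytes) -> int:
    """Decode the variable-width integer stored at the end of `data`.

    At most four bytes are used, seven bits each; the byte with the
    high bit set marks the start of the integer.
    """
    size = 0
    for byte in data[-4:]:
        if byte & 0x80:
            size = 0  # start marker: discard anything seen before it
        size = (size << 7) | (byte & 0x7F)
    return size


def strip_trailing_entries(record: bytes, extra_flags: int) -> bytes:
    """Return `record` without its trailing entries."""
    flags = extra_flags >> 1
    while flags:
        if flags & 1:
            size = trailing_entry_size(record)
            # The size counts its own bytes, so 0 or an oversize value is invalid.
            if size == 0 or size > len(record):
                raise ValueError("corrupt trailing entry size")
            record = record[:-size]
        flags >>= 1
    if extra_flags & 1:
        if not record:
            raise ValueError("record too short for multibyte entry")
        # Low two bits give the number of overlap bytes; +1 for the count byte.
        size = (record[-1] & 0x3) + 1
        if size > len(record):
            raise ValueError("corrupt multibyte entry size")
        record = record[:-size]
    return record
```

The bounds checks matter here: a reader that trusts these sizes blindly on a damaged file can end up handing garbage to the decompressor and the HTML parser.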
Comment 10 Stefan Brüns 2023-11-10 14:18:25 UTC
*** Bug 475730 has been marked as a duplicate of this bug. ***
Comment 11 Nate Graham 2024-02-16 21:12:29 UTC
Created a Bugzilla product for it; moving there.
Comment 12 tagwerk19 2024-06-21 15:59:54 UTC
*** Bug 488587 has been marked as a duplicate of this bug. ***
Comment 13 tagwerk19 2024-06-21 16:12:40 UTC
*** Bug 487481 has been marked as a duplicate of this bug. ***
Comment 14 tagwerk19 2024-06-21 16:17:53 UTC
*** Bug 486853 has been marked as a duplicate of this bug. ***
Comment 15 tagwerk19 2024-07-02 16:45:31 UTC
*** Bug 489612 has been marked as a duplicate of this bug. ***
Comment 16 tagwerk19 2024-07-13 15:50:54 UTC
*** Bug 490210 has been marked as a duplicate of this bug. ***
Comment 17 tagwerk19 2024-07-18 11:50:04 UTC
*** Bug 490446 has been marked as a duplicate of this bug. ***