Bug 475975

Summary:	QMobipocket fails to decompress Mobipocket version >= 5
Product:	[Frameworks and Libraries] kdegraphics-mobipocket	Reporter:	Piotr Keplicz <keplicz>
Component:	general	Assignee:	Unassigned bugs <unassigned-bugs-null>
Status:	RESOLVED FIXED
Severity:	crash	CC:	goran.grbic, harveyrasp, hoyanmok, kde, lukas, mreich1978, nate, rhodry47, stefan.bruens, tagwerk19
Priority:	HI	Keywords:	drkonqi
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Neon
OS:	Linux
See Also:	https://bugs.kde.org/show_bug.cgi?id=475730 https://bugs.kde.org/show_bug.cgi?id=489275 https://bugs.kde.org/show_bug.cgi?id=335975
Latest Commit:	https://invent.kde.org/graphics/kdegraphics-mobipocket/-/commit/439a01662e72102e114a46d168fbabbb4de04184	Version Fixed/Implemented In:
Sentry Crash Report:

Description Piotr Keplicz 2023-10-22 17:27:13 UTC

Application: baloo_file_extractor (5.111.0)

Qt Version: 5.15.11
Frameworks Version: 5.111.0
Operating System: Linux 6.2.0-35-generic x86_64
Windowing System: X11
Distribution: KDE neon 5.27
DrKonqi: 5.27.8 [KCrashBackend]

-- Information about the crash:
Baloo crashes with Segmentation fault while indexing files. After the process is automatically restarted, it crashes again.

The crash can be reproduced every time.

-- Backtrace:
Application: Baloo File Extractor (baloo_file_extractor), signal: Segmentation fault

[KCrash Handler]
#4  0x00007f441228e368 in QArrayData::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#5  QTypedArrayData<QTextHtmlParserNode>::data (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:208
#6  QTypedArrayData<QTextHtmlParserNode>::begin (this=<optimized out>) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:211
#7  QVector<QTextHtmlParserNode>::realloc (this=this@entry=0x7ffd754dec10, aalloc=6242685, options=...) at ../../include/QtCore/../../src/corelib/tools/qvector.h:710
#8  0x00007f441228e8e0 in QVector<QTextHtmlParserNode>::resize (this=0x7ffd754dec10, asize=6242685) at ../../include/QtCore/../../src/corelib/tools/qvector.h:431
#9  0x00007f4412286399 in QTextHtmlParser::newNode (this=this@entry=0x7ffd754dec10, parent=6242682) at text/qtexthtmlparser.cpp:566
#10 0x00007f4412286624 in QTextHtmlParser::parseCloseTag (this=0x7ffd754dec10) at ../../include/QtCore/../../src/corelib/tools/qarraydata.h:61
#11 0x00007f441228d5f0 in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10) at text/qtexthtmlparser.cpp:640
#12 0x00007f441228d6ad in QTextHtmlParser::parse (this=this@entry=0x7ffd754dec10, text=..., _resourceProvider=0x7f441228d5f0 <QTextHtmlParser::parse()+496>, _resourceProvider@entry=0x7ffd754ded30) at text/qtexthtmlparser.cpp:583
#13 0x00007f44122b7cd3 in QTextHtmlImporter::QTextHtmlImporter (this=this@entry=0x7ffd754dec10, _doc=_doc@entry=0x7ffd754ded30, _html=..., mode=mode@entry=QTextHtmlImporter::ImportToDocument, resourceProvider=resourceProvider@entry=0x0) at text/qtextdocumentfragment.cpp:445
#14 0x00007f441226b3b9 in QTextDocument::setHtml (this=this@entry=0x7ffd754ded30, html=...) at text/qtextdocument.cpp:1273
#15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96
#16 0x000055fd799db0f0 in Baloo::App::index (this=this@entry=0x7ffd754df600, tr=0x7f44080103b0, url=..., id=id@entry=93550975361922770) at ./src/file/extractor/app.cpp:170
#17 0x000055fd799dd0ce in Baloo::App::processNextFile (this=0x7ffd754df600) at ./src/file/extractor/app.cpp:102
#18 0x00007f4411cf8456 in QtPrivate::QSlotObjectBase::call (a=0x7ffd754df190, r=<optimized out>, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#19 QSingleShotTimer::timerEvent (this=0x55fd7b8d1000) at kernel/qtimer.cpp:322
#20 0x00007f4411ce9cff in QObject::event (this=0x55fd7b8d1000, e=0x7ffd754df2d0) at kernel/qobject.cpp:1369
#21 0x00007f4411cbc88a in QCoreApplication::notifyInternal2 (receiver=0x55fd7b8d1000, event=0x7ffd754df2d0) at kernel/qcoreapplication.cpp:1064
#22 0x00007f4411d150ab in QTimerInfoList::activateTimers (this=0x55fd7b88a900) at kernel/qtimerinfo_unix.cpp:643
#23 0x00007f4411d159ac in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:183
#24 0x00007f4410920d3b in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#25 0x00007f4410976258 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#26 0x00007f441091e3e3 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#27 0x00007f4411d15d78 in QEventDispatcherGlib::processEvents (this=0x55fd7b882490, flags=...) at kernel/qeventdispatcher_glib.cpp:423
#28 0x00007f4411cbb1ab in QEventLoop::exec (this=this@entry=0x7ffd754df510, flags=..., flags@entry=...) at ../../include/QtCore/../../src/corelib/global/qflags.h:69
#29 0x00007f4411cc3754 in QCoreApplication::exec () at ../../include/QtCore/../../src/corelib/global/qflags.h:121
#30 0x00007f4412136d50 in QGuiApplication::exec () at kernel/qguiapplication.cpp:1863
#31 0x000055fd799d2f83 in main (argc=<optimized out>, argv=<optimized out>) at ./src/file/extractor/main.cpp:43
[Inferior 1 (process 18590) detached]

Reported using DrKonqi

Comment 1 tagwerk19 2023-10-22 17:55:12 UTC

(In reply to Piotr Keplicz from comment #0)
> #15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96
See Bug 475730...

If you can identify the .mobi file, you could exclude it from indexing (or convert it to ePub?). I have not found a validator for .mobi files.

Comment 2 tagwerk19 2023-10-22 18:15:31 UTC

Might be worth checking that .mobi indexing is working "in general". 

I've just tested the samples on
    https://filesamples.com/formats/mobi
on Neon and Fedora 38. Seems OK.

Comment 3 Piotr Keplicz 2023-10-23 13:45:19 UTC

The offending file is a Polish dictionary: https://xn--zabaaganionemiejsce-8fd.pl/cc-sjp/SJP2-202302181714.mobi

Comment 4 tagwerk19 2023-10-24 06:49:30 UTC

(In reply to Piotr Keplicz from comment #3)
> The offending file is a Polish dictionary:
That's likely to be a challenge :-)

Baloo will try to set up a record in the index for every word in the dictionary, with a link to its location in the file. That will be a BIG transaction. I see baloo_file_extractor start indexing the file but slow to a crawl, due to the "MemoryHigh=512MB" cap on RAM usage in the kde-baloo.service unit file. Swap usage goes up quickly, presumably dirty pages waiting to be committed.

A personal view, I don't think baloo should use swap... 

If I change the systemd limits to:

    MemoryHigh=50%
    MemorySwapMax=0B

and give my test VM 16GB to work in, I see the baloo_file_extractor crash.

Tested on Fedora38 and Neon User. Confirming.

If I try on Neon Unstable, it looks as if baloo skips the content indexing of the file. A

    $ balooshow -x SJP2-202302181714.mobi

just gives...

    141b30ed0da2dd 3977093853 1317680 SJP2-202302181714.mobi [/home/test/Testdir/SJP2-202302181714.mobi]
            Mtime: 1698129246 2023-10-24T08:34:06
            Ctime: 1698129246 2023-10-24T08:34:06

    Internal Info
    File Name Terms: F202302181714 Fmobi Fsjp2
    XAttr Terms:
    Plain Text Terms:
    Property Terms: Mapplication Mebook Mmobipocket Mx T5

Comment 5 tagwerk19 2023-10-25 22:27:39 UTC

It seems possible to convert the SJP2-202302181714.mobi to an epub with Calibre:

$ ebook-convert SJP2-202302181714.mobi SJP2-202302181714.epub --dont-split-on-page-breaks

Conversion options changed from defaults:
dont_split_on_page_breaks: True
1% Converting input to HTML...
InputFormatPlugin: MOBI Input running
on /home/test/Downloads/SJP2-202302181714.mobi
Malformed markup, parsing using html5-parser
Parsing all content...
HTML 5 parsing failed, falling back to older parsers
Forcing index.html into XHTML namespace
Generating default TOC from spine...
34% Running transforms on e-book...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Cleaning up manifest...
Trimming unused files from manifest...
Trimming 'images/00003.jpg' from manifest
Trimming 'images/00001.jpg' from manifest
Creating EPUB Output...
67% Running EPUB Output plug-in
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in index.html...
Found large tree #0
Split into 17 parts
This EPUB file has no Table of Contents. Creating a default TOC
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to /home/test/Downloads/SJP2-202302181714.epub
Output saved to /home/test/Downloads/SJP2-202302181714.epub

Without the "dont-split-on-page-breaks" option, the process got stuck at the 67% mark.

There's a comment about "Malformed markup" and then a "HTML5 parsing failed" but whether these are enough to break Baloo's mobiextractor code...

Nevertheless, Baloo can index the epub; maybe this is good enough.

Comment 6 Piotr Keplicz 2023-10-26 12:19:48 UTC

Thanks for the help :)

Comment 7 Stefan Brüns 2023-11-10 03:47:15 UTC

I have compared the raw html output from QMobipocket and https://github.com/iscc/mobi, and apparently the output from QMobipocket is fairly screwed up.

One likely source for this screwup is the missing treatment of trailing data in each PDB section, which has existed since Mobipocket version 5.

Unfortunately, QMobipocket has been essentially unmaintained for 10 years now ...

Comment 8 Stefan Brüns 2023-11-10 03:58:04 UTC

There is no product for https://invent.kde.org/graphics/kdegraphics-mobipocket, thus assigning to "kde".

Comment 9 Stefan Brüns 2023-11-10 14:13:14 UTC

https://wiki.mobileread.com/wiki/MOBI#Trailing_entries

This is implemented in the last FBreader GPLed sources (now closed source), Calibre, python-mobi and various others.
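
To check whether a particular file carries such trailing entries, the relevant header fields can be read directly. The following is a minimal Python sketch, not code from QMobipocket or any of the projects named above. The function name `read_mobi_header_info` is made up for illustration, and the offsets are assumptions taken from the layout described on the MobileRead wiki page linked above: file version at byte 36 of record 0, and a 16-bit flags field at 0xF2 when the MOBI header length is at least 0xE4.

```python
import struct


def read_mobi_header_info(data: bytes) -> dict:
    """Return the MOBI file version and the trailing-data flags.

    `data` is the raw content of a .mobi file. All offsets below are
    assumptions based on the MobileRead wiki description of the format.
    """
    # PDB header: type and creator ("BOOK" + "MOBI") sit at bytes 60..67.
    if data[60:68] != b"BOOKMOBI":
        raise ValueError("not a Mobipocket PDB file")
    (num_records,) = struct.unpack_from(">H", data, 76)
    if num_records == 0:
        raise ValueError("PDB file has no records")

    # The record list follows the 78-byte PDB header, 8 bytes per entry;
    # the first 4 bytes of each entry hold the record's file offset.
    (rec0_start,) = struct.unpack_from(">L", data, 78)
    rec0_end = len(data)
    if num_records > 1:
        (rec0_end,) = struct.unpack_from(">L", data, 78 + 8)
    rec0 = data[rec0_start:rec0_end]

    # Record 0: 16-byte PalmDOC header, then the MOBI header.
    if rec0[16:20] != b"MOBI":
        raise ValueError("record 0 has no MOBI header")
    header_length, _type, _codepage, _uid, version = struct.unpack_from(
        ">5L", rec0, 20
    )

    # The flags field is only read when the header is long enough to hold it.
    extra_flags = 0
    if header_length >= 0xE4 and len(rec0) >= 0xF4:
        (extra_flags,) = struct.unpack_from(">H", rec0, 0xF2)
    return {"version": version, "extra_flags": extra_flags}
```

Calling it as `read_mobi_header_info(open("book.mobi", "rb").read())` on a file with version 5 or higher and non-zero `extra_flags` would suggest that its text records end in data that has to be stripped before decompression.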

Comment 10 Stefan Brüns 2023-11-10 14:18:25 UTC

*** Bug 475730 has been marked as a duplicate of this bug. ***

Comment 11 Nate Graham 2024-02-16 21:12:29 UTC

Created a Bugzilla product for it; moving there.

Comment 12 tagwerk19 2024-06-21 15:59:54 UTC

*** Bug 488587 has been marked as a duplicate of this bug. ***

Comment 13 tagwerk19 2024-06-21 16:12:40 UTC

*** Bug 487481 has been marked as a duplicate of this bug. ***

Comment 14 tagwerk19 2024-06-21 16:17:53 UTC

*** Bug 486853 has been marked as a duplicate of this bug. ***

Comment 15 tagwerk19 2024-07-02 16:45:31 UTC

*** Bug 489612 has been marked as a duplicate of this bug. ***

Comment 16 tagwerk19 2024-07-13 15:50:54 UTC

*** Bug 490210 has been marked as a duplicate of this bug. ***

Comment 17 tagwerk19 2024-07-18 11:50:04 UTC

*** Bug 490446 has been marked as a duplicate of this bug. ***

Comment 18 Stefan Brüns 2025-02-23 03:55:39 UTC

Git commit a188b893654fe5f88b1ebab7e8341ceb181f6dc9 by Stefan Brüns.
Committed on 23/02/2025 at 03:54.
Pushed by bruns into branch 'disable_mobipocket_text'.

[MobiExtractor] Disable buggy text extraction by default

The text extraction in mobiextractor is extremely buggy, and causes
a lot of bug reports for baloo (which then gets blamed for its
"buggyness" when calling third-party code).

QMobipocket lacks support for any halfway current mobipocket version
(last supported: 4, current: 8), and has no testsuite.

Make this opt-in ("ENABLE_MOBIPOCKET_TEXT_EXTRACTION") until the bugs
in QMobiPocket gets fixed.

SENTRY: BALOO-2N5
SENTRY: BALOO-426
SENTRY: BALOO-33
// use `stack.filename is mobipocket.cpp` for more
Related: bug 482420, bug 489275

M  +1    -0    CMakeLists.txt
M  +3    -0    src/extractors/CMakeLists.txt
M  +2    -1    src/extractors/mobiextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/a188b893654fe5f88b1ebab7e8341ceb181f6dc9

Comment 19 Stefan Brüns 2025-03-15 12:19:49 UTC

Git commit 8bd1e61cca1e07a0ffce7ff79b861e2872662e6d by Stefan Brüns.
Committed on 15/03/2025 at 12:16.
Pushed by bruns into branch 'master'.

[MobiExtractor] Disable buggy text extraction by default

The text extraction in mobiextractor is extremely buggy, and causes
a lot of bug reports for baloo (which then gets blamed for its
"buggyness" when calling third-party code).

QMobipocket lacks support for any halfway current mobipocket version
(last supported: 4, current: 8), and has no testsuite.

Make this opt-in ("ENABLE_MOBIPOCKET_TEXT_EXTRACTION") until the bugs
in QMobiPocket gets fixed.

SENTRY: BALOO-2N5
SENTRY: BALOO-426
SENTRY: BALOO-33
// use `stack.filename is mobipocket.cpp` for more
Related: bug 482420, bug 489275

M  +1    -0    CMakeLists.txt
M  +3    -0    src/extractors/CMakeLists.txt
M  +2    -1    src/extractors/mobiextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/8bd1e61cca1e07a0ffce7ff79b861e2872662e6d

Comment 20 Bug Janitor Service 2025-06-01 23:32:01 UTC

A possibly relevant merge request was started @ https://invent.kde.org/graphics/kdegraphics-mobipocket/-/merge_requests/35

Comment 21 Stefan Brüns 2025-06-09 08:47:50 UTC

Git commit 439a01662e72102e114a46d168fbabbb4de04184 by Stefan Brüns.
Committed on 07/06/2025 at 15:19.
Pushed by bruns into branch 'master'.

Handle trailing data entries correctly

Text records may contain extra auxiliary data which should not be fed
to the decompressor.

The existence of such data is signalled by the `extraflags` header field,
and each set bit signals the corresponding extra data which will be
present in all text records.

The entries can be decoded (or removed) by reading the record from the
back. When an entry is present, its size will be at the very end of the
record, preceded by the actual data.
Related: bug 482420, bug 489275

M  +0    -1    autotests/mobipockettest.cpp
M  +62   -3    lib/mobipocket.cpp

https://invent.kde.org/graphics/kdegraphics-mobipocket/-/commit/439a01662e72102e114a46d168fbabbb4de04184