Summary: | QMobipocket fails to decompress Mobipocket version >= 5 | ||
---|---|---|---|
Product: | [Frameworks and Libraries] kdegraphics-mobipocket | Reporter: | Piotr Keplicz <keplicz> |
Component: | general | Assignee: | Unassigned bugs mailing-list <unassigned-bugs> |
Status: | CONFIRMED --- | ||
Severity: | crash | CC: | goran.grbic, harveyrasp, hoyanmok, kde, lukas, mreich1978, nate, rhodry47, stefan.bruens, tagwerk19 |
Priority: | HI | Keywords: | drkonqi |
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Neon | ||
OS: | Linux | ||
See Also: | https://bugs.kde.org/show_bug.cgi?id=475730 https://bugs.kde.org/show_bug.cgi?id=489275 | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: |
Description
Piotr Keplicz
2023-10-22 17:27:13 UTC
(In reply to Piotr Keplicz from comment #0)
> #15 0x00007f440c5bfb57 in KFileMetaData::MobiExtractor::extract (this=<optimized out>, result=0x7ffd754def70) at ./src/extractors/mobiextractor.cpp:96

See Bug 475730...

If you can identify the .mobi file, maybe exclude it from indexing (or convert to ePub?). Not found a validator for .mobi

Might be worth checking that .mobi indexing is working "in general". I've just tested the samples on https://filesamples.com/formats/mobi on Neon and Fedora 38. Seems OK.

The offending file is a Polish dictionary: https://xn--zabaaganionemiejsce-8fd.pl/cc-sjp/SJP2-202302181714.mobi

(In reply to Piotr Keplicz from comment #3)
> The offending file is a Polish dictionary:

That's likely to be a challenge :-)

Baloo will try to set up a record in the index for every word in the dictionary, with a link to its location in the file. That will be a BIG transaction.

I see baloo_file_extractor start indexing the file but slow to a crawl, due to the "MemoryHigh=512MB" cap on RAM usage in the kde-baloo.service unit file. Swap usage goes up quickly, presumably dirty pages waiting to be committed. A personal view, I don't think baloo should use swap...

If I change the systemd limits to:

    MemoryHigh=50%
    MemorySwapMax=0B

and give my test VM 16GB to work in, I see the baloo_file_extractor crash. Tested on Fedora38 and Neon User.

Confirming.

If I try on Neon Unstable, it looks as if baloo skips the content indexing of the file. A

    $ balooshow -x SJP2-202302181714.mobi

just gives...
    141b30ed0da2dd 3977093853 1317680 SJP2-202302181714.mobi [/home/test/Testdir/SJP2-202302181714.mobi]
    Mtime: 1698129246 2023-10-24T08:34:06
    Ctime: 1698129246 2023-10-24T08:34:06
    Internal Info
    File Name Terms: F202302181714 Fmobi Fsjp2
    XAttr Terms:
    Plain Text Terms:
    Property Terms: Mapplication Mebook Mmobipocket Mx T5

It seems possible to convert the SJP2-202302181714.mobi to an epub with Calibre:

    $ ebook-convert SJP2-202302181714.mobi SJP2-202302181714.epub --dont-split-on-page-breaks
    Conversion options changed from defaults:
    dont_split_on_page_breaks: True
    1% Converting input to HTML...
    InputFormatPlugin: MOBI Input running
    on /home/test/Downloads/SJP2-202302181714.mobi
    Malformed markup, parsing using html5-parser
    Parsing all content...
    HTML 5 parsing failed, falling back to older parsers
    Forcing index.html into XHTML namespace
    Generating default TOC from spine...
    34% Running transforms on e-book...
    Merging user specified metadata...
    Detecting structure...
    Auto generated TOC with 0 entries.
    Flattening CSS and remapping font sizes...
    Source base font size is 12.00000pt
    Removing fake margins...
    Cleaning up manifest...
    Trimming unused files from manifest...
    Trimming 'images/00003.jpg' from manifest
    Trimming 'images/00001.jpg' from manifest
    Creating EPUB Output...
    67% Running EPUB Output plug-in
    Splitting markup on page breaks and flow limits, if any...
    Looking for large trees in index.html...
    Found large tree #0
    Split into 17 parts
    This EPUB file has no Table of Contents. Creating a default TOC
    The cover image has an id != "cover". Renaming to work around bug in Nook Color
    EPUB output written to /home/test/Downloads/SJP2-202302181714.epub
    Output saved to /home/test/Downloads/SJP2-202302181714.epub

Without the "dont-split-on-page-breaks" option, the process stuck at the 67% mark.

There's a comment about "Malformed markup" and then a "HTML5 parsing failed" but whether these are enough to break Baloo's mobiextractor code...
Nevertheless Baloo can index the epub, maybe this is good enough.

Thanks for help :)

I have compared the raw html output from QMobipocket and https://github.com/iscc/mobi, and apparently the output from QMobipocket is fairly screwed up.

One likely source for this screwup is the missing treatment of trailing data in each PDB section, which exists since Mobipocket version 5.

Unfortunately, QMobipocket is essentially unmaintained for 10 years now ...

There is no product for https://invent.kde.org/graphics/kdegraphics-mobipocket, thus assigning to "kde".

https://wiki.mobileread.com/wiki/MOBI#Trailing_entries

This is implemented in the last FBreader GPLed sources (now closed source), Calibre, python-mobi and various others.

*** Bug 475730 has been marked as a duplicate of this bug. ***

Created a Bugzilla product for it; moving there.

*** Bug 488587 has been marked as a duplicate of this bug. ***

*** Bug 487481 has been marked as a duplicate of this bug. ***

*** Bug 486853 has been marked as a duplicate of this bug. ***

*** Bug 489612 has been marked as a duplicate of this bug. ***

*** Bug 490210 has been marked as a duplicate of this bug. ***

*** Bug 490446 has been marked as a duplicate of this bug. ***
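
For anyone picking this up, the trailing-data handling mentioned in the comments above can be sketched roughly as follows. This is only an illustration in Python, based on my reading of the MobileRead wiki page linked above; it is not QMobipocket code. The function names `trailing_entry_size` and `strip_trailing_entries` are hypothetical, and `extra_flags` is assumed to be the "extra data flags" value already read from the MOBI header. The idea is that each text record must have its trailing entries cut off before it is passed to the decompressor.

```python
def trailing_entry_size(record: bytes, end: int) -> int:
    """Decode the backward-encoded size whose last byte is record[end - 1].

    Assumption (per the MobileRead wiki): the size is stored in 7-bit
    groups, read from the end of the record towards the start, and the
    byte with its high bit set is the first (most significant) one.
    """
    size, shift = 0, 0
    while end > 0 and shift < 28:
        byte = record[end - 1]
        size |= (byte & 0x7F) << shift
        shift += 7
        end -= 1
        if byte & 0x80:  # reached the first byte of the encoded size
            break
    return size


def strip_trailing_entries(record: bytes, extra_flags: int) -> bytes:
    """Return the record without its trailing entries (hypothetical helper)."""
    trailing = 0
    # Bits 1 and up: one trailing entry per set bit. Each entry's size
    # is assumed to include the size bytes themselves.
    flags = extra_flags >> 1
    while flags:
        if flags & 1:
            trailing += trailing_entry_size(record, len(record) - trailing)
        flags >>= 1
    # Bit 0: multibyte overlap data. The low two bits of its last byte
    # give the number of overlap bytes, plus one for that count byte.
    if extra_flags & 1 and len(record) - trailing > 0:
        trailing += (record[len(record) - trailing - 1] & 0x3) + 1
    return record[:max(0, len(record) - trailing)]


# Five bytes of payload followed by a 3-byte trailing entry (size byte 0x83).
print(strip_trailing_entries(b"hello\xaa\xbb\x83", 0b10))  # b'hello'
```

If the flags are zero (older files), the record is returned unchanged, so a fix along these lines should not affect files that already work.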