Bug 460882 - Wishlist: KFileMetadata extractor for .eml files
Status: CONFIRMED
Alias: None
Product: frameworks-kfilemetadata
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Neon Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Pinak Ahuja
URL:
Keywords: junior-jobs
Depends on:
Blocks:
 
Reported: 2022-10-23 06:14 UTC by tagwerk19
Modified: 2023-11-10 14:49 UTC

See Also:
Latest Commit:
Version Fixed In:


Description tagwerk19 2022-10-23 06:14:53 UTC
SUMMARY:

    Baloo can index a *very* large number of terms when indexing an .eml file
    (an exported email message).

STEPS TO REPRODUCE:

    Save an email message (one that contains encoded attachments such as images) as
    whatever.eml in a folder being indexed by baloo.

    Wait for it to be indexed and check the extracted data with

        balooshow -x whatever.eml
    
OBSERVED RESULTS

    Likely *very* many indexed terms...

        Internal Info
        Terms: + ++ +++ 0 00 000 0000 003qzxnsptaacsfyy7v7za55g 003vrg9mqpdg5l2qwdeyibwrz
        .... 
        zxdpbmcgq29uzgl0aw9uiglui zxfahckfvzgvgwppcdshla zxfimr20rc5aaj69sbyfxpowo
        zxfl5frdetlhmgrqwwqltgf4u zxhvjklib4iyhh zxhzk9uciwo0e5gcdkdnj zxidunvh5eo5
        zxigd2l0acbwawn0dxjlcybvz zxirdqt1ym5pwnkzsfmfc0djq zxjpzjtjb2xvcjojnem3nkeyi 
        zxjwb2xhdgugdhj1zs9mzw5nd zxk zxk8461lf38mq25nzlaerkcy1 zxknmzleroscscd06vn9ocqec
        xlektnijydyatjnnj4bzl0kl zxlpmhv zxlyoj7dbvyojfbitx7y6 zxm9ha9xqmkhljd29sk
        zxmv6raojowfj5bbxkghstgn1 zxnfcojqsjptsbrjdhmwu2v9l zxnjaaaa zxnqx zxodub5 zxpu
        zxrk9lh8m6t zxrovfd7

    I've seen a test .eml generate 60,000 indexed terms. That is a bit unfair on
    baloo :-/

EXPECTED RESULTS

    Ideally, only the extracted text/plain body of the email would be indexed, possibly
    with the RFC822 header lines indexed under specific tags (From:, To:, CC:, Subject:, Date: ?)

SOFTWARE/OS VERSIONS

    Neon Unstable
    Plasma: 5.26.80
    Frameworks: 5.100.0
    Qt: 5.15.6

ADDITIONAL INFORMATION

    I think this will require a kfilemetadata extractor for "message/rfc822" type files.
    There might be established code (munpack?) that can be invoked to do the "extraction".
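
    To illustrate the kind of extraction meant here, below is a rough sketch using
    Python's standard email package. It is not kfilemetadata code. The function name
    extract_indexable_text and the choice of header fields are assumptions made for
    this example only. It keeps a few headers and the text/plain body, and ignores
    encoded attachments, which are the source of the junk terms.

```python
from email import message_from_bytes, policy

# Header fields to index under their own tags (an assumption for this sketch).
WANTED_HEADERS = ("From", "To", "Cc", "Subject", "Date")


def extract_indexable_text(raw: bytes) -> dict:
    """Return selected headers and the text/plain body of an RFC822 message."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Keep only the headers that are present in the message.
    headers = {
        name: str(msg[name]) for name in WANTED_HEADERS if msg[name] is not None
    }

    # get_body() skips parts marked as attachments and non-text parts,
    # so base64-encoded images never reach the indexer.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    return {"headers": headers, "body": body}
```

    With something like this, a message with a large image attachment would yield
    only its header values and the readable body text as candidate terms.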
Comment 1 Stefan Brüns 2023-11-10 14:49:52 UTC
A minimal parser should be fairly trivial to write; it just has to handle the header part.
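
As a rough illustration of what such a header-only parser could look like, here is a sketch in Python (an actual extractor would be written in C++ against the kfilemetadata plugin interface, which is not shown here). The name parse_headers is hypothetical. The sketch reads lines up to the first blank line, joins folded continuation lines, and never looks at the body, so encoded attachments are not touched at all.

```python
def parse_headers(text: str) -> dict:
    """Parse the RFC822 header block of a message into a dict.

    Keys are lower-cased field names. Simplifications in this sketch:
    a repeated field (e.g. Received) keeps only its last value, and
    RFC 2047 encoded words are not decoded.
    """
    headers = {}
    current = None  # name of the field a continuation line belongs to

    for line in text.splitlines():
        if line == "":
            break  # the first blank line ends the header block

        # A line starting with whitespace continues the previous field.
        if line[0] in " \t" and current is not None:
            headers[current] += " " + line.strip()
            continue

        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line; skip it

        current = name.strip().lower()
        headers[current] = value.strip()

    return headers
```

Only the fields of interest (From, To, CC, Subject, Date) would then be picked out of the returned dict; everything after the blank line is ignored.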