Bug 460882

Summary: Wishlist: KFileMetadata extractor for .eml files
Product: [Frameworks and Libraries] frameworks-kfilemetadata Reporter: tagwerk19
Component: general Assignee: Pinak Ahuja <pinak.ahuja>
Status: CONFIRMED ---    
Severity: wishlist CC: nate, stefan.bruens
Priority: NOR Keywords: junior-jobs
Version: unspecified   
Target Milestone: ---   
Platform: Neon   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=447681
Latest Commit: Version Fixed In:

Description tagwerk19 2022-10-23 06:14:53 UTC
SUMMARY:

    Baloo can index *very* many terms when indexing an .eml file (an exported
    email message).

STEPS TO REPRODUCE:

    Save an email message (one that contains encoded attachments such as images) to
    a whatever.eml file in a folder being indexed by baloo.

    Wait for it to be indexed and check the extracted data with

        balooshow -x whatever.eml
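
    If no suitable message is at hand, a test file can be generated. The
    following is a minimal sketch in Python, assuming random bytes are an
    acceptable stand-in for a real image attachment. The function names
    (build_test_eml, count_naive_terms) and the addresses are hypothetical,
    and the term counter is only a rough approximation of a plain-text
    indexer, not Baloo's actual tokenizer.

```python
import os
import re
from email.message import EmailMessage


def build_test_eml(payload_size=30000):
    """Build an email with a binary attachment and return it as bytes."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"  # hypothetical addresses
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Test message with attachment"
    msg.set_content("Hello, this is the plain text body.\n")
    # Random bytes stand in for an image; the email package
    # base64-encodes them when the message is serialized.
    msg.add_attachment(
        os.urandom(payload_size),
        maintype="image",
        subtype="png",
        filename="picture.png",
    )
    return msg.as_bytes()


def count_naive_terms(raw):
    """Count distinct alphanumeric runs in the raw file.

    Assumption: this only approximates what an indexer does when it
    treats the whole .eml as plain text.
    """
    text = raw.decode("ascii", errors="replace").lower()
    return len(set(re.findall(r"[a-z0-9]+", text)))


if __name__ == "__main__":
    data = build_test_eml()
    with open("whatever.eml", "wb") as fh:
        fh.write(data)
    print(count_naive_terms(data))
```

    Copying the generated whatever.eml into an indexed folder should give a
    file comparable to the one described in the steps above.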
    
OBSERVED RESULTS

    Likely *very* many indexed terms...

        Internal Info
        Terms: + ++ +++ 0 00 000 0000 003qzxnsptaacsfyy7v7za55g 003vrg9mqpdg5l2qwdeyibwrz
        .... 
        zxdpbmcgq29uzgl0aw9uiglui zxfahckfvzgvgwppcdshla zxfimr20rc5aaj69sbyfxpowo
        zxfl5frdetlhmgrqwwqltgf4u zxhvjklib4iyhh zxhzk9uciwo0e5gcdkdnj zxidunvh5eo5
        zxigd2l0acbwawn0dxjlcybvz zxirdqt1ym5pwnkzsfmfc0djq zxjpzjtjb2xvcjojnem3nkeyi 
        zxjwb2xhdgugdhj1zs9mzw5nd zxk zxk8461lf38mq25nzlaerkcy1 zxknmzleroscscd06vn9ocqec
        xlektnijydyatjnnj4bzl0kl zxlpmhv zxlyoj7dbvyojfbitx7y6 zxm9ha9xqmkhljd29sk
        zxmv6raojowfj5bbxkghstgn1 zxnfcojqsjptsbrjdhmwu2v9l zxnjaaaa zxnqx zxodub5 zxpu
        zxrk9lh8m6t zxrovfd7

    I've seen a test .eml generate 60,000 indexed terms. That is a bit unfair on
    baloo :-/

EXPECTED RESULTS

    Ideally, only the text/plain part of the email body would be extracted, possibly with
    the RFC822 header lines indexed under specific tags (From:, To:, CC:, Subject:, Date: ?)
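
    As an illustration of the expected behaviour, here is a minimal sketch
    using Python's standard email package. It is not KFileMetadata code; the
    function name extract_eml, the HEADER_FIELDS tuple and the SAMPLE message
    are hypothetical and exist only for this example.

```python
from email import policy
from email.parser import BytesParser

# Header lines that might be indexed under their own tags (assumption).
HEADER_FIELDS = ("From", "To", "Cc", "Subject", "Date")


def extract_eml(raw):
    """Return (headers, body): selected headers and the text/plain body."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    headers = {
        name: str(msg[name]) for name in HEADER_FIELDS if msg[name] is not None
    }
    # Only look for a text/plain part; attachments are skipped.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return headers, body


# Hypothetical sample message with one text part and one encoded attachment.
SAMPLE = (
    b"From: Alice <alice@example.org>\r\n"
    b"To: Bob <bob@example.org>\r\n"
    b"Subject: Holiday pictures\r\n"
    b"Date: Sun, 23 Oct 2022 06:14:53 +0000\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"See the attached picture.\r\n"
    b"--XYZ\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk\r\n"
    b"--XYZ--\r\n"
)

if __name__ == "__main__":
    headers, body = extract_eml(SAMPLE)
    print(headers)
    print(body)
```

    With this approach the base64 data of the attachment never reaches the
    indexed text; only the selected headers and the readable body do.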

SOFTWARE/OS VERSIONS

    Neon Unstable
    Plasma: 5.26.80
    Frameworks: 5.100.0
    Qt: 5.15.6

ADDITIONAL INFORMATION

    I think this will require a kfilemetadata extractor for "message/rfc822" type files.
    There might be established code (munpack?) that can be invoked to do the "extraction".
Comment 1 Stefan Brüns 2023-11-10 14:49:52 UTC
A minimal parser should be fairly trivial to write; it just has to handle the header part.
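
To give an idea of how small a header-only parser can be, here is a sketch in Python. An actual KFileMetadata extractor would be a C++ plugin; this only illustrates the logic. The function name parse_headers is hypothetical, and the sketch assumes that stopping at the first empty line, unfolding continuation lines and decoding RFC 2047 encoded words is enough for indexing purposes.

```python
from email.header import decode_header, make_header


def parse_headers(raw):
    """Return a list of (name, value) pairs from the header block of raw bytes."""
    headers = []
    for line in raw.splitlines():
        if not line:
            break  # the first empty line ends the header block
        text = line.decode("ascii", errors="replace")
        if text[0] in " \t" and headers:
            # A line starting with whitespace continues the previous header.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + text.strip())
        elif ":" in text:
            name, _, value = text.partition(":")
            headers.append((name.strip(), value.strip()))
    # Decode RFC 2047 encoded words such as =?utf-8?q?...?= into plain text.
    return [(name, str(make_header(decode_header(value)))) for name, value in headers]


if __name__ == "__main__":
    sample = (
        b"From: alice@example.org\r\n"
        b"Subject: a fairly long\r\n"
        b" folded subject line\r\n"
        b"\r\n"
        b"Everything after the blank line is ignored.\r\n"
    )
    for name, value in parse_headers(sample):
        print(name, "=", value)
```

Because the body is never read, encoded attachments cannot contribute any terms, whatever their size.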