Bug 464226

Summary: Baloo and Nulls
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: tagwerk19
Component: Baloo File Daemon
Assignee: baloo-bugs-null
Status: RESOLVED FIXED    
Severity: major    
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Latest Commit:
Version Fixed In:
Attachments: Text file containing a \000

Description tagwerk19 2023-01-13 08:03:54 UTC
Created attachment 155254
Text file containing a \000

SUMMARY:
    Baloo seems to stumble when it meets a "null" character in a text file.

    A parallel or more general case of:

        https://invent.kde.org/frameworks/baloo/-/merge_requests/87

STEPS TO REPRODUCE:
    Download the test file into an indexed folder. The file contains:

        -> ^@ <-

    where the ^@ is a "null" byte. Ask baloo what it has as the indexed data:

        $ balooshow -x file-with-a-000.txt
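
If the attachment is not to hand, a file with similar content can be generated. This is a minimal sketch, assuming the attachment is just the single line shown above with one NUL byte between the arrows; the exact bytes of attachment 155254 are not reproduced here, and `make_null_test_file` is a hypothetical helper name.

```python
from pathlib import Path


def make_null_test_file(path):
    """Write a one-line text file with a single NUL byte between the arrows.

    The content is an assumption based on the "-> ^@ <-" line in this report;
    the real attachment may differ in whitespace or line ending.
    """
    data = b"-> \x00 <-\n"
    Path(path).write_bytes(data)
    return data
```

Calling `make_null_test_file("file-with-a-000.txt")` inside an indexed folder should give Baloo something equivalent to index.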

OBSERVED RESULTS:
    You get:

        1625990000fc01 64513 1451417 file-with-a-000.txt [/home/test/Documents/file-with-a-000.txt]
                Mtime: 1673373876 2023-01-10T18:04:36
                Ctime: 1673373876 2023-01-10T18:04:36
                Cached properties:
                        Line Count: 1

        Internal Info
        Terms:   < > Mplain Mtext T5 T8 X20-1
        File Name Terms: F000 Fa Ffile Ftxt Fwith
        XAttr Terms:
        Internal Error - malformed term (short): ''
        Internal Error - malformed term (short): ''
        lineCount: 1
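
To check several files for the same symptom, the output above can be scanned for the error line. This is a rough sketch, not part of Baloo: `count_malformed_terms` and `check_file` are hypothetical names, and since it is not stated here whether balooshow prints the error on stdout or stderr, both streams are searched.

```python
import subprocess

# Prefix of the error line as it appears in the output quoted in this report.
MALFORMED = "Internal Error - malformed term"


def count_malformed_terms(balooshow_output):
    """Count 'malformed term' lines in text printed by `balooshow -x`."""
    return sum(
        1
        for line in balooshow_output.splitlines()
        if line.strip().startswith(MALFORMED)
    )


def check_file(path):
    """Run `balooshow -x <path>` and return the number of malformed-term lines."""
    result = subprocess.run(
        ["balooshow", "-x", path],
        capture_output=True,
        text=True,
        check=False,  # only the printed text is inspected, not the exit status
    )
    # Assumption: the error could land on either stream, so search both.
    return count_malformed_terms(result.stdout + result.stderr)
```

Fed the output quoted above, `count_malformed_terms` counts two error lines; on the expected output it would count none.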

EXPECTED RESULTS:

        Internal Info
        Terms:   < > Mplain Mtext T5 T8 X20-1
        File Name Terms: F000 Fa Ffile Ftxt Fwith
        XAttr Terms:
        lineCount: 1

ADDITIONAL INFORMATION
    Igor Poboiko's "baloo-checkdb.py" script:

        https://invent.kde.org/frameworks/baloo/uploads/bdc9f5f17fc96490b7bd4a22ac664843/baloo-checkdb.py

    gives a couple of errors:

        ...
        Checking whether posting[docterms[docid]] contains docid (can take some time)...
        ERROR: 6236232384314369 (/home/test/Documents/file-with-a-000.txt) has term  which wasn't found in PostingDB
        ERROR: 6236232384314369 (/home/test/Documents/file-with-a-000.txt) has term  which wasn't found in PostingDB
        ...

    and the merge request mentions:

        ... TermGenerator then generates proper (yet meaningless) terms out of those
        characters, and they end up in database ...

    In this case it's happening for a "null" in a text file rather than a problematic
    PDF. I think it should *not* be possible for a file to corrupt the database.
    The worry is that a "specially crafted" file could perform mischief, so I'm
    flagging this as "major".
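
To get a feel for how many files in an indexed folder might trigger this, the folder can be scanned for text files containing a NUL byte. This is a sketch under assumptions: `files_with_nul` is a hypothetical helper, and the `.txt` suffix filter is only a stand-in for whatever Baloo actually uses to decide that a file is plain text.

```python
import os


def files_with_nul(root, suffixes=(".txt",), chunk_size=65536):
    """Yield paths under `root` whose contents include at least one NUL byte.

    `suffixes` must be a tuple; it is a crude approximation of "plain text"
    and not how Baloo itself classifies files.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Read in chunks so large files are not loaded whole.
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        if b"\x00" in chunk:
                            yield path
                            break
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
```

Something like `list(files_with_nul("/home/test/Documents"))` would then list candidates to inspect with `balooshow -x`.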

Comment 2 tagwerk19 2023-04-19 13:48:56 UTC
Should arrive with Frameworks 5.105.

Comment 3 tagwerk19 2023-04-22 10:14:08 UTC
Flagging Resolved/Fixed.