Bug 477533 - Error while reading database file when Category name starts with numbers
Summary: Error while reading database file when Category name starts with numbers
Status: CONFIRMED
Alias: None
Product: kphotoalbum
Classification: Applications
Component: Backend (show other bugs)
Version: GIT master
Platform: openSUSE Linux
: NOR normal
Target Milestone: ---
Assignee: KPhotoAlbum Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-26 00:01 UTC by Victor Lobo
Modified: 2025-01-03 18:22 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Victor Lobo 2023-11-26 00:01:53 UTC
SUMMARY
Error while reading database file when Category name starts with numbers

Error while reading database file
An error was encountered on line 83, column 184:Error reading next element
Additional error information:Expected '>' or '/', but got '[0-9]'.
Database path: ‘/tmp/kphotoalbum-demo-victor/index.xml’


STEPS TO REPRODUCE
1. Open KPhotoAlbum (kphotoalbum --demo)
2. From the main menu go to Settings > Configure KPhotoAlbum... > Categories
3. (Save the DB now if needed) and Add a new category that starts with one or more digit in [0-9] (for example 2Review)
4. Click OK to close the Settings window and save the category
5. Go to the Annotations window of any image or video and create and assign a new tag for category 2Review
6. Save the database
8. Close KPhotoAlbum (if in demo remember to NOT delete the database)
9. Reopen KPhotoAlbum (kphotoalbum --demo)
10. Error while reading database file


Here new category has been created but same error occurs if existing categories (such as Places) are renamed to start with a number

OBSERVED RESULT
Error while reading database file

EXPECTED RESULT
Depends on the design; either not allow categories starting with digits or write correctly to the xml or read correctly from the xml... etc.


SOFTWARE/OS VERSIONS
Linux: openSUSE Tumbleweed 20230906; Kernel Version: 6.4.12-1-default (64-bit)
KDE Plasma Version: 5.27.7
KDE Frameworks Version: 5.109.0
Qt Version: 5.15.10
KPhotoAlbum v5.11 and
KPhotoAlbum development version latest git master v5.12.0
Comment 1 Johannes Zarl-Zierl 2023-11-26 20:59:21 UTC
The issue here lies with the "compressed" file format using the category name as an XML attribute:

> <category name="2Review" ...>
> ....
> <image file="..." 2Review="test" ...>

That also means that one can create and use such a category in the "uncompressed" file format perfectly well and than convert the database to the "compressed" file format and totally break it.

To fix this on a database level, we need to change the database file format and break compatibility.
Alternatively, we could just prevent creating such a category - but that seems like a band-aid solution to me.

Since we are breaking compatibility anyways, I'm contemplating decoupling between the category name and the XML attribute name to avoid the whole escaping/unescaping mechanism altogether.

I'll need a few days to think about the which solution will be best to implement...
Comment 2 Tobias Leupold 2023-11-27 07:56:22 UTC
What about the following: We check all category names. Is we have one starting with a number, we force the "uncompressed" format, and we're done. Maybe we inform the user about that fallback (with a "don't show again" message box).

Actually, we mix both variants anyway as soon as tagged areas are present.

I'm not sure if our users actually know about those two variants and consciously choose one … and if the benefit is really worth keeping and maintaining both …
Comment 3 Tobias Leupold 2025-01-03 18:22:36 UTC
Meanwhile, I know what's going on here, and I also think I know how to fix this :-)

When the "compressed" file format is used, category names are used as XML attributes. To be able to do so, they are escaped. Our current escaping algorithm produces invalid XML attribute names, depending on the input: It (among other flaws) allows numbers to be the first character of the escaped output.

This violates the XML spec (cf. https://www.w3.org/TR/xml/ ), which states that the first character of an XML attribute must be a NameStartChar. That is "a-z", "A-Z", ":" or "_". Numbers are allowed later in the attribute name, but not as the first character.

When writing the XML file, the non-compliant attribute name is written nevertheless. When re-opening the database later, the data can't be read anymore though, because the parser finds a number where he expects either the end of the tag ("/>" or ">") or a new attribute (a NameStartChar), cf. the posted error message: "Expected '>' or '/', but got '[0-9]'" – and thus fails on the invalid XML.

Just as a side note: The algorithm also can't escape non-Latin-1 characters correctly (they become "?"), and we also have problems with category names containing spaces and underscores when using the "readable" format, which aren't unescaped to what they initially were (all underscores are replaced by spaces and the underscores are lost on the next reading).

The only way to fix the root cause for this is to implement a new escaping algorithm to escape category names to be used as XML attributes that respects the XML spec.

My proposal for a compliant implementation can be found at https://invent.kde.org/graphics/kphotoalbum/-/tree/safe_xml_escaping?ref_type=heads – I use a modified URL-style percent encoding using QByteArray's integrated functionality. With this approach, not only the numbers issue is fixed, but one can also use the whole Unicode range in a category name. Also, the spaces and underscores issue is gone for the "readable" format.

Needs testing though.