Bug 490127 - Booking.com confirmation pdf not parsed
Status: REPORTED
Alias: None
Product: KDE Itinerary
Classification: Applications
Component: general (other bugs)
Version First Reported In: unspecified
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Volker Krause
Reported: 2024-07-11 20:48 UTC by Gergely HORVÁTH
Modified: 2024-07-19 08:01 UTC (History)
0 users

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments

Description Gergely HORVÁTH 2024-07-11 20:48:14 UTC
SUMMARY

Currently I cannot find a way to import bookings from booking.com. Emails cannot be parsed (#454018), the link that arrives via text message doesn't work (not so surprising), and the PDF is not parsed either. This issue is about the PDF parsing.

STEPS TO REPRODUCE
1.  Make a booking on booking.com
2.  Print the full confirmation into a pdf
3.  Try to import the pdf into KDE Itinerary

OBSERVED RESULT
"Found nothing to import."

EXPECTED RESULT
Import the booking.

SOFTWARE/OS VERSIONS

Tested on LineageOS 21 (2024-07-04 OnePlus6 build)
F-Droid version, 24.07.70 (557e499)

Tested also on Arch Linux, version 24.05.1-1
Operating System: Arch Linux 
KDE Plasma Version: 6.1.1
KDE Frameworks Version: 6.3.0
Qt Version: 6.7.2
Kernel Version: 6.9.7-arch1-1 (64-bit)
Graphics Platform: Wayland
Processors: 8 × 11th Gen Intel® Core™ i7-11370H @ 3.30GHz
Memory: 31,1 GiB of RAM
Graphics Processor: Mesa Intel® Xe Graphics
Manufacturer: TUXEDO
Product Name: TUXEDO InfinityBook Pro 14 Gen6

I can attach earlier confirmations, if that helps. Currently I only have Hungarian ones, but in a couple of weeks I can also send an English one.

Keep up the great work!
Best,
Gergely
Comment 1 Volker Krause 2024-07-15 15:42:02 UTC
PDFs created by printing a website are unfortunately not supported at the moment; the only PDFs we can import are those generated by the providers themselves.
Comment 2 Gergely HORVÁTH 2024-07-16 18:39:34 UTC
Is it because the provider-generated PDFs contain some additional metadata/tags that make them easier (or possible) to parse, or because the position of the various data is not constant after printing? Moreover, are you using some metadata or searching for patterns on the page during parsing?
Comment 3 Volker Krause 2024-07-19 08:01:01 UTC
The main challenge with browser-generated PDFs is that line and page breaks tend not to be stable, resulting in many more variations that need to be handled. That doesn't mean extracting from them is impossible, but you can usually only rely on the textual content, not on its structure or layout.
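
To illustrate what relying only on textual content can look like, here is a minimal Python sketch. It is not KItinerary's actual extractor code; the "Confirmation number" label, the number format and the function names are assumptions made up for this example. The idea is to collapse all whitespace first, so that differing line or page breaks no longer affect the match.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, line breaks, form feeds) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical label and number format, chosen for illustration only.
CONFIRMATION_RE = re.compile(r"Confirmation number: ?(\d+(?:\.\d+)*)")


def find_confirmation_number(text: str) -> Optional[str]:
    """Return the confirmation number regardless of where the browser broke the lines."""
    match = CONFIRMATION_RE.search(normalize(text))
    return match.group(1) if match else None


# The same content with two different line/page break layouts.
layout_a = "Confirmation\nnumber: 1234.567.890\nCheck-in"
layout_b = "Confirmation number:\n\f1234.567.890 Check-in"

print(find_confirmation_number(layout_a))
print(find_confirmation_number(layout_b))
```

Both layouts yield the same value because the pattern is matched against the normalized text rather than against individual lines.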

For extracting from other PDFs we tend to use a mix of textual and structural approaches. Metadata beyond that is rare in PDFs (e.g. the author/creator fields); when properly set, we use it e.g. for determining the correct extractor script, which is more efficient than doing that based on the content. The best case is (large) 2D barcodes in PDFs, like those found on flight or train tickets, but those should already work in PDFs printed from websites as well.
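
The extractor selection described above could be sketched roughly as follows. This is a hypothetical Python illustration, not the real KItinerary implementation: the creator strings, the extractor functions and the lookup tables are all invented for the example. It tries the cheap metadata lookup first and falls back to scanning the content.

```python
from typing import Callable, Optional

Extractor = Callable[[str], dict]


def extract_example_airline(text: str) -> dict:
    # Placeholder extractor (hypothetical).
    return {"extractor": "example-airline"}


def extract_example_hotel(text: str) -> dict:
    # Placeholder extractor (hypothetical).
    return {"extractor": "example-hotel"}


# Hypothetical lookup tables: creator metadata value -> extractor,
# and content marker -> extractor.
BY_CREATOR = {"Example Airline Ticketing": extract_example_airline}
BY_CONTENT = [("Example Hotel", extract_example_hotel)]


def select_extractor(creator: Optional[str], text: str) -> Optional[Extractor]:
    """Pick an extractor from the creator field if possible, else from the content."""
    if creator and creator in BY_CREATOR:
        # Cheap path: a single dictionary lookup on the metadata.
        return BY_CREATOR[creator]
    for marker, extractor in BY_CONTENT:
        # Slower path: search the document text for a known marker.
        if marker in text:
            return extractor
    return None
```

A browser-printed PDF would typically carry the browser's own creator value, so in this sketch it would always take the slower content-based path.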