| Summary: | Booking.com confirmation pdf not parsed | ||
|---|---|---|---|
| Product: | [Applications] KDE Itinerary | Reporter: | Gergely HORVÁTH <horvathg.1988> |
| Component: | general | Assignee: | Volker Krause <vkrause> |
| Status: | REPORTED --- | ||
| Severity: | normal | ||
| Priority: | NOR | ||
| Version First Reported In: | unspecified | ||
| Target Milestone: | --- | ||
| Platform: | Other | ||
| OS: | Linux | ||
| Latest Commit: | Version Fixed/Implemented In: | ||
| Sentry Crash Report: | |||
|
Description
Gergely HORVÁTH
2024-07-11 20:48:14 UTC
PDFs created by printing a website are currently unfortunately not supported; the PDFs we can import are only those generated by the providers themselves.

Is it because the generated PDFs contain some additional metadata/tags that make them easier (or possible) to parse, or because the position of the various data after printing is not constant? Moreover, are you using some metadata, or searching for patterns on the page, during parsing?

The main challenge with browser-generated PDFs is that line and page breaks don't tend to be stable, resulting in many more variations that need to be handled. That doesn't mean extracting from them is impossible, but you can usually only rely on the textual content, not on its structure or layout. For extracting from other PDFs we tend to use a mix of textual and structural approaches.

Metadata beyond that is rare in PDFs (e.g. the author/creator fields). When properly set, we use it e.g. for determining the correct extractor scripts, which is more efficient than doing that based on the content. The best case is (large) 2D barcodes in PDFs, like those found on flight or train tickets, but those should already work in PDFs printed from websites as well. |
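
To illustrate the two ideas above (matching on textual content only, and preferring metadata over content when picking an extractor), here is a minimal Python sketch. It is not how KDE Itinerary implements this. The labels in the patterns, the creator string, the extractor names and the `example-provider.test` marker are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

def normalize(text: str) -> str:
    """Collapse every whitespace run (spaces, line breaks, page breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical field labels; a real confirmation will use different wording.
CONFIRMATION_RE = re.compile(r"Confirmation number:? ([0-9.]+)", re.IGNORECASE)
CHECK_IN_RE = re.compile(r"Check-in (\d{4}-\d{2}-\d{2})", re.IGNORECASE)

def extract_booking(text: str) -> dict:
    """Extract fields from the flattened text, ignoring where lines or pages break."""
    flat = normalize(text)
    result = {}
    match = CONFIRMATION_RE.search(flat)
    if match:
        result["confirmation"] = match.group(1)
    match = CHECK_IN_RE.search(flat)
    if match:
        result["check_in"] = match.group(1)
    return result

# Hypothetical mapping from the PDF creator metadata field to an extractor name.
EXTRACTORS_BY_CREATOR = {"Example Provider PDF Generator": "example_provider"}

def pick_extractor(creator: Optional[str], text: str) -> Optional[str]:
    """Prefer the cheap metadata lookup; fall back to scanning the content."""
    if creator in EXTRACTORS_BY_CREATOR:
        return EXTRACTORS_BY_CREATOR[creator]
    # A browser-printed PDF usually names the browser as creator, so look at the text.
    if "example-provider.test" in normalize(text):
        return "example_provider"
    return None
```

With this approach, the same booking text yields the same fields whether or not the browser inserted a line or page break in the middle of a label. The cost is that anything encoded only by layout (columns, table cells) is lost once the text is flattened.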