If a PDF file doesn't have a title, the indexer seems to assign some text from the first page as the title. Sometimes all the content from the first page is included in the title, making the "title" field huge, which looks bad in Dolphin.
Reproducible: Always
Steps to Reproduce: Index a PDF with no title
Actual Results: The title field is filled with text from the first page
Expected Results: No title is set
Confirmed. Do you really think no title should be set? I was thinking of maybe trimming the title to the first 50 or 100 characters.
If a PDF file does not have a title set, why should Nepomuk try to guess it? Even if it is trimmed, it would probably include some text besides the actual title, which could be confusing when displayed in Dolphin. I think the expected behavior is that the "Title" Nepomuk field corresponds to the "Title" field in the PDF file.
Well, the reason it was added is that a large number of PDF files do not have titles, and we would still like a title, so we try to infer it from the first page. It works remarkably well for research papers. I'm not too keen on removing this feature. I can either try to guess the title better, or trim it.
Git commit 894661480595e90627bb6a10b2e073648b150758 by Vishesh Handa.
Committed on 25/06/2013 at 11:10.
Pushed by vhanda into branch 'master'.

PopplerExtractor: Trim the guessed title to the first 50 characters

Sometimes the guessed title is just too long, in those cases we try to trim it to the first 50 characters.

M +3 -0 services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/894661480595e90627bb6a10b2e073648b150758
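
For readers following the thread, here is a minimal sketch of the behavior the commit message describes. It is not the code from popplerextractor.cpp. The names `chooseTitle` and `kMaxGuessedTitleLength` are hypothetical, and so is the assumption that a title stored in the PDF metadata is kept untrimmed. The sketch uses `std::u32string` rather than Qt types so that it compiles on its own and so that trimming counts whole code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical constant: the limit named in the commit message
// ("first 50 characters").
constexpr std::size_t kMaxGuessedTitleLength = 50;

// Hypothetical helper, not the actual PopplerExtractor code.
// Assumption: a title present in the PDF metadata is used as-is, and only
// a title guessed from the first page is trimmed.
std::u32string chooseTitle(const std::u32string& metadataTitle,
                           const std::u32string& guessedTitle)
{
    if (!metadataTitle.empty()) {
        return metadataTitle;
    }
    // substr clamps the count to the remaining length, so a guess shorter
    // than the limit is returned unchanged.
    return guessedTitle.substr(0, kMaxGuessedTitleLength);
}
```

With this sketch, a 120-character guess from the first page comes back as its first 50 characters, while a PDF that already has a metadata title keeps that title in full.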
In beta 2, the "guessed" titles are shorter, but they contain many Chinese and other UTF-8 characters that seem unrelated to the contents of the PDF.
Antonio, could you report it as a separate bug, ideally attaching a small PDF file that shows the issue?