Bug 321478 - Huge titles are assigned for PDF files without title
Status: RESOLVED FIXED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: fileindexer
Version: 4.10.80
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Nepomuk Bugs Coordination
Reported: 2013-06-21 21:24 UTC by Antonio Rojas
Modified: 2013-07-03 23:41 UTC
Description Antonio Rojas 2013-06-21 21:24:17 UTC
If a PDF file doesn't have a title, the indexer seems to assign some text from the first page as the title. Sometimes all the content from the first page is included in the title, making the "title" field huge, which looks bad in Dolphin.

Reproducible: Always

Steps to Reproduce:
Index a PDF with no title
Actual Results:  
The title field is filled with text from the first page

Expected Results:  
No title is set
Comment 1 Vishesh Handa 2013-06-21 21:43:40 UTC
Confirmed.

Do you really think no title should be set? I was thinking of maybe trimming the title to the first 50 or 100 characters.
Comment 2 Antonio Rojas 2013-06-21 21:55:12 UTC
If a PDF file does not have a title set, why should Nepomuk try to guess it? Even if it is trimmed, it would probably include some text besides the actual title. It could be confusing when displayed in Dolphin. I think the expected behavior is that the "Title" Nepomuk field corresponds to the "Title" field in the PDF file.
Comment 3 Vishesh Handa 2013-06-21 22:05:00 UTC
Well, the reason it was added was that a large number of PDF files do not have titles, and we would still like a title, so we try to infer it from the first page. It works remarkably well for research papers. I'm not too keen on removing this feature.

I can either try to guess the title better, or trim it.
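
For illustration only, one way to "guess the title better" could be to take just the first non-blank line of the first page's text instead of a larger chunk. The sketch below is not the heuristic used by the indexer; the function name guessTitle and the idea of passing in the already extracted first-page text as a plain string are assumptions made for this example.

```cpp
#include <sstream>
#include <string>

// Hypothetical heuristic (not the indexer's actual code): return the first
// non-blank line of the first page's text, with surrounding whitespace
// removed. Returns an empty string if the page has no visible text.
std::string guessTitle(const std::string& firstPageText)
{
    std::istringstream in(firstPageText);
    std::string line;
    while (std::getline(in, line)) {
        // Skip lines that contain only spaces, tabs or carriage returns.
        const std::string::size_type first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) {
            continue;
        }
        const std::string::size_type last = line.find_last_not_of(" \t\r");
        return line.substr(first, last - first + 1);
    }
    return std::string();
}
```

A single line can still be wrong (a journal header, for example), so a length cap would probably still be needed on top of something like this.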
Comment 4 Vishesh Handa 2013-06-25 17:17:31 UTC
Git commit 894661480595e90627bb6a10b2e073648b150758 by Vishesh Handa.
Committed on 25/06/2013 at 11:10.
Pushed by vhanda into branch 'master'.

PopplerExtractor: Trim the guessed title to the first 50 characters

Sometimes the guessed title is just too long, in those cases we try to
trim it to the first 50 characters.

M  +3    -0    services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/894661480595e90627bb6a10b2e073648b150758
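
As a rough sketch of what "trim to the first 50 characters" can look like, the helper below cuts a UTF-8 string after a given number of code points without splitting a multi-byte sequence. It is not the code from the commit above; the name trimTitle, the use of std::string, and the assumption that the input is valid UTF-8 are all choices made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not taken from popplerextractor.cpp): keep at most
// maxChars Unicode code points of a UTF-8 encoded title. The cut always
// falls on a code point boundary. Assumes the input is valid UTF-8.
std::string trimTitle(const std::string& title, std::size_t maxChars = 50)
{
    std::size_t chars = 0; // code points kept so far
    std::size_t pos = 0;   // byte offset just past the last kept code point
    while (pos < title.size() && chars < maxChars) {
        ++pos; // consume the lead byte of the next code point
        // Consume its continuation bytes, which have the form 10xxxxxx.
        while (pos < title.size() &&
               (static_cast<unsigned char>(title[pos]) & 0xC0) == 0x80) {
            ++pos;
        }
        ++chars;
    }
    return title.substr(0, pos);
}
```

Titles that are already short pass through unchanged, so a call such as trimTitle(guessedTitle) could in principle be applied to every guessed title.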
Comment 5 Antonio Rojas 2013-06-28 21:10:08 UTC
In beta 2, the "guessed" titles are shorter, but they contain many Chinese and other UTF-8 characters which seem unrelated to the contents of the PDF.
Comment 6 Christoph Feck 2013-07-03 23:41:58 UTC
Antonio, could you report it as a separate bug, ideally attaching a small PDF file that shows the issue?