321478 – Huge titles are assigned for PDF files without title

Status:	RESOLVED FIXED

Alias:	None

Product:	nepomuk
Classification:	Unmaintained
Component:	fileindexer
Version:	4.10.80
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Nepomuk Bugs Coordination

Reported:	2013-06-21 21:24 UTC by Antonio Rojas
Modified:	2013-07-03 23:41 UTC
CC List:	2 users

Latest Commit:	http://commits.kde.org/nepomuk-core/894661480595e90627bb6a10b2e073648b150758

Description Antonio Rojas 2013-06-21 21:24:17 UTC

If a PDF file doesn't have a title, the indexer seems to assign some text from the first page as the title. Sometimes all of the content from the first page is included, making the "title" field huge, which looks bad in Dolphin.

Reproducible: Always

Steps to Reproduce:
Index a PDF with no title

Actual Results:
The title field is filled with text from the first page

Expected Results:
No title is set

Comment 1 Vishesh Handa 2013-06-21 21:43:40 UTC

Confirmed.

Do you really think no title should be set? I was thinking of maybe trimming the title to the first 50 or 100 characters.

Comment 2 Antonio Rojas 2013-06-21 21:55:12 UTC

If a PDF file does not have a title set, why should Nepomuk try to guess it? Even if it is trimmed, it would probably include some text besides the actual title. It could be confusing when displayed in Dolphin. I think the expected behavior is that the "Title" Nepomuk field corresponds to the "Title" field in the PDF file.

Comment 3 Vishesh Handa 2013-06-21 22:05:00 UTC

Well, the reason it was added was that a large number of PDF files do not have titles, and we would still like a title, so we try to infer it from the first page. It works remarkably well for research papers. I'm not too keen on removing this feature.

I can either try to guess the title better, or trim it.

Comment 4 Vishesh Handa 2013-06-25 17:17:31 UTC

Git commit 894661480595e90627bb6a10b2e073648b150758 by Vishesh Handa.
Committed on 25/06/2013 at 11:10.
Pushed by vhanda into branch 'master'.

PopplerExtractor: Trim the guessed title to the first 50 characters

Sometimes the guessed title is just too long, in those cases we try to
trim it to the first 50 characters.

M  +3    -0    services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/894661480595e90627bb6a10b2e073648b150758
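
For illustration only, the trimming described in the commit message could look roughly like the sketch below. This is not the code from the commit, which is not reproduced in this report. The function name trimGuessedTitle and its maxChars parameter are hypothetical, and the sketch assumes the guessed title is a UTF-8 encoded std::string rather than whatever string type popplerextractor.cpp actually uses. It counts code points instead of bytes so that a multi-byte character is never cut in half.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT the code from the commit: returns at most
// maxChars characters of a guessed title. Characters are counted as UTF-8
// code points, so the cut never lands inside a multi-byte sequence.
// Assumption: the input is valid UTF-8.
std::string trimGuessedTitle(const std::string& title, std::size_t maxChars = 50)
{
    std::size_t chars = 0; // code points kept so far
    std::size_t i = 0;     // byte offset just past the last kept code point

    while (i < title.size() && chars < maxChars) {
        // Step over the lead byte of the next code point.
        ++i;
        // Step over its continuation bytes (bit pattern 10xxxxxx).
        while (i < title.size() &&
               (static_cast<unsigned char>(title[i]) & 0xC0) == 0x80) {
            ++i;
        }
        ++chars;
    }

    // Titles already within the limit are returned unchanged.
    return title.substr(0, i);
}
```

Calling trimGuessedTitle on a title that is already short returns it as-is, while a first page's worth of text is cut down to its first 50 characters. Whether 50 is the right limit is a judgment call; comment 1 mentions both 50 and 100 as candidates.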

Comment 5 Antonio Rojas 2013-06-28 21:10:08 UTC

In beta 2, the "guessed" titles are shorter, but they contain many Chinese and other UTF-8 characters which seem unrelated to the contents of the PDF.

Comment 6 Christoph Feck 2013-07-03 23:41:58 UTC

Antonio, could you report it as a separate bug, ideally attaching a small PDF file that shows the issue?