281136 – A NEPOMUK query containing extended characters fails.

Status:	RESOLVED FIXED

Alias:	None

Product:	nepomuk
Classification:	Unmaintained
Component:	general
Version:	unspecified
Platform:	Fedora RPMs Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Sebastian Trueg

Reported:	2011-08-31 20:56 UTC by Alejandro Nova
Modified:	2011-10-07 09:53 UTC
CC List:	2 users

Description Alejandro Nova 2011-08-31 20:56:12 UTC

Version:           unspecified (using KDE 4.7.0) 
OS:                Linux

When I try to use the deep indexing capabilities of NEPOMUK, and try to look for a word containing extended characters, NEPOMUK fails.

This is a hard regression, since KDE 4.6 worked well here.

Reproducible: Always

Steps to Reproduce:
1. Create a file named "Constitución".
2. Index it using Strigi.
3. Try to find it using the KRunner Nepomuk runner.

Actual Results:  
You won't find the file.

Expected Results:  
You find the file.

This is NOT bug 271664. I've done all the testing with Virtuoso 6.1.2, precisely to avoid bug 271664.

Comment 1 Alejandro Nova 2011-09-14 03:28:51 UTC

Tried again, with Virtuoso 6.1.3 + patch fixing bug 271664 upstream.

1. NEPOMUK stores data with "extended characters" correctly, and I can search for it.
2. I can search for contacts with extended characters in KRunner.
3. In the test case above, "Constitución" does not work, but "Constituci?n" works and finds meaningful results. This looks like a Strigi issue caused by testing the indexing code against a Virtuoso 6.1.3 in which bug 271664 was still unfixed. If that's the case, then this is a Strigi bug. Can anybody confirm?

Comment 2 Alejandro Nova 2011-09-30 13:06:18 UTC

Confirmed with a patched copy of Virtuoso 6.1.3 and narrowed: this is a NEPOMUK query generator bug.

1. If I look for "constitución" with KRunner, I get nothing.
2. If I look for "constitución" with Dolphin filename search, I get proper results.
3. If I look for "constitución" with Dolphin deep search, I get nothing.
4. If I look for "constitución" using Nepoogle, I get proper results. Nepoogle also gives me proper results when the keyword is inside the file, so this is neither a Strigi indexer bug nor an encoding error; it's the query generator that doesn't know how to handle UTF-8 input.

Comment 3 Sebastian Trueg 2011-09-30 14:21:23 UTC

(In reply to comment #2)
> Confirmed with a patched copy of Virtuoso 6.1.3 and narrowed: this is a NEPOMUK
> query generator bug.
> 
> 1. If I look for "constitución" with KRunner, I get nothing.
> 2. If I look for "constitución" with Dolphin filename search, I get proper
> results.
> 3. If I look for "constitución" with Dolphin deep search, I get nothing.
> 4. If I look for "constitución" using Nepoogle, I get proper results. Also,
> Nepoogle gives me proper results when the keyword is inside the file, so, this
> is neither a Strigi indexer bug nor an encoding error; it's the query generator
> who doesn't know how to handle UTF-8 input.

This is actually not a problem of Nepomuk not handling UTF-8 correctly. It does. The problem is the search excerpts: Virtuoso cannot handle non-ASCII characters in search excerpt queries. I will investigate how to solve this.

Comment 4 Sebastian Trueg 2011-09-30 15:11:05 UTC

Git commit 1e85e3cf19b95febeb7b53f42b1a39f900bf6dd4 by Sebastian Trueg.
Committed on 30/09/2011 at 17:10.
Pushed by trueg into branch 'master'.

Convert search excerpt method input to utf8.

Virtuoso's bif:search_excerpt does not understand Unicode characters,
thus the input has to be converted to utf8.
Hopefully this will be done automatically in a future version of
Virtuoso.

CCBUG: 281136

M  +1    -1    libnepomukcore/query/querybuilderdata_p.h

http://commits.kde.org/nepomuk-core/1e85e3cf19b95febeb7b53f42b1a39f900bf6dd4
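
For readers unfamiliar with what "converted to utf8" means here: the commit above performs the conversion inside the query, on the Virtuoso side, and its exact code is not shown in this report. The following is only a minimal sketch of the underlying idea in plain C++, assuming the term is available as a sequence of Unicode code points. The function name toUtf8 is hypothetical and does not come from Nepomuk or Virtuoso, and the sketch does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Hypothetical helper, for illustration only (not the Nepomuk code):
// encodes a sequence of Unicode code points as UTF-8 bytes.
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            // ASCII: one byte, unchanged.
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, the "ó" (U+00F3) in "Constitución" becomes the two bytes 0xC3 0xB3, while the ASCII letters around it stay as single bytes.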

Comment 5 Sebastian Trueg 2011-09-30 15:11:25 UTC

Git commit fc47a60f182ab33d1c8c73fbb4cd891721bdd6e8 by Sebastian Trueg.
Committed on 30/09/2011 at 17:08.
Pushed by trueg into branch 'KDE/4.7'.

Convert search excerpt method input to utf8.

Virtuoso's bif:search_excerpt does not understand Unicode characters,
thus the input has to be converted to utf8.
Hopefully this will be done automatically in a future version of
Virtuoso.

BUG: 281136

M  +1    -1    nepomuk/query/querybuilderdata_p.h

http://commits.kde.org/kdelibs/fc47a60f182ab33d1c8c73fbb4cd891721bdd6e8

Comment 6 Alejandro Nova 2011-10-01 00:21:48 UTC

One NEPOMUK bug less. Thanks, Sebastian!

Comment 7 David Faure 2011-10-04 14:47:55 UTC

Git commit be0187d23b757eaaf8612d766b6d6a72e4e4f9ff by David Faure, on behalf of Sebastian Trueg.
Committed on 30/09/2011 at 17:08.
Pushed by dfaure into branch 'frameworks'.

Convert search excerpt method input to utf8.

Virtuoso's bif:search_excerpt does not understand Unicode characters,
thus the input has to be converted to utf8.
Hopefully this will be done automatically in a future version of
Virtuoso.

BUG: 281136

M  +1    -1    nepomuk/query/querybuilderdata_p.h

http://commits.kde.org/kdelibs/be0187d23b757eaaf8612d766b6d6a72e4e4f9ff

Comment 8 Sebastian Trueg 2011-10-05 13:57:44 UTC

See
http://sourceforge.net/tracker/?func=detail&aid=3418436&group_id=161622&atid=820574 for the upstream Virtuoso bug.
An upstream fix is still required, since with the fix above Unicode characters in search excerpts are simply ignored.

Comment 9 Sebastian Trueg 2011-10-07 09:53:08 UTC

Git commit 9b9502bd5c3e46d922a639b6a9afb6bccd066e15 by Sebastian Trueg.
Committed on 07/10/2011 at 11:52.
Pushed by trueg into branch 'KDE/4.7'.

Now a "real" hacky "fix" for the problem with wide characters in queries.

My last "fix" just used bif:charset_recode on the search terms. This,
however, is no real solution as it makes the query fail if the terms
do not contain any wide character. Thus, the problem was simply
inverted.
Now I do it the brute-force way: I simply truncate the terms at the
first wide char and use the rest. The search excerpts have the same
quality as with the bif:charset_recode hack but it works for wide and
non-wide terms.

CCBUG: 281136

M  +37   -8    nepomuk/query/querybuilderdata_p.h

http://commits.kde.org/kdelibs/9b9502bd5c3e46d922a639b6a9afb6bccd066e15
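
The brute-force approach described in the commit message above might look roughly like the sketch below. This is not the code from querybuilderdata_p.h, which is not reproduced in this report. It rests on two assumptions: that a "wide char" means any non-ASCII character, and that "use the rest" means keeping the part of the term before that character. The term is assumed to be UTF-8 encoded, and the name asciiPrefix is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, for illustration only (not the Nepomuk code):
// returns the part of a UTF-8 encoded search term that precedes its
// first non-ASCII character. In UTF-8 every byte of a non-ASCII
// character has its high bit set, so a byte-wise scan is sufficient.
std::string asciiPrefix(const std::string& utf8Term)
{
    auto firstWide = std::find_if(
        utf8Term.begin(), utf8Term.end(),
        [](char c) { return static_cast<unsigned char>(c) > 0x7F; });
    return std::string(utf8Term.begin(), firstWide);
}
```

Under these assumptions, the term "Constitución" is reduced to "Constituci" for the excerpt call, and a pure ASCII term such as "nepomuk" passes through unchanged. A term that starts with a non-ASCII character yields an empty string, which a caller would have to handle separately.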

Comment 10 Sebastian Trueg 2011-10-07 09:53:57 UTC

Git commit 2bfb545abd111437cab2a73f7d223312323197e3 by Sebastian Trueg.
Committed on 07/10/2011 at 11:47.
Pushed by trueg into branch 'master'.

Now a "real" hacky "fix" for the problem with wide characters in queries.

My last "fix" just used bif:charset_recode on the search terms. This,
however, is no real solution as it makes the query fail if the terms
do not contain any wide character. Thus, the problem was simply
inverted.
Now I do it the brute-force way: I simply truncate the terms at the
first wide char and use the rest. The search excerpts have the same
quality as with the bif:charset_recode hack but it works for wide and
non-wide terms.

CCBUG: 281136

M  +37   -8    libnepomukcore/query/querybuilderdata_p.h

http://commits.kde.org/nepomuk-core/2bfb545abd111437cab2a73f7d223312323197e3