Version: unspecified (using KDE 4.7.0)
OS: Linux
When I try to use the deep indexing capabilities of NEPOMUK, and try to look for a word containing extended characters, NEPOMUK fails. This is a hard regression, since KDE 4.6 worked well here.
Reproducible: Always
Steps to Reproduce:
1. Create a file named "Constitución".
2. Index it using Strigi.
3. Try to find it using the KRunner Nepomuk runner.
Actual Results: You won't find the file.
Expected Results: You find the file.
This is NOT bug 271664. I've done all the testing with Virtuoso 6.1.2, precisely to avoid bug 271664.
Tried again, with Virtuoso 6.1.3 + the upstream patch fixing bug 271664.
1. NEPOMUK stores data with "extended characters" correctly. I can even search for it.
2. I can search for contacts with extended characters in KRunner.
3. In the test case above, "Constitución" does not work, but "Constituci?n" works, and finds meaningful results.
This looks like a Strigi issue, produced by testing the indexing code against a Virtuoso 6.1.3 in which bug 271664 was still unfixed. If that's the case, then this is a Strigi bug. Can anybody confirm?
Confirmed with a patched copy of Virtuoso 6.1.3 and narrowed down: this is a NEPOMUK query generator bug.
1. If I search for "constitución" with KRunner, I get nothing.
2. If I search for "constitución" with Dolphin filename search, I get proper results.
3. If I search for "constitución" with Dolphin deep search, I get nothing.
4. If I search for "constitución" using Nepoogle, I get proper results.
Nepoogle also gives me proper results when the keyword is inside the file, so this is neither a Strigi indexer bug nor an encoding error; it's the query generator that doesn't know how to handle UTF-8 input.
(In reply to comment #2)
> Confirmed with a patched copy of Virtuoso 6.1.3 and narrowed: this is a NEPOMUK
> query generator bug.
>
> 1. If I look for "constitución" with KRunner, I get nothing.
> 2. If I look for "constitución" with Dolphin filename search, I get proper
> results.
> 3. If I look for "constitución" with Dolphin deep search, I get nothing.
> 4. If I look for "constitución" using Nepoogle, I get proper results. Also,
> Nepoogle gives me proper results when the keyword is inside the file, so, this
> is neither a Strigi indexer bug nor an encoding error; it's the query generator
> who doesn't know how to handle UTF-8 input.
This is actually not a problem of Nepomuk not handling UTF-8 correctly. It does. The problem is the search excerpts: Virtuoso cannot handle non-ASCII chars in search excerpt queries. I will investigate how to solve this.
Git commit 1e85e3cf19b95febeb7b53f42b1a39f900bf6dd4 by Sebastian Trueg. Committed on 30/09/2011 at 17:10. Pushed by trueg into branch 'master'. Convert search excerpt method input to utf8. Virtuoso's bif:search_excerpt does not understand unicode character, thus the input has to be converted to utf8. Hopefully this will be done automatically in a future version of Virtuoso. CCBUG: 281136 M +1 -1 libnepomukcore/query/querybuilderdata_p.h http://commits.kde.org/nepomuk-core/1e85e3cf19b95febeb7b53f42b1a39f900bf6dd4
Git commit fc47a60f182ab33d1c8c73fbb4cd891721bdd6e8 by Sebastian Trueg. Committed on 30/09/2011 at 17:08. Pushed by trueg into branch 'KDE/4.7'. Convert search excerpt method input to utf8. Virtuoso's bif:search_excerpt does not understand unicode character, thus the input has to be converted to utf8. Hopefully this will be done automatically in a future version of Virtuoso. BUG: 281136 M +1 -1 nepomuk/query/querybuilderdata_p.h http://commits.kde.org/kdelibs/fc47a60f182ab33d1c8c73fbb4cd891721bdd6e8
One NEPOMUK bug less. Thanks, Sebastian!
Git commit be0187d23b757eaaf8612d766b6d6a72e4e4f9ff by David Faure, on behalf of Sebastian Trueg. Committed on 30/09/2011 at 17:08. Pushed by dfaure into branch 'frameworks'. Convert search excerpt method input to utf8. Virtuoso's bif:search_excerpt does not understand unicode character, thus the input has to be converted to utf8. Hopefully this will be done automatically in a future version of Virtuoso. BUG: 281136 M +1 -1 nepomuk/query/querybuilderdata_p.h http://commits.kde.org/kdelibs/be0187d23b757eaaf8612d766b6d6a72e4e4f9ff
See http://sourceforge.net/tracker/?func=detail&aid=3418436&group_id=161622&atid=820574 for the upstream Virtuoso bug. The workaround on our side is still required: even with the upstream fix, Unicode chars in search excerpts are simply ignored.
Git commit 9b9502bd5c3e46d922a639b6a9afb6bccd066e15 by Sebastian Trueg. Committed on 07/10/2011 at 11:52. Pushed by trueg into branch 'KDE/4.7'. Now a "real" hacky "fix" for the problem with wide characters in queries. My last "fix" just used bif:charset_recode on the search terms. This, however is no real solution as it makes the query fail if the terms do not contain any wide character. Thus, the problem was simply inverted. How I do it the brute-force way: I simply truncate the terms at the first wide char and use the rest. The search excerpts have the same quality as with the bif:charset_recode hack but it works for wide and non-wide terms. CCBUG: 281136 M +37 -8 nepomuk/query/querybuilderdata_p.h http://commits.kde.org/kdelibs/9b9502bd5c3e46d922a639b6a9afb6bccd066e15
Git commit 2bfb545abd111437cab2a73f7d223312323197e3 by Sebastian Trueg. Committed on 07/10/2011 at 11:47. Pushed by trueg into branch 'master'. Now a "real" hacky "fix" for the problem with wide characters in queries. My last "fix" just used bif:charset_recode on the search terms. This, however is no real solution as it makes the query fail if the terms do not contain any wide character. Thus, the problem was simply inverted. How I do it the brute-force way: I simply truncate the terms at the first wide char and use the rest. The search excerpts have the same quality as with the bif:charset_recode hack but it works for wide and non-wide terms. CCBUG: 281136 M +37 -8 libnepomukcore/query/querybuilderdata_p.h http://commits.kde.org/nepomuk-core/2bfb545abd111437cab2a73f7d223312323197e3
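For readers who want to see what the truncation described in the commit message amounts to, here is a minimal sketch. It is not the code from querybuilderdata_p.h: the function name asciiPrefix is hypothetical, it works on a UTF-8 std::string rather than Qt string types, and it assumes that "wide char" means any non-ASCII character and that the part kept is the text before that character.

```cpp
#include <string>

// Hypothetical helper, NOT the code from the commit: returns the leading
// ASCII-only part of a UTF-8 encoded search term.
// Assumption: a "wide" character is any character outside the ASCII range.
std::string asciiPrefix(const std::string& utf8Term)
{
    std::string::size_type i = 0;
    // In UTF-8 every byte of a non-ASCII character has its high bit set,
    // so the first byte >= 0x80 marks the start of the first wide character.
    while (i < utf8Term.size()
           && static_cast<unsigned char>(utf8Term[i]) < 0x80) {
        ++i;
    }
    // Keep everything before that byte; pure-ASCII terms are returned unchanged.
    return utf8Term.substr(0, i);
}
```

With this sketch, the UTF-8 term "constitución" is reduced to "constituci", while a plain ASCII term such as "nepomuk" passes through untouched, which matches the commit's claim that the approach works for both wide and non-wide terms.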