Bug 333037

Summary: Dolphin does not return every filename, in-which search strings (terms) are partially contained, even when wildcards are in use
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: mau <b-misc>
Component: Engine
Assignee: Pinak Ahuja <pinak.ahuja>
Status: REOPENED ---
Severity: wishlist
CC: alainbidon.anderlini, alexis.gm, aspotashev, bugseforuns, cjacker, gekylafas, gorilych, hein, ict, jonnymo, kdeuser56, leodream2008, nate, nujucobian, pinak.ahuja, s.kenn, tommi.nieminen, ukyoi, yannis.zip
Priority: NOR    
Version: 5.16.0   
Target Milestone: ---   
Platform: Kubuntu   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=362647
Latest Commit:
Version Fixed In:
Attachments: Terminal Output
Dolphin output
Dolphin output
Dolphin baloo-search
Dolphin baloo-search

Description mau 2014-04-03 17:09:47 UTC
If you have a file named testfile, you can find it by searching for "testf", but not by searching for "estfile". This is not really a bug, but it might nevertheless surprise some users.

Reproducible: Always
Comment 1 Vishesh Handa 2014-04-04 08:16:13 UTC
I'm sorry, but this just cannot be fixed. Searching for words from the middle would require massive changes in Baloo. It would also result in us having to do double the amount of work: double the CPU and I/O usage, and double the disk space used. The complexity of the code would also increase.

I'm not inclined to fix this. No one else provides searching through the middle of a word.
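
To illustrate the cost argument, here is a minimal sketch. It assumes the index is nothing more than a sorted list of terms, which is a simplification and not how Baloo actually stores its data; `prefixMatches` and `substringMatches` are hypothetical names. A prefix query can binary-search the sorted terms, while a middle-of-word query has no ordering to exploit and must visit every term (or rely on an additional index).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper. Precondition: sortedTerms is sorted ascending.
// Binary-searches to the first candidate, then walks forward while the
// prefix still matches.
std::vector<std::string> prefixMatches(const std::vector<std::string>& sortedTerms,
                                       const std::string& prefix)
{
    std::vector<std::string> out;
    auto it = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    for (; it != sortedTerms.end() && it->compare(0, prefix.size(), prefix) == 0; ++it) {
        out.push_back(*it);
    }
    return out;
}

// Hypothetical helper. A substring query cannot use the sort order,
// so it scans every term.
std::vector<std::string> substringMatches(const std::vector<std::string>& terms,
                                          const std::string& needle)
{
    std::vector<std::string> out;
    for (const std::string& term : terms) {
        if (term.find(needle) != std::string::npos) {
            out.push_back(term);
        }
    }
    return out;
}
```

With the terms "apple", "testfile", "testing" and "zebra", a prefix lookup for "testf" finds "testfile", a prefix lookup for "estfile" finds nothing, and only the full scan finds "testfile" from "estfile".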
Comment 2 mau 2014-04-04 08:35:13 UTC
Sorry, but it is not true that no one else provides this kind of search functionality: In Windows it works, but only if you type a * before your search term. In Dolphin this doesn't work.
Comment 3 Vishesh Handa 2014-04-04 08:45:30 UTC
Alright. I'll keep this open for now. Though I'm really not sure how to implement this efficiently. Maybe someone else knows a better way.

In Windows, when adding a '*', does that work for both content and the filename?
Comment 4 mau 2014-04-04 09:00:54 UTC
For filenames it works; for the content I don't have the impression that this kind of search really works...

IMHO searching for partial strings is more important for filenames than for the content.
Comment 5 Vishesh Handa 2014-04-04 10:31:12 UTC
Hmm, for filenames this can possibly be done. I'll see. But don't hold your breath.
Comment 6 Vishesh Handa 2014-07-09 09:40:52 UTC
Git commit f6d91a14b792852b71cdabf6769c4e36cfc3f8ee by Vishesh Handa.
Committed on 09/07/2014 at 09:46.
Pushed by vhanda into branch 'frameworks'.

FileSearch: Allow searching with wildcards in the filename

With this, when searching explicitly for the property "filename", the
value can accept some standard glob terms such as "*", "." and "?".
These globs can appear anywhere. This operation is NOT cheap, and it is
implemented by running a regular expression over each filename in the
database. It should only be used as a last resort.

This does not solve this bug as this is only activated when explicitly
searching for the filename. We still need to hook this up to the
frontend for the user to be able to choose "search in filename". This is
done in dolphin, but it might be nice to do it in other places such as
"krunner" as well.

M  +4    -1    src/file/search/CMakeLists.txt
M  +1    -0    src/file/search/autotest/CMakeLists.txt
M  +73   -0    src/file/search/autotest/filesearchstoretest.cpp
M  +2    -0    src/file/search/autotest/filesearchstoretest.h
M  +6    -0    src/file/search/filesearchstore.cpp
A  +124  -0    src/file/search/wildcardpostingsource.cpp     [License: LGPL (v2.1+)]
A  +64   -0    src/file/search/wildcardpostingsource.h     [License: LGPL (v2.1+)]

http://commits.kde.org/baloo/f6d91a14b792852b71cdabf6769c4e36cfc3f8ee
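
The approach described in the commit message (translate the glob, then run a regular expression over every filename) can be sketched roughly as follows. This is not the code from wildcardpostingsource.cpp: it uses `std::regex` from the standard library, handles only "*" and "?", treats every other character literally, matches case-insensitively, and works on bytes rather than Unicode characters. `globToRegex` and `wildcardSearch` are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: translate a glob into an ECMAScript regex.
// '*' becomes ".*", '?' becomes ".", regex metacharacters are escaped.
std::string globToRegex(const std::string& glob)
{
    static const std::string special = R"rx(\^$.|+()[]{})rx";
    std::string rx;
    for (char c : glob) {
        if (c == '*') {
            rx += ".*";
        } else if (c == '?') {
            rx += '.';
        } else {
            if (special.find(c) != std::string::npos) {
                rx += '\\';
            }
            rx += c;
        }
    }
    return rx;
}

// Hypothetical helper: test the whole filename against the glob.
// Every filename is visited, which is why this is not cheap.
std::vector<std::string> wildcardSearch(const std::vector<std::string>& fileNames,
                                        const std::string& glob)
{
    const std::regex rx(globToRegex(glob),
                        std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> hits;
    for (const std::string& name : fileNames) {
        if (std::regex_match(name, rx)) {
            hits.push_back(name);
        }
    }
    return hits;
}
```

Because the cost grows with the number of filenames in the database, this only makes sense as an explicit, opt-in query.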
Comment 7 Ukyoi 2014-07-13 09:09:48 UTC
This issue is very serious in the CJK (Chinese, Japanese, Korean) world. It makes baloo even *unusable*.
CJK languages don't use spaces to separate words. For example, if I have a file named 桌面搜索 (meaning desktop search) and I can only remember 搜索 (search) but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has only one (none or all) toggle (on KDE 4.13), I have to turn off baloo entirely to make sure searching works properly.

As a workaround, would it be possible to provide an option that lets the traditional search function run automatically after the baloo search?
Comment 8 kdeuser56 2014-09-21 11:14:52 UTC
Vishesh, you have done great work with baloo, but I agree with the others that this feature is vital. Using grep is much more useful atm. If this operation is that much more expensive, consider adding an option like "partial words".
Comment 9 Vishesh Handa 2014-09-21 12:40:19 UTC
Update report -

We have been working on a user visible query parser which has a simple syntax such as "title:Foo OR artist:Coldplay". It also supports some basic parentheses. The idea is to expose this query language everywhere where Baloo is used. This would include krunner as well.

So, you should be able to do "filename:foo". This would find all files which begin with foo. If you want to match characters in the middle, you can use wildcards: "filename:*foo". Any combination of * can be used. This extra * based approach is more expensive, so it might be about 100-500 msecs slower, depending on your database size.

This * based approach is currently only implemented for filenames. We can easily extend it to others though. I'll close this bug when krunner uses this parser. I want to get that done in time for Plasma 5.1.
Comment 10 mau 2014-09-30 14:44:07 UTC
I already read about the query parser which seems to be really nice - for more sophisticated users/queries. But does that mean that one has to enter "filename:myFileName" every time or does krunner still work with "myFileName" alone?
Comment 11 Vishesh Handa 2014-09-30 22:11:44 UTC
(In reply to mau from comment #10)
> I already read about the query parser which seems to be really nice - for
> more sophisticated users/queries. But does that mean that one has to enter
> "filename:myFileName" every time or does krunner still work with "myFileName"
> alone?

KRunner will still work with "myFileName" as well.
Comment 12 alexis.gm 2014-10-01 23:05:54 UTC
I believe that this functionality was working previously, before Baloo was used (so, Nepomuk?). For example, I had a folder full of file names like:

quiz1_w2014
quiz1_f2012
quiz3_w2011
quiz1_w2014

I used to be able to just search for "w2014" and get all the relevant quizzes.   The way that search is now implemented is almost useless (in my opinion).   It is very frustrating that search worked perfectly before but now with baloo there is this huge regression in functionality.   I've pretty much given up on searching from within dolphin, and use the terminal to " find . -n *w2011* -R" which is certainly not very handy, but at least it works.

As other people have mentioned, searching with wildcards works perfectly in other operating systems.
Comment 13 alexis.gm 2014-10-01 23:13:22 UTC
Sorry, I messed up my previous comment... the search command should have been "find . -name *w2011*"
Comment 14 Vishesh Handa 2014-10-02 13:23:24 UTC
Hey Alexis

(In reply to alexis.gm from comment #12)
> I believe that this functionality was working previously, before Baloo was
> used (so, Nepomuk?).   For example, I had a folder full of file names like:
> 
> quiz1_w2014
> quiz1_f2012
> quiz3_w2011
> quiz1_w2014
> 
> I used to be able to just search for "w2014" and get all the relevant
> quizzes.   The way that search is now implemented is almost useless (in my
> opinion). 

With Plasma 5.1, just searching for "w2014" will work. Before 5.1, we did not take _ into consideration.

Wildcard searching has also been implemented for filenames from 5.1, it just is not yet exposed in KRunner. It will definitely be there in 5.2.
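
The remark about "_" can be made concrete with a small sketch. It assumes a deliberately simple tokenizer that splits on anything other than ASCII letters and digits and lowercases the rest; this is not Baloo's actual term generator, and `splitTerms` and `matchesWordStart` are hypothetical names. Once "_" counts as a separator, "w2014" becomes a term of its own and a word-start search can find it.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical tokenizer: splits on every byte that is not an ASCII letter
// or digit (in the default "C" locale) and lowercases the terms.
// Non-ASCII bytes therefore also act as separators in this sketch.
std::vector<std::string> splitTerms(const std::string& fileName)
{
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : fileName) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        terms.push_back(current);
    }
    return terms;
}

// Hypothetical matcher: true if any term of the filename starts with the
// (lowercased) query. Text in the middle of a term is never matched.
bool matchesWordStart(const std::string& fileName, const std::string& query)
{
    std::string q;
    for (unsigned char c : query) {
        q += static_cast<char>(std::tolower(c));
    }
    for (const std::string& term : splitTerms(fileName)) {
        if (term.compare(0, q.size(), q) == 0) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions "quiz1_w2014" yields the terms "quiz1" and "w2014", so a search for "w2014" matches, whereas a search for "2014" would still fail because it starts in the middle of a term.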
Comment 15 Eike Hein 2015-01-18 15:20:29 UTC
Vishesh, I hope you don't mind me barging in here, but from skimming the report - does Baloo still do anything in the way of building word indices based on using whitespace as word boundaries, or does it properly use QTextBoundaryFinder these days?
Comment 16 Vishesh Handa 2015-01-19 17:17:03 UTC
(In reply to Eike Hein from comment #15)
> Vishesh, I hope you don't mind me barging in here, but from skimming the
> report - does Baloo still do anything in the way of building word indices
> based on using whitespace as word boundaries, or does it properly use
> QTextBoundaryFinder these days?

Not at all Eike. It uses QTextBoundaryFinder since Plasma 5.1. Previously we were relying on Xapian's internal splitting, which was not ideal.
Comment 17 Eike Hein 2015-01-19 17:24:49 UTC
Good :)
Comment 18 alexis.gm 2015-02-04 10:38:23 UTC
I was looking forward to plasma 5.2 in order to see if the search was working better.   Unfortunately, now search doesn't work at all anymore.   

For example, I open Dolphin, and while I'm at my home directory, I try to search for "Desktop" and I get no result.   I try to search for any file or folder name and nothing comes up.   Yet, the "filter" tool at the bottom can correctly find things that are in the current working directory.   

I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.
Comment 19 Vishesh Handa 2015-02-04 11:32:31 UTC
(In reply to alexis.gm from comment #18)
> I was looking forward to plasma 5.2 in order to see if the search was
> working better.   Unfortunately, now search doesn't work at all anymore.   
> 
> For example, I open Dolphin, and while I'm at my home directory, I try to
> search for "Desktop" and I get no result.   I try to search for any file or
> folder name and nothing comes up.   Yet, the "filter" tool at the bottom can
> correctly find things that are in the current working directory.   
> 
> I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.

Please use `baloosearch` and report your findings. Dolphin uses kio, which then uses baloo. It could be any one of those things going wrong. Also, Dolphin is currently still using the qt4 version. For all we know something might have gone wrong in between the compatibility layer. Also, check if you have the baloosearch kioslave for qt4. Kubuntu packaging is a bit strange, they seem to put it in a non default package.
Comment 20 alexis.gm 2015-02-05 05:46:42 UTC
(In reply to Vishesh Handa from comment #19)
> (In reply to alexis.gm from comment #18)
> > I was looking forward to plasma 5.2 in order to see if the search was
> > working better.   Unfortunately, now search doesn't work at all anymore.   
> > 
> > For example, I open Dolphin, and while I'm at my home directory, I try to
> > search for "Desktop" and I get no result.   I try to search for any file or
> > folder name and nothing comes up.   Yet, the "filter" tool at the bottom can
> > correctly find things that are in the current working directory.   
> > 
> > I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.
> 
> Please use `baloosearch` and report your findings. Dolphin uses kio, which
> then uses baloo. It could be any one of those things going wrong. Also,
> Dolphin is currently still using the qt4 version. For all we know something
> might have gone wrong in between the compatibility layer. Also, check if you
> have the baloosearch kioslave for qt4. Kubuntu packaging is a bit strange,
> they seem to put it in a non default package.

Strange.   I tried searching for something generic from the command line with baloosearch:

"baloosearch document"

and no results were returned.   I went to the search settings and played around with the "file search" option.   I enabled/disabled "file search", as well as pressed the "Defaults" option.   At this point the command line "baloosearch" worked, as well as searching from Dolphin.   

I tried enabling/disabling file search again and this resulted in a baloo crash.   Now when I try baloosearch from the terminal, I get the error message:  "baloosearch(4728): Xapian Database does not exist at  "/home/foo/.local/share/baloo/file/" "

Nevertheless, searching within Dolphin still works, which makes me happy.
Comment 21 Shinjo Park 2015-07-14 15:25:43 UTC
(In reply to Ukyoi from comment #7)
> This issue is very serious in the CJK (Chinese, Japanese, Korean) world. It
> makes baloo even *unusable*.
> CJK languages don't use spaces to separate words. For example, if I have
> a file named 桌面搜索 (meaning desktop search) and I can only remember 搜索 (search)
> but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> only one (none or all) toggle (on KDE 4.13), I have to turn off baloo
> entirely to make sure searching works properly.
> 
> As a workaround, would it be possible to provide an option that lets the
> traditional search function run automatically after the baloo search?

Just FYI, Korean language has similar spacing rules as most of European languages. When it comes to file names or other space constrained input fields, spacing between multiple words are sometimes omitted.
Comment 22 Ukyoi 2015-09-07 14:43:15 UTC
(In reply to Shinjo Park from comment #21)
> (In reply to Ukyoi from comment #7)
> > This issue is very serious in the CJK (Chinese, Japanese, Korean) world. It
> > makes baloo even *unusable*.
> > CJK languages don't use spaces to separate words. For example, if I have
> > a file named 桌面搜索 (meaning desktop search) and I can only remember 搜索 (search)
> > but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> > only one (none or all) toggle (on KDE 4.13), I have to turn off baloo
> > entirely to make sure searching works properly.
> > 
> > As a workaround, would it be possible to provide an option that lets the
> > traditional search function run automatically after the baloo search?
> 
> Just FYI, Korean language has similar spacing rules as most of European
> languages. When it comes to file names or other space constrained input
> fields, spacing between multiple words are sometimes omitted.

Thanks for indicating that.

But now, baloo tends to ignore all the CJKV characters and it will "find" all CJKV names whatever you input.
Comment 23 Cjacker 2015-11-19 13:02:19 UTC
(In reply to Ukyoi from comment #22)
> (In reply to Shinjo Park from comment #21)
> > (In reply to Ukyoi from comment #7)
> > > This issue is very serious in the CJK (Chinese, Japanese, Korean) world. It
> > > makes baloo even *unusable*.
> > > CJK languages don't use spaces to separate words. For example, if I have
> > > a file named 桌面搜索 (meaning desktop search) and I can only remember 搜索 (search)
> > > but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> > > only one (none or all) toggle (on KDE 4.13), I have to turn off baloo
> > > entirely to make sure searching works properly.
> > > 
> > > As a workaround, would it be possible to provide an option that lets the
> > > traditional search function run automatically after the baloo search?
> > 
> > Just FYI, Korean language has similar spacing rules as most of European
> > languages. When it comes to file names or other space constrained input
> > fields, spacing between multiple words are sometimes omitted.
> 
> Thanks for indicating that.
> 
> But now, baloo tends to ignore all the CJKV characters and it will "find"
> all CJKV names whatever you input.

CJK support is totally broken (Frameworks 5.16.0).

For example, for a file named "<Chinese>.xls", 'baloosearch xls' will find the file, but 'baloosearch <Any Chinese Character>' returns an empty result.
Comment 24 Cjacker 2015-11-19 13:13:28 UTC
It seems the indexer ignores all Chinese characters.

$ balooshow -x 测试中文.txt  
6211562092103681 2049 1446242 /home/cjacker/测试中文.txt

Internal Info
Terms: Mapplication Mx Mzerosize 
File Name Terms: Ftxt txt 
XAttr Terms: 

$ balooshow -x a.txt 
6306729977448449 2049 1468400 /home/cjacker/a.txt

Internal Info
Terms: Mplain Mtext T5 T8 
File Name Terms: Fa Ftxt a txt 
XAttr Terms:
Comment 25 Cjacker 2015-11-20 08:17:37 UTC
More information:

QTextBoundaryFinder (Qt 5.5) ignores all CJKV characters.
Thus in engine/termgenerator.cpp, termList drops all CJKV characters, and only Latin terms are generated and indexed.

In engine/queryparser.cpp, the query string is also processed by QTextBoundaryFinder, so all CJKV characters are ignored there too.

What I do now is index every Chinese character as a term and use the 'AND' method to search.

Here is a patch that only supports Chinese, but it should be extended to support CJKV.

diff -Nur baloo-5.16.0/src/engine/queryparser.cpp baloo-5.16.0n/src/engine/queryparser.cpp
--- baloo-5.16.0/src/engine/queryparser.cpp	2015-11-08 20:08:54.000000000 +0800
+++ baloo-5.16.0n/src/engine/queryparser.cpp	2015-11-20 13:22:44.928016124 +0800
@@ -161,7 +161,18 @@
         queries << phraseQueries;
         phraseQueries.clear();
     }
-
+    //detect text contains chinese or not.
+    //if contain chinese, every chinese character should be a term.
+    int nCount = text.count();
+    for(int i = 0 ; i < nCount ; i++)
+    {
+        QChar cha = text.at(i);
+        ushort uni = cha.unicode();
+        if(uni >= 0x4E00 && uni <= 0x9FA5)
+        {  
+           queries << EngineQuery(QString(cha).toUtf8(), EngineQuery::StartsWith);
+        }   
+    }   
     if (queries.size() == 1) {
         return queries.first();
     }
diff -Nur baloo-5.16.0/src/engine/termgenerator.cpp baloo-5.16.0n/src/engine/termgenerator.cpp
--- baloo-5.16.0/src/engine/termgenerator.cpp	2015-11-08 20:08:54.000000000 +0800
+++ baloo-5.16.0n/src/engine/termgenerator.cpp	2015-11-20 13:22:45.432016115 +0800
@@ -99,6 +99,19 @@
 void TermGenerator::indexFileNameText(const QString& text, const QByteArray& prefix, int wdfInc)
 {
     QStringList terms = termList(text);
+    //detect text contains chinese or not.
+    //if contain chinese, every chinese character should be a term.
+    int nCount = text.count();  
+    for(int i = 0 ; i < nCount ; i++)  
+    {  
+        QChar cha = text.at(i);  
+        ushort uni = cha.unicode();  
+        if(uni >= 0x4E00 && uni <= 0x9FA5)  
+        {
+           terms<<QString(cha);  
+        }  
+    }
+
     for (const QString& term : terms) {
         QByteArray arr = term.toUtf8();
 
After the above patch, there is another minor issue: if a filename contains "你好", searching for either "你好" or "好你" will find the file, since the query string is separated into single characters that are combined with "AND", and "你" AND "好" is equivalent to "好" AND "你". Introducing a more complex string-cutting method would resolve this, but the above solution has the best performance for supporting Chinese search.

I also considered adding semantic segmentation of Chinese strings to baloo to support "meaningful" searching, but it would cause a high workload.

At least, it can index/search Chinese now.
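
The order-insensitivity described above can be reproduced outside Baloo with a small sketch. It uses plain `std::u32string` and `std::set` instead of the Qt and Baloo types in the patch, keeps the same U+4E00..U+9FA5 range check, and the names `isChineseChar`, `chineseTerms` and `matchesAllTerms` are hypothetical.

```cpp
#include <set>
#include <string>

// Same range test as in the patch: the basic CJK Unified Ideographs
// block, U+4E00 through U+9FA5.
bool isChineseChar(char32_t c)
{
    return c >= 0x4E00 && c <= 0x9FA5;
}

// Hypothetical helper: every Chinese character becomes its own term.
std::set<char32_t> chineseTerms(const std::u32string& text)
{
    std::set<char32_t> terms;
    for (char32_t c : text) {
        if (isChineseChar(c)) {
            terms.insert(c);
        }
    }
    return terms;
}

// Hypothetical helper: AND semantics, every query term must be present.
// Character order is lost, so a reversed query matches as well.
bool matchesAllTerms(const std::u32string& fileName, const std::u32string& query)
{
    const std::set<char32_t> indexed = chineseTerms(fileName);
    const std::set<char32_t> wanted = chineseTerms(query);
    if (wanted.empty()) {
        return false; // nothing this sketch can match on
    }
    for (char32_t c : wanted) {
        if (indexed.count(c) == 0) {
            return false;
        }
    }
    return true;
}
```

For a file named 你好.txt, both the query 你好 and the query 好你 match, which is exactly the side effect noted above; restoring order sensitivity would need positional information or a phrase check on top of the per-character terms.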
Comment 26 Vishesh Handa 2015-12-14 22:42:54 UTC
@CJacker: Thanks for the patch. Do you think it would be possible for you to submit this via reviewboard.kde.org along with a few test cases? If not, maybe you could just provide some test cases, I'll be happy to translate them into actual code.
Comment 27 anderlia 2016-10-06 14:36:19 UTC
Created attachment 101448 [details]
Terminal Output
Comment 28 anderlia 2016-10-06 14:37:20 UTC
Dear Vishesh,

Coming back on searching with wildcard * ...

I'm using Kubuntu 16.04 with plasma 5.6, and the search in Dolphin only returns partial results although all files have been properly indexed by Baloo. 

Attached is a screenshot of a terminal output looking for all files in the home directory including the string "alander" by means of the find command. A total of 31 files have been found with that method.

If I do the same query now using the search function of Dolphin, looking for filename only from home directory, and using "alander" as search string, I only get 12 files displayed (out of the 31): these are all files including the string "alander" at the beginning of a word (a filename typically contains several words separated by spaces or underscores). For instance, the file "IMG0423 alanderfredcaroline.JPG" is correctly hit, but the file "P1029318alandermarie.JPG" is not found. See second screenshot.

I have further checked whether the files that weren't hit by Dolphin search were correctly indexed by Baloo (with balooshow command), and they are all properly indexed with all metadata displayed.

If I do the same query in Dolphin with wildcards, namely "*alander" or "*alander*" as search string, then only 4 files are displayed (the ones ending with "alander", such as the file "IMG0417 alander.JPG"). See third screenshot. Same results with baloosearch command: "baloosearch filename:*alander*" only hits the same 4 files. 

Any idea? Please let me know if you need any further debugging info. Thx.
Comment 29 anderlia 2016-10-06 14:40:47 UTC
Created attachment 101449 [details]
Dolphin output
Comment 30 anderlia 2016-10-06 14:42:04 UTC
Created attachment 101450 [details]
Dolphin output
Comment 31 Jonny Mo 2017-08-03 16:40:27 UTC
Yes, the Dolphin search cannot find filenames in certain circumstances. Example:

A search for "veranst" finds the file/folder
"Veranstaltungskalender Urban"
"035A_V1_Veranstaltungsplan_Teilnehmerlisten.pdf"

But a search for "eranst" or "*eranst*" finds nothing.

It seems that the search string has to be at the beginning
- of the filename
- of a word after a space
- of a word after an underscore character
- or something like that.

In my opinion, a user expects to find EVERY file whose name includes the string "eranst", at least when using wildcards.

A small but important issue. Search is an essential function for a file manager.

Tested on Archlinux with versions:
- plasma-desktop 5.10.4-1
- plasma-framework 5.36.0-2
- KDE Applications 17.04.3-1
Comment 32 Tommi Nieminen 2019-02-24 20:27:33 UTC
I have to agree with the above commenters: it’s absolutely vital to be able to search partial filenames. Maybe it’s not so important in English where compounds are written with spaces (thus permitting word-initial searches to succeed) but for, say, Finnish, Swedish, German, or Russian it means you cannot find files with the equivalent for “test” *somewhere* in the name, only if it’s right at the beginning.

This is not to say I cannot appreciate the difficulties in its implementation!
Comment 33 Ioannis Iliadis-Ilousis 2020-04-05 11:27:03 UTC
For the moment KFind with wildcards is working right.
Comment 34 Ioannis Iliadis-Ilousis 2020-09-10 13:51:00 UTC
Dolphin does support RegEx (regular expressions), including lookahead. So one workaround (and also a solution for more advanced searches) is to use lookahead.
To find every file whose name contains the string string1, enter (?=.*string1) in the search field. To find every file whose name contains two or more strings (string1, string2, string3, etc.), repeat the expression once per string: for two, (?=.*string1)(?=.*string2); for three, (?=.*string1)(?=.*string2)(?=.*string3); and so on.
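
How the chained lookaheads behave can be checked with a short sketch. It uses `std::regex` (ECMAScript grammar, which supports `(?=...)`) purely to demonstrate the pattern; it says nothing about which regex engine Dolphin uses internally, and `escapeLiteral` and `containsAll` are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: escape regex metacharacters so that each search
// string is matched literally.
std::string escapeLiteral(const std::string& s)
{
    static const std::string special = R"rx(\^$.|?*+()[]{})rx";
    std::string out;
    for (char c : s) {
        if (special.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Hypothetical helper: builds (?=.*part1)(?=.*part2)... and searches.
// Each lookahead is tested from the same position, so the parts may
// appear in the filename in any order.
bool containsAll(const std::string& fileName, const std::vector<std::string>& parts)
{
    std::string pattern;
    for (const std::string& p : parts) {
        pattern += "(?=.*" + escapeLiteral(p) + ")";
    }
    return std::regex_search(fileName, std::regex(pattern));
}
```

Since every lookahead starts at the same position and consumes nothing, the pattern expresses "contains string1 AND contains string2" regardless of where each string occurs. The match is case-sensitive in this sketch.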
Comment 35 Sebastian Kenn 2021-01-15 17:39:11 UTC
The problem described still persists.
I use kde neon 20.04, plasma 5.20.5, up to date.

The problem occurs with searches on the home partition as well as with USB hard drives, with pdf as well as with video files (see attachment).
Comment 36 Sebastian Kenn 2021-01-15 17:42:33 UTC
Created attachment 134902 [details]
Dolphin baloo-search
Comment 37 Sebastian Kenn 2021-01-15 17:43:21 UTC
Created attachment 134903 [details]
Dolphin baloo-search