Summary: | Dolphin does not return every filename, in-which search strings (terms) are partially contained, even when wildcards are in use | ||
---|---|---|---|
Product: | [Frameworks and Libraries] frameworks-baloo | Reporter: | mau <b-misc> |
Component: | Engine | Assignee: | Pinak Ahuja <pinak.ahuja> |
Status: | REOPENED --- | ||
Severity: | wishlist | CC: | alainbidon.anderlini, alexis.gm, aspotashev, bugseforuns, cjacker, gekylafas, gorilych, hein, ict, jonnymo, kdeuser56, leodream2008, nate, nujucobian, pinak.ahuja, s.kenn, tommi.nieminen, ukyoi, yannis.zip |
Priority: | NOR | ||
Version: | 5.16.0 | ||
Target Milestone: | --- | ||
Platform: | Kubuntu | ||
OS: | Linux | ||
See Also: | https://bugs.kde.org/show_bug.cgi?id=362647 | ||
Latest Commit: | Search terms | Version Fixed In: | |
Sentry Crash Report: | |||
Attachments: |
Terminal Output
Dolphin output Dolphin output Dolphin baloo-search Dolphin baloo-search |
Description
mau
2014-04-03 17:09:47 UTC
I'm sorry, but this just cannot be fixed. Searching for words from the middle would require massive changes in Baloo. It would also result in us having to do double the amount of work - so double the cpu + io usage, double the disk space used. And the complexity of the code would also increase. I'm not inclined on fixing this. No one else provides searching through the middle of a word.

Sorry, but it is not true that no one else provides this kind of search functionality: In Windows it works, but only if you type a * before your search term. In Dolphin this doesn't work.

Alright. I'll keep this open for now. Though I'm really not sure how to implement this efficiently. Maybe someone else knows a better way. In Windows, when adding a '*', does that work for both content and the filename?

For filenames it works; for the content I don't have the impression that this kind of search really works... IMHO searching for partial strings is more important for filenames than for the content.

Hmm, for filenames this can possibly be done. I'll see. But don't hold your breath.

Git commit f6d91a14b792852b71cdabf6769c4e36cfc3f8ee by Vishesh Handa.
Committed on 09/07/2014 at 09:46.
Pushed by vhanda into branch 'frameworks'.

FileSearch: Allow searching with wildcards in the filename

With this when searching explicitly for the property "filename", the value can accept some standard glob terms such as "*", "." and "?". These globs can appear anywhere. This operation is NOT cheap, and it is implemented by running a regular expression over each filename in the database. It should only be used as a last resort.

This does not solve this bug as this is only activated when explicitly searching for the filename. We still need to hook this up to the frontend for the user to be able to choose "search in filename". This is done in dolphin, but it might be nice to do it in other places such as "krunner" as well.
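For readers following along, the approach the commit message describes (translate the glob to a regular expression, then run it over every filename) can be sketched roughly as below. This is not the code from wildcardpostingsource.cpp: `globToRegex` and `wildcardSearch` are hypothetical names, the filenames are held in a plain vector rather than a database, and only "*" and "?" are treated as globs.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not Baloo's code): turn a glob into an ECMAScript
// regex. "*" matches any run of characters, "?" matches exactly one, and
// every other regex metacharacter is escaped so it matches literally.
std::string globToRegex(const std::string& glob) {
    const std::string meta = "\\^$.|+()[]{}";
    std::string pattern;
    for (char c : glob) {
        if (c == '*') {
            pattern += ".*";
        } else if (c == '?') {
            pattern += '.';
        } else {
            if (meta.find(c) != std::string::npos) {
                pattern += '\\';
            }
            pattern += c;
        }
    }
    return pattern;
}

// Test the pattern against every stored filename. The loop visits each
// name once, so the cost grows with the number of indexed files.
std::vector<std::string> wildcardSearch(const std::vector<std::string>& fileNames,
                                        const std::string& glob) {
    const std::regex re(globToRegex(glob));
    std::vector<std::string> hits;
    for (const std::string& name : fileNames) {
        if (std::regex_match(name, re)) {
            hits.push_back(name);
        }
    }
    return hits;
}
```

Because every name has to be visited, a scan like this cannot use the term index at all, which is presumably why the commit calls it a last resort.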
M +4 -1 src/file/search/CMakeLists.txt
M +1 -0 src/file/search/autotest/CMakeLists.txt
M +73 -0 src/file/search/autotest/filesearchstoretest.cpp
M +2 -0 src/file/search/autotest/filesearchstoretest.h
M +6 -0 src/file/search/filesearchstore.cpp
A +124 -0 src/file/search/wildcardpostingsource.cpp [License: LGPL (v2.1+)]
A +64 -0 src/file/search/wildcardpostingsource.h [License: LGPL (v2.1+)]

http://commits.kde.org/baloo/f6d91a14b792852b71cdabf6769c4e36cfc3f8ee

This issue is a lot more serious in the CJK (Chinese, Japanese, Korean) world. It makes baloo even *unusable*.
CJK languages don't use spaces to separate two words. For example, if I have a file named 桌面搜索 (means desktop search) and I can only remember 搜索 (search) but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has only one (none or all) toggle (on KDE 4.13), I have to turn off the whole baloo to ensure the searching function working properly.

As a workaround, would it be possible to provide an option for allowing the traditional search function to run automatically after baloo search?

Vishesh, you have done great work with baloo, but I agree with the others that this feature is vital. Using grep is much more useful atm. If this operation would be that much more expensive, try to add an option like "partial words".

Update report -

We have been working on a user visible query parser which has a simple syntax such as "title:Foo OR artist:Coldplay". It also supports some basic parentheses. The idea is to expose this query language everywhere where Baloo is used. This would include krunner as well.

So, you should be able to do "filename:foo". This would find all files which begin with foo. If you wanted to handle middle characters you can use wildcards: "filename:*foo". Any combination of * can be used. This extra * based approach is more expensive so it might be about 100-500 msecs slower. Depends on your database size. This * based approach is currently only implemented for filenames.
We can easily extend it to others though. I'll close this bug when krunner uses this parser. I want to get that done in time for Plasma 5.1.

I already read about the query parser which seems to be really nice - for more sophisticated users/queries. But does that mean that one has to enter "filename:myFileName" every time or does krunner still work with "myFileName" alone?

(In reply to mau from comment #10)
> I already read about the query parser which seems to be really nice - for
> more sophisticated users/queries. But does that mean that one has to enter
> "filename:myFileName" every time or does krunner still work with "myFileName"
> alone?

KRunner will still work with "myFileName" as well.

I believe that this functionality was working previously, before Baloo was used (so, Nepomuk?). For example, I had a folder full of file names like:

quiz1_w2014
quiz1_f2012
quiz3_w2011
quiz1_w2014

I used to be able to just search for "w2014" and get all the relevant quizzes. The way that search is now implemented is almost useless (in my opinion).

It is very frustrating that search worked perfectly before but now with baloo there is this huge regression in functionality. I've pretty much given up on searching from within dolphin, and use the terminal to " find . -n *w2011* -R" which is certainly not very handy, but at least it works. As other people have mentioned, searching with wildcards works perfectly in other operating systems.

Sorry, I messed up my previous comment... the search command should have been "find . -name *w2011*"

Hey Alexis

(In reply to alexis.gm from comment #12)
> I believe that this functionality was working previously, before Baloo was
> used (so, Nepomuk?). For example, I had a folder full of file names like:
>
> quiz1_w2014
> quiz1_f2012
> quiz3_w2011
> quiz1_w2014
>
> I used to be able to just search for "w2014" and get all the relevant
> quizzes. The way that search is now implemented is almost useless (in my
> opinion).
With Plasma 5.1, just searching for "w2014" will work. Before 5.1, we did not take _ into consideration.

Wildcard searching has also been implemented for filenames from 5.1, it just is not yet exposed in KRunner. It will definitely be there in 5.2.

Vishesh, I hope you don't mind me barging in here, but from skimming the report - does Baloo still do anything in the way of building word indices based on using whitespace as word boundaries, or does it properly use QTextBoundaryFinder these days?

(In reply to Eike Hein from comment #15)
> Vishesh, I hope you don't mind me barging in here, but from skimming the
> report - does Baloo still do anything in the way of building word indices
> based on using whitespace as word boundaries, or does it properly use
> QTextBoundaryFinder these days?

Not at all Eike. It uses QTextBoundaryFinder since Plasma 5.1. Previously we were relying on Xapian's internal splitting, which was not ideal.

Good :)

I was looking forward to plasma 5.2 in order to see if the search was working better. Unfortunately, now search doesn't work at all anymore.

For example, I open Dolphin, and while I'm at my home directory, I try to search for "Desktop" and I get no result. I try to search for any file or folder name and nothing comes up. Yet, the "filter" tool at the bottom can correctly find things that are in the current working directory.

I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.

(In reply to alexis.gm from comment #18)
> I was looking forward to plasma 5.2 in order to see if the search was
> working better. Unfortunately, now search doesn't work at all anymore.
>
> For example, I open Dolphin, and while I'm at my home directory, I try to
> search for "Desktop" and I get no result. I try to search for any file or
> folder name and nothing comes up. Yet, the "filter" tool at the bottom can
> correctly find things that are in the current working directory.
>
> I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.

Please use `baloosearch` and report your findings. Dolphin uses kio, which then uses baloo. It could be any one of those things going wrong. Also, Dolphin is currently still using the qt4 version. For all we know something might have gone wrong in the compatibility layer in between. Also, check if you have the baloosearch kioslave for qt4. Kubuntu packaging is a bit strange, they seem to put it in a non default package.

(In reply to Vishesh Handa from comment #19)
> (In reply to alexis.gm from comment #18)
> > I was looking forward to plasma 5.2 in order to see if the search was
> > working better. Unfortunately, now search doesn't work at all anymore.
> >
> > For example, I open Dolphin, and while I'm at my home directory, I try to
> > search for "Desktop" and I get no result. I try to search for any file or
> > folder name and nothing comes up. Yet, the "filter" tool at the bottom can
> > correctly find things that are in the current working directory.
> >
> > I'm using a fresh install of Kubuntu 14.10, on which I installed plasma 5.2.
>
> Please use `baloosearch` and report your findings. Dolphin uses kio, which
> then uses baloo. It could be any one of those things going wrong. Also,
> Dolphin is currently still using the qt4 version. For all we know something
> might have gone wrong in the compatibility layer in between. Also, check if you
> have the baloosearch kioslave for qt4. Kubuntu packaging is a bit strange,
> they seem to put it in a non default package.

Strange. I tried searching for something generic from the command line with baloosearch: "baloosearch document" and no results were returned.

I went to the search settings and played around with the "file search" option. I enabled/disabled "file search", as well as pressed the "Defaults" option. At this point the command line "baloosearch" worked, as well as searching from Dolphin.
I tried enabling/disabling file search again and this resulted in a baloo crash. Now when I try baloosearch from the terminal, I get the error message:

"baloosearch(4728): Xapian Database does not exist at "/home/foo/.local/share/baloo/file/" "

Nevertheless, searching within Dolphin still works, which makes me happy.

(In reply to Ukyoi from comment #7)
> This issue is a lot more serious in the CJK (Chinese, Japanese, Korean) world. It
> makes baloo even *unusable*.
> CJK languages don't use spaces to separate two words. For example, if I have
> a file named 桌面搜索 (means desktop search) and I can only remember 搜索 (search)
> but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> only one (none or all) toggle (on KDE 4.13), I have to turn off the whole
> baloo to ensure the searching function working properly.
>
> As a workaround, would it be possible to provide an option for allowing
> the traditional search function to run automatically after baloo search?

Just FYI, Korean language has similar spacing rules as most of European languages. When it comes to file names or other space constrained input fields, spacing between multiple words is sometimes omitted.

(In reply to Shinjo Park from comment #21)
> (In reply to Ukyoi from comment #7)
> > This issue is a lot more serious in the CJK (Chinese, Japanese, Korean) world. It
> > makes baloo even *unusable*.
> > CJK languages don't use spaces to separate two words. For example, if I have
> > a file named 桌面搜索 (means desktop search) and I can only remember 搜索 (search)
> > but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> > only one (none or all) toggle (on KDE 4.13), I have to turn off the whole
> > baloo to ensure the searching function working properly.
> >
> > As a workaround, would it be possible to provide an option for allowing
> > the traditional search function to run automatically after baloo search?
>
> Just FYI, Korean language has similar spacing rules as most of European
> languages. When it comes to file names or other space constrained input
> fields, spacing between multiple words is sometimes omitted.

Thanks for indicating that.

But now, baloo tends to ignore all the CJKV characters and it will "find" all CJKV names whatever you input.

(In reply to Ukyoi from comment #22)
> (In reply to Shinjo Park from comment #21)
> > (In reply to Ukyoi from comment #7)
> > > This issue is a lot more serious in the CJK (Chinese, Japanese, Korean) world. It
> > > makes baloo even *unusable*.
> > > CJK languages don't use spaces to separate two words. For example, if I have
> > > a file named 桌面搜索 (means desktop search) and I can only remember 搜索 (search)
> > > but not 桌面 (desktop), I'm unable to find it with baloo on. Since baloo has
> > > only one (none or all) toggle (on KDE 4.13), I have to turn off the whole
> > > baloo to ensure the searching function working properly.
> > >
> > > As a workaround, would it be possible to provide an option for allowing
> > > the traditional search function to run automatically after baloo search?
> >
> > Just FYI, Korean language has similar spacing rules as most of European
> > languages. When it comes to file names or other space constrained input
> > fields, spacing between multiple words is sometimes omitted.
>
> Thanks for indicating that.
>
> But now, baloo tends to ignore all the CJKV characters and it will "find"
> all CJKV names whatever you input.

CJK support is totally broken (framework 5.16.0).

For example, for a file named "<Chinese>.xls", 'baloosearch xls' will find this file, but 'baloosearch <Any Chinese Character>' returns an empty result. It seems the indexer ignores all Chinese characters.
$ balooshow -x 测试中文.txt
6211562092103681 2049 1446242 /home/cjacker/测试中文.txt
    Internal Info
    Terms: Mapplication Mx Mzerosize
    File Name Terms: Ftxt txt
    XAttr Terms:

$ balooshow -x a.txt
6306729977448449 2049 1468400 /home/cjacker/a.txt
    Internal Info
    Terms: Mplain Mtext T5 T8
    File Name Terms: Fa Ftxt a txt
    XAttr Terms:

More information:

QTextBoundaryFinder (qt-5.5) ignores all CJKV characters. Thus in engine/termgenerator.cpp, the termList ignores all CJKV characters, and only Latin terms are generated and indexed.

Also in engine/queryparser.cpp, the query string is processed by QTextBoundaryFinder and all CJKV characters are ignored.

What I do now is index every Chinese character as a term and use the 'AND' method to search. Here is a patch that only supports Chinese, but it should be extended to support CJKV.

diff -Nur baloo-5.16.0/src/engine/queryparser.cpp baloo-5.16.0n/src/engine/queryparser.cpp
--- baloo-5.16.0/src/engine/queryparser.cpp 2015-11-08 20:08:54.000000000 +0800
+++ baloo-5.16.0n/src/engine/queryparser.cpp 2015-11-20 13:22:44.928016124 +0800
@@ -161,7 +161,18 @@
         queries << phraseQueries;
         phraseQueries.clear();
     }
-
+    //detect text contains chinese or not.
+    //if contain chinese, every chinese character should be a term.
+    int nCount = text.count();
+    for(int i = 0 ; i < nCount ; i++)
+    {
+        QChar cha = text.at(i);
+        ushort uni = cha.unicode();
+        if(uni >= 0x4E00 && uni <= 0x9FA5)
+        {
+            queries << EngineQuery(QString(cha).toUtf8(), EngineQuery::StartsWith);
+        }
+    }
     if (queries.size() == 1) {
         return queries.first();
     }
diff -Nur baloo-5.16.0/src/engine/termgenerator.cpp baloo-5.16.0n/src/engine/termgenerator.cpp
--- baloo-5.16.0/src/engine/termgenerator.cpp 2015-11-08 20:08:54.000000000 +0800
+++ baloo-5.16.0n/src/engine/termgenerator.cpp 2015-11-20 13:22:45.432016115 +0800
@@ -99,6 +99,19 @@
 void TermGenerator::indexFileNameText(const QString& text, const QByteArray& prefix, int wdfInc)
 {
     QStringList terms = termList(text);
+    //detect text contains chinese or not.
+    //if contain chinese, every chinese character should be a term.
+    int nCount = text.count();
+    for(int i = 0 ; i < nCount ; i++)
+    {
+        QChar cha = text.at(i);
+        ushort uni = cha.unicode();
+        if(uni >= 0x4E00 && uni <= 0x9FA5)
+        {
+            terms<<QString(cha);
+        }
+    }
+
     for (const QString& term : terms) {
         QByteArray arr = term.toUtf8();

After the above patch, there is another minor issue.

If a filename contains "你好", searching for "你好" or "好你" will find this file, since the query string is separated into single characters that are combined with the "AND" method: "你" AND "好" is equal to "好" AND "你".

Introducing a more complex string-cutting method would resolve it, but the above solution has the best performance for supporting Chinese search.

I also considered adding semantic segmentation of Chinese strings to baloo to support "meaningful" searching, but it would cause a high workload.

At least, it can index/search Chinese now.

@CJacker: Thanks for the patch. Do you think it would be possible for you to submit this via reviewboard.kde.org along with a few test cases?

If not, maybe you could just provide some test cases, I'll be happy to translate them into actual code.

Created attachment 101448 [details]
Terminal Output
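As a side note on CJacker's patch earlier in the thread: the per-character scheme and the "你好" / "好你" side effect it mentions can be reproduced with a small standalone sketch. This is only an illustration under simplifying assumptions, not Baloo code: filenames and queries are UTF-32 strings, `cjkTerms`, `matchesAllTerms` and `containsPhrase` are hypothetical names, and the only thing taken from the patch is the 0x4E00..0x9FA5 range check.

```cpp
#include <set>
#include <string>

// Range check taken from the patch above (0x4E00..0x9FA5).
bool isCjkIdeograph(char32_t c) {
    return c >= 0x4E00 && c <= 0x9FA5;
}

// Hypothetical indexing step: every ideograph becomes its own term.
std::set<char32_t> cjkTerms(const std::u32string& text) {
    std::set<char32_t> terms;
    for (char32_t c : text) {
        if (isCjkIdeograph(c)) {
            terms.insert(c);
        }
    }
    return terms;
}

// AND query: each ideograph in the query must be a term of the filename.
// Character order is not checked, which is the side effect noted above.
bool matchesAllTerms(const std::u32string& fileName, const std::u32string& query) {
    const std::set<char32_t> indexed = cjkTerms(fileName);
    const std::set<char32_t> wanted = cjkTerms(query);
    if (wanted.empty()) {
        return false;
    }
    for (char32_t c : wanted) {
        if (indexed.count(c) == 0) {
            return false;
        }
    }
    return true;
}

// Order-sensitive comparison for contrast: a plain substring test.
bool containsPhrase(const std::u32string& fileName, const std::u32string& query) {
    return fileName.find(query) != std::u32string::npos;
}
```

With these definitions, `matchesAllTerms(U"你好.txt", U"好你")` is true while `containsPhrase(U"你好.txt", U"好你")` is false, so a pair like this might serve as one of the requested test cases.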
Dear Vishesh,

Coming back on searching with wildcard * ...

I'm using Kubuntu 16.04 with plasma 5.6, and the search in Dolphin only returns partial results although all files have been properly indexed by Baloo.

Attached is a screenshot of a terminal output looking for all files in the home directory including the string "alander" by means of the find command. A total of 31 files have been found with that method.

If I do the same query now using the search function of Dolphin, looking for filename only from home directory, and using "alander" as search string, I only get 12 files displayed (out of the 31): these are all files including the string "alander" at the beginning of a word (a filename typically contains several words separated by spaces or underscores). For instance, the file "IMG0423 alanderfredcaroline.JPG" is correctly hit, but the file "P1029318alandermarie.JPG" is not found. See second screenshot.

I have further checked whether the files that weren't hit by Dolphin search were correctly indexed by Baloo (with balooshow command), and they are all properly indexed with all metadata displayed.

If I do the same query in Dolphin with wildcards, namely "*alander" or "*alander*" as search string, then only 4 files are displayed (the ones ending with "alander", such as the file "IMG0417 alander.JPG"). See third screenshot.

Same results with baloosearch command: "baloosearch filename:*alander*" only hits the same 4 files.

Any idea? Please let me know if you need any further debugging info. Thx.

Created attachment 101449 [details]
Dolphin output
Created attachment 101450 [details]
Dolphin output
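The pattern in the "alander" report above (hits only when the string starts a word) is what one would expect from an index that stores whole words and matches queries against the start of each word. The sketch below reproduces that behaviour next to a plain substring test. It is an assumption-laden illustration, not Baloo's tokenizer: `nameTerms` simply splits on non-alphanumeric ASCII characters, and all function names are hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Lower-case an ASCII string so comparisons are case-insensitive.
std::string toLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Assumed tokenizer: each run of ASCII letters/digits is one term, so
// spaces, underscores and dots act as word boundaries.
std::vector<std::string> nameTerms(const std::string& fileName) {
    std::vector<std::string> terms;
    std::string current;
    for (char c : toLower(fileName)) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            current += c;
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        terms.push_back(current);
    }
    return terms;
}

// Index-style lookup: the query must be the beginning of some term.
bool termPrefixMatch(const std::string& fileName, const std::string& query) {
    const std::string q = toLower(query);
    for (const std::string& term : nameTerms(fileName)) {
        if (term.compare(0, q.size(), q) == 0) {
            return true;
        }
    }
    return false;
}

// What the reporters expect: the query may appear anywhere in the name.
bool substringMatch(const std::string& fileName, const std::string& query) {
    return toLower(fileName).find(toLower(query)) != std::string::npos;
}
```

Under these assumptions "IMG0423 alanderfredcaroline.JPG" passes `termPrefixMatch` for "alander", while "P1029318alandermarie.JPG" only passes `substringMatch`, which mirrors the 12-of-31 result described above.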
Yes, the Dolphin search cannot find filenames in certain circumstances.

Example: a search for "veranst" finds the file/folder
"Veranstaltungskalender Urban"
"035A_V1_Veranstaltungsplan_Teilnehmerlisten.pdf"

But a search for "eranst" or "*eranst*" finds nothing.

It seems that the search string has to be the beginning
- of the filename
- of a word after a space
- of a word after an underscore character
- or something like that.

In my opinion, a user expects to find EVERY file whose name includes the string "eranst", at least if he uses wildcards.

A small but important issue. The search function is an essential function for a file manager.

Tested on Archlinux with versions:
- plasma-desktop 5.10.4-1
- plasma-framework 5.36.0-2
- KDE Applications 17.04.3-1

I have to agree with the above commenters: it’s absolutely vital to be able to search partial filenames. Maybe it’s not so important in English where compounds are written with spaces (thus permitting word-initial searches to succeed) but for, say, Finnish, Swedish, German, or Russian it means you cannot find files with the equivalent for “test” *somewhere* in the name, only if it’s right at the beginning.

This is not to say I cannot appreciate the difficulties in its implementation!

For the moment KFind with wildcards is working right.

Dolphin does support RegEx (regular expressions), including lookahead. So, one workaround (and also a solution for more advanced searches) is to make use of lookahead.

To find every file containing the string string1 in its filename, enter (?=.*string1) in the search field.

To find every file containing two or more strings (string1, string2, string3, etc.) in its filename, repeat the initial expression once per string. For two it is (?=.*string1)(?=.*string2), for three (?=.*string1)(?=.*string2)(?=.*string3), etc.

The problem described still persists. I use kde neon 20.04, plasma 5.20.5, up to date.
The problem occurs with searches on the home partition as well as with USB hard drives, with pdf as well as with video files (see attachment).

Created attachment 134902 [details]
Dolphin baloo-search
Created attachment 134903 [details]
Dolphin baloo-search
|