Bug 307585 - Sorting is too cpu intensive for big folders
Summary: Sorting is too cpu intensive for big folders
Status: RESOLVED WORKSFORME
Alias: None
Product: dolphin
Classification: Applications
Component: general
Version: 2.1
Platform: unspecified Linux
Priority: NOR    Severity: normal
Target Milestone: ---
Assignee: Dolphin Bug Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-29 18:56 UTC by Mark
Modified: 2012-10-06 15:14 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In:


Attachments
First draft of a patch (2.46 KB, patch)
2012-09-30 09:19 UTC, Frank Reininghaus
Profiling data targeted at ::insertItems (280.08 KB, application/octet-stream)
2012-09-30 11:44 UTC, Mark
very hacky sorting improvement. Twice as fast as the current dolphin (52.69 KB, patch)
2012-09-30 21:02 UTC, Mark

Description Mark 2012-09-29 18:56:45 UTC
Hi,

While I was profiling #306290, I noticed that sorting the folder list takes up an enormous amount of time. We are talking about really huge folders with 100.000 files or more (where exactly you start noticing this depends on your hardware).

The current sorting algorithm can be found in kfileitemmodel.cpp in the function: KFileItemModel::insertItems.

In there it does something quite inefficient. It has 2 loops and 2 lists:
1. the list of all items
2. the list of the resulting sorted items

In pseudo code it's doing something along the following lines:

- Loop over all items (list 1)
- - per item, loop over the second list (list 2) to find the index of the last inserted item. Note: this does not have to be the "last" item in the list. Once it finds that index, it inserts the current item after it.

This is very intensive: the second list grows with every item added, so the inner loop has more and more items to iterate through.

This sorting seems really odd and inefficient and could benefit greatly if done differently. That probably does require changing quite a bit of code in the kfileitemmodel.cpp class, so it is no small task.

Also, that isn't the only sorting done. A "KFileItemModelSortAlgorithm::sort" is also done, and I have yet to figure out what it does exactly. That one also shows up very high in profiling output.

I am still investigating this issue, but I think it should be as simple as something like the following:
QList<ItemData*> items;
qSort(items);
// sorting done

to be continued.
Comment 1 Mark 2012-09-29 21:50:15 UTC
Little update on this one.

I am currently investigating the sorting stuff in Dolphin. What I've done for now is simply get rid of nearly all sorting and use a QMap instead, because it's sorted by default.

The initial results are promising. There is still 100% CPU usage, but the folder fetching is noticeably faster, and scrolling through a massive folder keeps working even though one CPU is at 100%.

The downside is that a QMap isn't perfect for sorting. For example, I'm getting results like this:
AAA
BBB
CCC
aaa
bbb
ccc

instead of:
aaa
AAA
bbb
BBB
ccc
CCC
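
For illustration only (not Dolphin code): QMap<QString, T> orders its keys with QString::operator<, which compares raw UTF-16 code units, so every uppercase letter ('A' is 0x41) sorts before every lowercase one ('a' is 0x61). A minimal sketch showing the effect:

#include <QMap>
#include <QString>
#include <QtDebug>

int main()
{
    QMap<QString, int> map;
    map.insert("aaa", 0); map.insert("AAA", 0);
    map.insert("bbb", 0); map.insert("BBB", 0);
    map.insert("ccc", 0); map.insert("CCC", 0);
    // keys() returns the keys in ascending (code-unit) order:
    qDebug() << map.keys(); // ("AAA", "BBB", "CCC", "aaa", "bbb", "ccc")
    return 0;
}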
Comment 2 Frank Reininghaus 2012-09-30 09:15:35 UTC
Thanks for your analysis, Mark!

(In reply to comment #0)
> The current sorting algorithm can be found in kfileitemmodel.cpp in the
> function: KFileItemModel::insertItems.
> 
> In there it does something quite inefficient. It has 2 loops and has 2 lists.
> 1. the list of all items
> 2. the list of the resulting sorted items
> 
> In pseudo code it's doing something along the following lines:
> 
> - Loop over all items (list 1)
> - - per item, loop over the second list (list 2) to find the index of the
> last inserted item. Note: this does not have to be the "last" item in the
> list. Once it finds that index, it inserts the current item after it.
> 
> This is very intensive as the second list grows with every item added thus
> the inner loop gets bigger and bigger to iterate through.

I think that you refer to the loop

while (sourceIndex < sortedItems.count()) {
  ...
}

The run time of this part of the code should ideally be O(N), where N is the maximum of the number of items that are inserted and the number of items that are in the model already.

But I can see that even then there is a problem if the 100000 items are inserted not in one go, but in many small groups. Then we have to run the insertItems() function O(N) times and end up with O(N^2) run time behaviour, which is very bad.

The first part of the loop,

        while (targetIndex < m_itemData.count()) {
            if (!lessThan(m_itemData.at(targetIndex), sortedItems.at(sourceIndex))) {
                break;
            }
            ++targetIndex;
        }

could be replaced by a binary search, eliminating one source of O(N) behaviour. There is still the adjustment of the indices in m_items after the main loop, which we can't do much about, I'm afraid. But then there is the following line inside the loop:

m_itemData.insert(targetIndex, sortedItems.at(sourceIndex));

This could actually be the worst part of it, because insertion in lists is O(N), and this can happen O(N) times in a single call of the function, which is possibly called many times. To change this, one would not insert into m_itemData directly, but rather merge the two lists sortedItems and m_itemData into a single new list (which has reserved space for both lists combined). Then we would only have to *append*, which is O(1), rather than *insert*, which is O(N).
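
In sketch form (assuming, for illustration, that both lists hold ItemData* and that lessThan() is in scope here), the merge would look roughly like this:

// Merge two pre-sorted lists by appending into a third one, so every
// step is O(1) instead of an O(N) QList::insert() into the middle.
QList<ItemData*> merged;
merged.reserve(m_itemData.count() + sortedItems.count());

int a = 0; // index into m_itemData
int b = 0; // index into sortedItems
while (a < m_itemData.count() && b < sortedItems.count()) {
    if (lessThan(m_itemData.at(a), sortedItems.at(b))) {
        merged.append(m_itemData.at(a++));
    } else {
        merged.append(sortedItems.at(b++));
    }
}
while (a < m_itemData.count()) { // append the leftovers of either list
    merged.append(m_itemData.at(a++));
}
while (b < sortedItems.count()) {
    merged.append(sortedItems.at(b++));
}
m_itemData = merged;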

> This sorting seems really odd and inefficient and could benefit greatly if
> done differently. That probably does require changing quite a bit of code in
> the kitemmodel.cpp class thus is no small task.

If my assumption about the bottleneck is right, it should be doable. I'll attach a patch in a minute.

> Also, that isn't the only sorting done. A
> "KFileItemModelSortAlgorithm::sort" is also done and i have yet to figure
> out what it does exactly. That one also shows up very high in profiling
> output.

This function does the pre-sorting of the items to be inserted in insertItems() and also performs the re-sorting if the sort role or sort order is changed.
Comment 3 Frank Reininghaus 2012-09-30 09:19:40 UTC
Created attachment 74239 [details]
First draft of a patch

Mark, can you test if this patch (a bit hackish, must definitely be polished a bit) improves anything?

I would also be curious to know how many times the insertItems() function is called when you load your huge folder.
Comment 4 Mark 2012-09-30 10:41:19 UTC
Hi Frank,

The insertItems function is only called once.
The entire insertItems comes up as one massive resource user. The thing you've patched will probably speed it up a little, but the no.1 resource user in that function is KFileItemModelSortAlgorithm::sort.

Right now I'm completely rebuilding the way this sorting is done and moving it up way earlier in the chain of function calls. I'm targeting KFileItemModel::slotNewItems for the sorting, since that function is already iterating over all the items, so it makes sense to sort there. The issue I keep having thus far is no natural sorting -_-

Anyway, some steps i will do next:
1. Profiling dolphin as it is now with the current sorting stuff and putting that profile in here for others to see.
2. Test out your patch
3. Continue playing with sorting

I will report back here once I have targeted profiling data.
Comment 5 Mark 2012-09-30 11:44:39 UTC
Created attachment 74242 [details]
Profiling data targeted at ::insertItems

This file is the profiling data from the current Dolphin without any patches from me or Frank. Last commit hash: 15c66cafb7cfd0c9725b620e2ba7e96b389f5f8e

The profiling was done on a directory with 100.000 files in it. Those files were generated using the following command: touch a{1..100000}.txt
Comment 6 Frank Reininghaus 2012-09-30 12:04:31 UTC
(In reply to comment #4)
> The insertItems function is only called once.

In that case, my patch won't change anything. My patch targets the case that "many" items are in the model already, and "many" are added.

> The entire insertItems comes up as one massive resource user. The thing
> you've patched will probably speed it up a little, but the no.1 resource
> user in that function is KFileItemModelSortAlgorithm::sort.

OK, but this doesn't actually say much. If many items are to be sorted, it's kind of expected that sorting takes some time. The crucial question is how the time required to sort scales with the number of items. It should ideally be O(N*log(N)).

> Right now i'm completely rebuilding the way this sorting is done and moving
> it up way earlier in the function calls. I'm targeting
> KFileItemModel::slotNewItems for the sorting since that function is already
> iterating over all the items. So it makes sense to sort there.

I can't see how this could improve anything. The runtime complexity of the sort operation does not depend at all on the place where the sorting is done. Moreover, you'd have to consider the possibility that the user changes the sort order/sort role between the execution of slotNewItems() and insertItems().

> The issue i
> keep having thus far is no natural sorting -_-

This is obviously a no-go ;-)

> Anyway, some steps i will do next:
> 1. Profiling dolphin as it is now with the current sorting stuff and putting
> that profile in here for others to see.
> 2. Test out your patch
> 3. Continue playing with sorting
> 
> i will report back in here once i have targeted profiling data.

I think that profiling data will be much more useful if you obtain them with a source build. Also Qt should be built from source or at least have debugging symbols because it seems that much time is spent in unknown Qt functions according to your log.
Comment 7 Mark 2012-09-30 12:36:46 UTC
I don't think I need debug symbols for Qt. I do have a Qt debug build here and can just link to that, but that doesn't really help a lot since the sorting is a custom thingy done in Dolphin code, not in Qt.

The thing I keep thinking is that everything is in place to work fast, but that the right calls are not being made. For example, I think this would do the trick:

void KFileItemModel::insertItems(const KFileItemList& items)
{
    ...
    QList<ItemData*> sortedItems = createItemDataList(items);
    qSort(sortedItems.begin(), sortedItems.end(), KFileItemModel::lessThan);
    // all sorting done!
    ...
    // some housekeeping, such as updating indexes;
    // certainly no other complex things
}

However, that doesn't work because KFileItemModel::lessThan is a class member and requires other class members to work. qSort needs a global function there.
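
(Side note: qSort() and qStableSort() also accept a functor, so one hypothetical, untested way around this would be a small comparison object that carries a pointer to the model:)

// Sketch only: a functor makes the member comparison usable with
// qSort()/qStableSort() without any global function or global state.
class ItemLessThan
{
public:
    explicit ItemLessThan(const KFileItemModel* model) : m_model(model) {}

    bool operator()(const ItemData* a, const ItemData* b) const
    {
        // assumes lessThan() is const and made accessible to this functor
        return m_model->lessThan(a, b);
    }

private:
    const KFileItemModel* m_model;
};

// usage: qStableSort(sortedItems.begin(), sortedItems.end(), ItemLessThan(this));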

I know this is a lot faster and cleaner, plus it completely ignores KFileItemModelSortAlgorithm::sort, which is very slow anyway. I just have to get the above working :)
Comment 8 Frank Reininghaus 2012-09-30 13:01:13 UTC
(In reply to comment #7)
> I don't think i need debug symbols for Qt. I do have a Qt debug build here
> and can just link to that, but that doesn't really help a lot since the
> sorting is a custom thingy done in Dolphin code, not in Qt.

I was mostly referring to Qt functions which are called in other areas.

> However, that doesn't work because KFileItemModel::lessThan is a class
> member and requires other class members to work. qSort needs a global
> function there.

IIRC, this is exactly the reason why Peter did not use something like qSort() when he wrote the new view engine.

> I know this is a lot faster and cleaner + that it completely ignores
> KFileItemModelSortAlgorithm::sort which is very slow anyway. I just have to
> get the above working :)

I don't have much time now, but I would strongly recommend studying the git logs for the sorting code. Probably Peter's commit messages provide some information about the design choices he made.
Comment 9 Mark 2012-09-30 13:06:49 UTC
Ohh, but I can make it work :)

The cleanest way is to subclass KFileItem and overload the < operator. But that makes the bookkeeping in other areas trickier, since I would always have to cast KFileItem to the subclass... just for one operator.

So I'm tempted to do it a dirty way now, just to get it working.
Comment 10 Frank Reininghaus 2012-09-30 13:49:18 UTC
(In reply to comment #9)
> The cleanest way is to overload KFileItem and add operator overloading for
> the < operator.

I fail to see how you would tell such an overloaded operator about the sort role and sort order.

> So i'm tempted to do a dirty way now just to get it working.

Before you waste time on a dirty hack which cannot be included in Dolphin anyway, I again recommend examining the git history. When you work on code that you think is not good, you should always find out first why the code is the way it is. There might be a good reason for the design choices that is not obvious when you just look at the code.
Comment 11 Mark 2012-09-30 15:13:13 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > The cleanest way is to overload KFileItem and add operator overloading for
> > the < operator.
> 
> I fail to see how you would tell such an overloaded operator about the sort
> role and sort order.

Well, qSort simply requires either an overloaded < operator to work or a public function. It's one of the two :p
> 
> > So i'm tempted to do a dirty way now just to get it working.
> 
> Before you waste time on a dirty hack which cannot be included in Dolphin
> anyway, I recommend again to examine the git history. When you work on code
> that you think is not good, you should always find out first why the code is
> like it is. There might be a good reason for the design choices that is not
> obvious when you just look at the code.

I just did that, and I couldn't find any reasoning for using this versus qSort. When I look at the revision log, it seems like the sorting evolved over time into what it is now.

So I went on with my hack in the strong belief that it is simply faster. I've got it working now, and it works just fine even if you change the sorting. This is feature complete, but 100% hacky, and is really just a showcase to see if qSort can be used. I didn't go for < overloading, but instead made all the required functions globally public :)

The numbers for sorting?
Before: ~4.000.000.000 "cycles"
Now: ~3.000.000.000 "cycles"

That's quite a speedup, but still not enough. Of those cycles, ~2.600.000.000 are spent in KStringHandler::naturalCompare! That is not something I can speed up. What might be possible is calling it less often. Right now that function is called well over 1.800.000 times for a dataset of 100.000.

/me continues. These are the awesome bugs :)
The clean way to implement this would imho be recreating the kfileitemmodelsortalgorithm.h class and putting all the functions and params in there that I had to make public. That is all very sorting-related, so it would clean things up rather nicely.
Comment 12 Mark 2012-09-30 17:04:58 UTC
Another 1.000.000.000 down.
Now: ~2.000.000.000 for sorting.

Note: compared to the original sorting this new sorting is twice as fast now. My goal now is to try and get the complete sorting below 1 billion.

The trick that did it was splitting the full data container into folders and "the rest". Then I sort each list individually and, more importantly, completely remove the folder check from the lessThan function, which prevents millions of calls to isDir. The rest remains the same and is still feature complete compared to the current Dolphin. I will post my current patch later today; in outline it looks like the sketch below.
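
(A sketch of the idea; nameLessThan stands in for a hypothetical comparison function without the isDir() check:)

// Partition once, sort each part without any isDir() call in the
// comparison, then concatenate so that folders still come first.
QList<ItemData*> dirs;
QList<ItemData*> files;
foreach (ItemData* itemData, sortedItems) {
    if (itemData->item.isDir()) { // assuming ItemData exposes its KFileItem as 'item'
        dirs.append(itemData);
    } else {
        files.append(itemData);
    }
}
qStableSort(dirs.begin(), dirs.end(), nameLessThan);
qStableSort(files.begin(), files.end(), nameLessThan);
sortedItems = dirs + files;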
Comment 13 Mark 2012-09-30 21:02:07 UTC
Created attachment 74255 [details]
very hacky sorting improvement. Twice as fast as the current dolphin

This is my extremely hacky patch. It improves the sorting performance 2x compared to the current Dolphin release. I did include the natural sorting function itself in here for ease of profiling.

Right now that exact function is the one sucking up nearly all the time spent in sorting. I can't think of any way to improve that function :(
Comment 14 Frank Reininghaus 2012-10-01 07:38:02 UTC
Thanks for showing your patch. I appreciate that you try to make sorting faster. However, if you had taken the time to read and understand the Doxygen comment in kfileitemmodelsortalgorithm.h before you started hacking, you would have seen that some of the decisions you took are just wrong.

1. The reason for having a custom sorting implementation, rather than just using qStableSort() directly, is that we want to use a class member as the comparison function. I already asked you in comment 10 how you would set the sort role and sort order if you don't do it this way. Now I see that you are using global variables for that. It should be obvious that this is a no-go.

2. If you know a thing or two about the run-time complexity of different sort algorithms (if you don't, trying to improve sorting makes no sense at all IMHO), you know that qSort(), which you propose to use, and qStableSort(), which is the base of the current sorting implementation, use different sort algorithms (Quicksort and Mergesort, respectively), which both have an average complexity of O(N*log(N)). This means that the average number of comparisons needed to sort N items is N * log(N) * "some constant factor which depends on the algorithm". This factor is a bit smaller for Quicksort, which probably explains a considerable part of the performance gain you are seeing. However, Quicksort has a disastrous worst-case complexity of O(N^2), whereas Mergesort is always O(N*log(N)). In case it isn't clear what this means: I would just have to look at the implementation of qSort(), check how it chooses the pivot element, and I could write a script that creates a folder with a given number of files, which have names and modification times crafted in such a way that changing the sort role from "Name" to "Date" would hit the worst case. If you think that N*log(N) comparisons are "too cpu intensive" for N=100000 items, I'll let you do the maths yourself and answer the question whether we want something like that or not.

If we forget about these things, your idea to first group folders and files remains. I agree that this idea is sort of nice, but I don't see KFileItem::isDir() showing up significantly in your Callgrind log. Therefore, I doubt that the performance gain we could get from this is worth the increased code complexity. Note that code complexity does not matter only when the code is written, but also, and I would say much more importantly, when the code is read. Code is read many more times than it's written, so we want to make sure that the code is straightforward and easy to understand. Performance is important as well, of course, but there should really be a significant performance gain before we even start considering making the code more complex.
Comment 15 Mark 2012-10-01 11:25:20 UTC
Hi Frank,

I completely agree with you! This is by no means acceptable code and should not be merged! Let me remind you that I'm merely playing with this code, trying to find the most optimal way to sort a directory.

My next attempt is going to be even more daring: I will skip sorting completely and try to use a B-tree, because it keeps its entries in order by nature. I don't know how that will end up. It might be a complete failure, and that's no problem either. Again, I'm just playing to find the optimal performance before I turn this into something acceptable and mergeable.

So it's just playing for now. From now on I will try to keep the noise low in this bug report :)
Comment 16 Frank Reininghaus 2012-10-01 12:42:49 UTC
Thanks for the quick reply.

You are of course welcome to play around as much as you like, but using a different data structure is very unlikely to make any difference. If you think that this would "skip sorting", I'm afraid that you are mistaken. Constructing a data structure which is "sorted by nature" is equivalent to sorting, and I believe that it is impossible to do that in less than O(N*log(N)) steps.

Anyway, I'll close this report because I haven't seen any evidence yet that the current implementation is slower than it needs to be.
Comment 17 Christoph Feck 2012-10-02 01:59:59 UTC
N = 100.000 => N * log2(N) ≈ 1.660.000, which is pretty close to the 1.800.000 compares you got.

If that needs 2.000.000.000 cycles, then over 1000 cycles are spent in each compare. Instead of optimizing the sort algorithm, you need to optimize the string compare algorithm. If you can get each compare down to 100 cycles, the sorting becomes faster by a factor of 10.

Currently, localeAwareCompare() allocates two temporary strings, converts each string to a collating sequence, then compares them. To make that faster, you could pre-generate the collating sequences for each string. Optimal would be a localeAwareCompare() that does not need any temporary buffers at all.
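
(A sketch of that idea, hypothetical and ignoring the natural-number handling: compute one collation key per name up front, e.g. with strxfrm(), so that the sort itself only does cheap byte-wise comparisons:)

#include <QByteArray>
#include <QString>
#include <cstring> // std::strxfrm

// Computed once per file name; the sort then compares the cached keys
// with plain QByteArray::operator<, which allocates nothing per compare.
static QByteArray collationKey(const QString& name)
{
    const QByteArray local = name.toLocal8Bit();
    const size_t size = std::strxfrm(0, local.constData(), 0); // required key length
    QByteArray key(int(size) + 1, '\0');
    std::strxfrm(key.data(), local.constData(), size + 1);
    return key;
}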
Comment 18 Mark 2012-10-02 19:03:03 UTC
(In reply to comment #17)
> N = 100.000 => N * log N = 1.660.000, which is pretty close to the 1.800.000
> compares you got.
> 
> If that needs 2.000.000.000 cycles, then over 1000 cycles are spend in each
> compare. Instead of optimizing the sort algorithm, you need to optimize the
> string compare algorithm. If you can get each compare down to 100 cycles,
> then the sorting is faster by a factor of 10.
> 
> Currently, localeAwareCompare() allocates two temporary strings, converts
> each string to a collating sequence, then compares them. To make that
> faster, you could pre-generate the collating sequences for each string.
> Optimal would be a localeAwareCompare() that does not need any temporary
> buffers at all.

Actually, the bottlenecks inside the naturalCompare function are:
~500.000.000 for QChar::isDigit (called ~38 million times)
~370.000.000 for QStringRef::localeAwareCompare (called ~1.800.000 times)

The two QStringRef allocations are "only" costing about ~22.000.000, so they are not a real bottleneck at the moment :) I wonder how one can optimize the isDigit one... that one really sucks up time.
Comment 19 Christoph Feck 2012-10-02 21:01:12 UTC
static inline bool isDigit(QChar c)
{
    return unsigned(c.unicode() - '0') < 10U;
}

This will only handle Arabic digits, but I don't know if non-Arabic digits were handled before. The allocations I was speaking about are those done inside localeAwareCompare().

Regarding your request on the Qt list, I looked at http://unicode.org/reports/tr10/ (Unicode Collation), but unfortunately it says it won't handle numerical sorting, so it looks like we still need the custom code even with Qt 5.x.
Comment 20 Mark 2012-10-02 21:25:50 UTC
(In reply to comment #19)
> static inline bool isDigit(QChar c)
> {
>     return unsigned(c.unicode() - '0') < 10U;
> }
> 
> This will only handle arabic digits, but I don't know if non-arabic digits
> have been handled before. The allocations I was speaking about are those
> that are done inside localeAwareCompare().

Ohh you actually meant inside localeAwareCompare.
I'm not touching Qt internals :) This optimizing is difficult enough as it is.
> 
> Regarding your request on the qt list, I looked at
> http://unicode.org/reports/tr10/ (Unicode Collating), but unfortunately, it
> says it won't handle numerical sorting, so it looks like we still need the
> custom code even with Qt 5.x.

The collating sounds interesting. I wonder how fast it will be. Thiago replied in my request and told me that "QCollator will be public in Qt 5.1." :) I don't know if that one will do numerical sorting...
Comment 21 Mark 2012-10-05 16:56:25 UTC
Did anyone try "ls -v" on the command line in a folder with A LOT of files? 100k or so will do just fine. Note that the sorting is natural, and the speed is way superior to Dolphin's! It's fast! Freaking fast! How do they do that?
Comment 22 Christoph Feck 2012-10-05 17:33:34 UTC
They don't use QString.
Comment 23 Mark 2012-10-05 17:36:28 UTC
(In reply to comment #22)
> They don't use QString.

No, you have to be joking! That can't possibly explain the massive time difference, right? QString is highly optimised, right? Equal to or faster than std::string, right?
Comment 24 Christoph Feck 2012-10-05 17:54:50 UTC
I just checked the coreutils sources. Assuming "filevercmp.c" is the correct file that handles natural sorting, it is not locale aware.
Comment 25 Mark 2012-10-05 19:34:30 UTC
(In reply to comment #24)
> I just checked the coreutils sources. Assuming "filevercmp.c" is the correct
> file that handles natural sorting, it is not locale aware.

but is locale awareness really needed?
Comment 26 Frank Reininghaus 2012-10-06 12:35:49 UTC
I don't know how "ls" does it, but the only obvious way to speed up the comparison that I can see at the moment would be to cache some intermediate results which KStringHandler::naturalCompare() needs. All these isDigit() calls are needed to split the string into a sequence of number and non-number chunks. One could do this splitting just once and use the result to do a faster comparison. The slow part would then only be done N times, and the O(N*log(N)) comparisons would be fast.
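
(In sketch form, with hypothetical types; the comparison would then compare number chunks by value and text chunks with localeAwareCompare(), with no isDigit() calls left in the O(N*log(N)) part:)

#include <QList>
#include <QString>

// One chunk of a file name: either a run of digits or a run of non-digits.
struct Chunk
{
    bool isNumber;
    QString text;    // the raw chunk
    qlonglong value; // parsed value, only meaningful if isNumber
};

// Done once per name, so the isDigit() scanning happens only N times.
static QList<Chunk> splitIntoChunks(const QString& name)
{
    QList<Chunk> chunks;
    int i = 0;
    while (i < name.length()) {
        Chunk c;
        c.isNumber = name.at(i).isDigit();
        const int start = i;
        while (i < name.length() && name.at(i).isDigit() == c.isNumber) {
            ++i;
        }
        c.text = name.mid(start, i - start);
        c.value = c.isNumber ? c.text.toLongLong() : 0;
        chunks.append(c);
    }
    return chunks;
}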

I want to make clear though that I do not believe that this approach should be implemented. I don't think that the time saving for this corner case would be worth the greatly increased code complexity and maintenance effort. As soon as we implement this, someone will probably try to open a folder with a million files, and the next bottleneck will show up.
Comment 27 Mark 2012-10-06 14:45:39 UTC
(In reply to comment #26)
> I don't know how "ls" does it, but the only obvious way to speed up the
> comparison that I can see at the moment would be to cache some intermediate
> results which KStringHandler::naturalCompare() needs. All these isDigit()
> calls are needed to split the string into a sequence of number and
> non-number chunks. One could do this splitting just once and use this to do
> a faster comparison. The slow part would then only be done N times, and the
> O(N*log(N)) comparisons would be fast.
> 
> I want to make clear though that I do not believe that this approach should
> be implemented. I don't think that the time saving for this corner case
> would be worth the greatly increased code complexity and maintenance effort.
> As soon as we implement this, someone will probably try to open a folder
> with a million files, and the next bottleneck will show up.

Well, I agree with you if someone tries to open folders with millions of files. That's just absurd, and a user doing that should not expect it to even work. Besides, you "might" even get into trouble with filesystem support (ext3, that is; ext4 can cope with it).

But optimizing it so that 100.000 files (or 50.000, since you already notice it there) show up "fast" is something that should imho be "bearable" for Dolphin. Folders with that massive an amount of files are obviously unexpected and rare to encounter, but they are possible. For instance, a photographer who dumps his entire photo collection from some long holiday into one folder, ready to sort out later. Or web development, where all generated thumbnails get dumped into one folder. Yes, those are corner cases, but not impossible.

As for optimizing the natural sorting algorithm: I already tried one thing, putting all characters in a hash map like so:

QHash<QChar, bool> digitCache;

This would record for each character whether it's a digit or not. That "solution" is far from faster. In total it takes a LOT more time to calculate the hash and find the right value than to just call isDigit. Even though QHash lookup is O(1), calculating the actual hash value takes time; more time than isDigit needs to do its thing. It would have been very nice if this had worked fast, since the map would be quite small, as there would be no duplicate keys in it.

I don't know if pre-splitting the string somewhere outside of naturalCompare and feeding the pre-parsed results to naturalCompare is going to be faster. There will be additional lookup time whichever container you choose, even the O(1) ones. Though if something like this is done, the natural sorting would obviously not live in KStringHandler::naturalCompare anymore, since it becomes Dolphin-specific. It would have to live in Dolphin.
Comment 28 Christoph Feck 2012-10-06 15:06:04 UTC
> QHash<QChar, bool> digitCache;

You made my day.
Comment 29 Mark 2012-10-06 15:14:12 UTC
(In reply to comment #28)
> > QHash<QChar, bool> digitCache;
> 
> You made my day.

Cool huh :)
Not very good for performance ^_^ But I took the idea down to the char level. The downside is a _lot_ of hash lookups.

Could you explain your awesome idea?