Bug 210201 - Amarok collection scanner freezes while scanning/adding too high number of files
Summary: Amarok collection scanner freezes while scanning/adding too high number of files
Status: RESOLVED FIXED
Alias: None
Product: amarok
Classification: Applications
Component: Collections/Local
Version: 2.2.0
Platform: Ubuntu Linux
Importance: NOR crash
Target Milestone: ---
Assignee: Amarok Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-11 16:38 UTC by mathesis
Modified: 2009-11-26 14:10 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
mysqld.log.bz2 (107.99 KB, application/octet-stream)
2009-10-17 07:50 UTC, Matt Whitlock
Details
mysqld_slow_query.log.bz2 (123.13 KB, application/octet-stream)
2009-10-17 09:50 UTC, Matt Whitlock
Details
Adds debug output to some SQL queries (2.31 KB, patch)
2009-10-17 15:40 UTC, Jeff Mitchell
Details
debug log resulting from patched Amarok (49.10 KB, application/octet-stream)
2009-10-17 16:31 UTC, Matt Whitlock
Details

Description mathesis 2009-10-11 16:38:25 UTC
Version:            (using KDE 4.3.2)
OS:                Linux
Installed from:    Ubuntu Packages

Configuration:
- Ubuntu 9.10 Karmic Koala (Beta)
- KDE 4.3.2
- Amarok 2.2.0 (compilation date: October 8th 2009)

I have encountered this problem since the beginning of Amarok 2. I have seen past bug reports about it, but I still experience the issue.

I have around 6000 music files (half *.ogg, half *.mp3) in various folders, but one of them contains around 3000 files. If I try to add these folders to my collection at once, the collection scanner fails somewhere between 80 and 96%, depending on the run. By "fails" I mean Amarok freezes. I have to kill it.

Up to now the only workaround I have found is to add the folders progressively, one after the other, but it is impossible to add the one with around 3000 files. So I can only add half of my collection...

Notes:
- Worked correctly with Amarok 1.4 (hope there will be a package for Karmic Koala...)
- Files in the "3000 files folder" are all *.mp3, tagged under Windows in the ISO-8859-15 character set;
- collection_scan.log points to a file, but the file indicated doesn't cause the issue.
Comment 1 Jeff Mitchell 2009-10-11 23:26:35 UTC
Please get a backtrace, or figure out which file(s) are causing the problem. A simple binary reduction should do it (put half the files in a different folder, try scanning each; if one of them still freezes, split again, etc.)

There isn't anything inherent in Amarok that keeps a folder of X files from being correctly scanned. I've never seen this kind of problem not be due to specific file(s).
Comment 2 mathesis 2009-10-12 23:17:09 UTC
I proceeded as you suggested. Getting less and less patient, I started splitting into slices of 200-300 files, then 400-500, and finally 600-700, only to realize that rather than simply freezing, the scan was taking a lot of time (more and more - not linear behaviour).

So I decided to go back to the original configuration (with one folder of 3700 files and a total of 6900 files), and it took 1 minute for the scan to reach 95% and then another 25 minutes to get the process 100% completed.

So, the description of the bug is incorrect (in previous versions of Amarok, half an hour was not sufficient, which is why I thought it was freezing), but there is a bug.

I assume that cancelling the scan also takes time and does not actually freeze, as previously supposed.

Let me know if you think I can further help, but without backtrace, guilty file or whatever, I'm not sure I can do more.
Comment 3 Matt Whitlock 2009-10-16 09:01:53 UTC
This happens to me, too.  As far as I can tell, the actual scanning of files on disk is finished at 90%.  (Maybe it's 95% for mathesis, but it's 90% for me.)  Then the last portion of the progress bar is reserved for updating the database tables with the findings of the scan.  It's the database update that takes ages.  Maybe the queries aren't constructed or optimized well?
Comment 4 Myriam Schweingruber 2009-10-16 12:03:04 UTC
(In reply to comment #3)
> This happens to me, too.  As far as I can tell, the actual scanning of files on
> disk is finished at 90%.  (Maybe it's 95% for mathesis, but it's 90% for me.) 
> Then the last portion of the progress bar is reserved for updating the database
> tables with the findings of the scan.  It's the database update that takes
> ages.  Maybe the queries aren't constructed or optimized well?
And you also have Amarok 2.2.0, do you?
Comment 5 Matt Whitlock 2009-10-16 12:04:15 UTC
(In reply to comment #4)
> And you also have Amarok 2.2.0, do you?

Sorry, I should have mentioned.  Yes, I am running Amarok 2.2.0 on KDE 4.3.2.
Comment 6 Jeff Mitchell 2009-10-17 03:18:56 UTC
This is disconcerting, as I did a lot of work between 2.1 and 2.2 in order to eliminate exactly this condition (the SQL queries eating up the CPU for a long time after the scan).

The last time I had people try things out, everything still seemed to be working well. It's possible that there's some corner case going on, but it's something that will be hard to track down without having your collections.

What would be useful is if you guys could put your collections in an external MySQL database. This is useful because you can give the database an argument (I think it's just "log" and a filename) to have it store every query that is being run in a log file. You could then tail -f this file and see what's going on when Amarok is freezing, i.e. which subset of queries is being run. This might help me figure out what is going on.
Comment 7 Matt Whitlock 2009-10-17 07:50:14 UTC
Created attachment 37630 [details]
mysqld.log.bz2

(In reply to comment #6)
> What would be useful would be if you guys can put your collections in an
> external mysql database. This is useful because you can give the database an
> argument (I think it's just "log" and a filename) to have it store every query
> that is being run in a log file. You could then tail -f this file and see
> what's going on when Amarok is freezing, i.e. what subset of queries is being
> run. This might help me figure out what is going on.

I already have my collection in an external database.  Attached is the log showing Amarok starting up, doing a collection scan, and shutting down.  I touched my music directory before starting Amarok in order to trigger the scan.

1:32:26  Amarok connects.
1:32:30  Amarok GUI freezes (see bug 210761).
1:33:23  Amarok GUI resumes.
1:33:24  Amarok initiates collection scan.
1:33:51  Disk activity ceases, progress bar halts at 87%.
1:34:44  Collection scan completes.
1:35:23  User chooses to quit Amarok.
Comment 8 Jeff Mitchell 2009-10-17 09:15:15 UTC
Unfortunately, there aren't timestamps in that file so it's hard to correlate which queries are taking long amounts of time.

Can you please put the following in /etc/mysql/my.cnf and try this again?

In the [mysqld] section, add/edit the following variables:

long_query_time = 0
slow_query_log = 1
slow_query_log_file=/var/log/mysqld.slow.query.log

This will cause it to log the time taken by any query that takes more than 0 seconds, i.e. everything.

Thanks!
Comment 9 Matt Whitlock 2009-10-17 09:37:51 UTC
(In reply to comment #8)
> Unfortunately, there aren't timestamps in that file

Begging your pardon, but there are timestamps in the file.  MySQL only prints them when the time changes.  View the file in an editor that doesn't do line-wrapping (such as nano), and you'll see the timestamps very clearly.

Even so, I will do as you request and post another file momentarily.
Comment 10 Matt Whitlock 2009-10-17 09:50:50 UTC
Created attachment 37631 [details]
mysqld_slow_query.log.bz2

(In reply to comment #8)
> Can you please put the following in /etc/mysql/my.cnf and try this again?
Comment 11 Jeff Mitchell 2009-10-17 15:28:31 UTC
Matt,

Are you running Git? (i.e. if I made you some patches, would you be able to try them out and report back?)

Thanks,
Jeff
Comment 12 Matt Whitlock 2009-10-17 15:31:34 UTC
(In reply to comment #11)
> Are your running Git? (i.e. if I made you some patches, would you be able to
> try them out and report back?)

$ git --version
git version 1.6.5

I use git almost daily for work.  And yes, I will try whatever you come up with and report back.
Comment 13 Myriam Schweingruber 2009-10-17 15:35:08 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > Are your running Git? (i.e. if I made you some patches, would you be able to
> > try them out and report back?)
> 
> $ git --version
> git version 1.6.5

I think the question was if you are using the git version of Amarok :)
Comment 14 Jeff Mitchell 2009-10-17 15:40:11 UTC
Created attachment 37635 [details]
Adds debug output to some SQL queries

When you see "are you running Git" in the context of Amarok, it means are you building and running Amarok from a git clone (vs. distro packages)   :-)

The mysql logs are interesting because you seem to be running four queries per track that shouldn't be happening (since a single query is being run to get the same value and should be populating the caches). This would indeed cause a big slowdown, and was one of the changes that should have made it into 2.2.0 as a speedup, but it seems like something went wrong.

Anyways, please apply this patch and test it out. Be sure to start amarok with --debug (if you haven't been doing that you'll need to do it twice -- the first time disregard the debug output, then close Amarok and reopen with --debug).

This will probably generate a *lot* of console output, so you will probably want to clear your terminal's scrollback right before running your scan, to keep things clear.

Thanks!
Comment 15 Matt Whitlock 2009-10-17 15:45:22 UTC
(In reply to comment #14)
> When you see "are you running Git" in the context of Amarok, it means are you
> building and running Amarok from a git clone (vs. distro packages)   :-)

I could build and use the git version if you'd prefer, but I am currently running 2.2.0 built by the Gentoo ebuild.

> Anyways, please apply this patch and test it out. Be sure to start amarok with
> --debug (if you haven't been doing that you'll need to do it twice -- the first
> time disregard the debug output, then close Amarok and reopen with --debug).

Okay, I will do this.  Do you want another mysqld_slow_query.log, or will the stderr+stdout from the patched Amarok be sufficient for debugging?

> This will probably generate a *lot* of console output, so you will probably
> want to clear backhistory right before running your scan, to keep things clear.

I'll redirect stderr->stdout and stdout-> a file.  Then I'll post that file here (probably bzip2'd again since it sounds like it'll be big).
Comment 16 Jeff Mitchell 2009-10-17 15:53:39 UTC
(In reply to comment #15)
> I could build and use the git version if you'd prefer, but I am currently
> running 2.2.0 built by the Gentoo ebuild.

Gentoo used to have a live ebuild; doesn't look like they do anymore, though, even in the kde overlay.

> Okay, I will do this.  Do you want another mysqld_slow_query.log, or will the
> stderr+stdout from the patched Amarok be sufficient for debugging?

stderr + stdout. The other output was very helpful though, thanks.

> I'll redirect stderr->stdout and stdout-> a file.  Then I'll post that file
> here (probably bzip2'd again since it sounds like it'll be big).

Sure, that's fine.
Comment 17 Matt Whitlock 2009-10-17 15:59:57 UTC
(In reply to comment #16)
> Gentoo used to have a live ebuild; doesn't look like they do anymore, though,
> even in the kde overlay.

You mean this?:
http://git.overlays.gentoo.org/gitweb/?p=proj/kde.git;a=blob;f=media-sound/amarok/amarok-9999.ebuild;h=902fb0569652144ee819ec829b550071fd64d9e4;hb=HEAD

I'll use that as a base for your patch since it doesn't apply against 2.2.0.
Comment 18 Jeff Mitchell 2009-10-17 16:10:31 UTC
Odd. My eix is messed up (although it finds other things in that overlay). Or they're not generating proper metadata for that ebuild...wouldn't be the first time.

Anyways, yes, that would work as a basis for the patch.
Comment 19 Matt Whitlock 2009-10-17 16:31:14 UTC
Created attachment 37636 [details]
debug log resulting from patched Amarok

I applied your patch (perfectly cleanly) against commit bdf8b49ca2c6e7a27088c940ef9c3636eab4fb0e, built and installed Amarok, ran it once with --debug, quit it, touched my Music dir, and ran it again with --debug with the output streams redirected to a file.  This attachment is that file, compressed with bzip2.
Comment 20 Jeff Mitchell 2009-10-17 16:58:58 UTC
Were your previous scans also incremental scans? This output doesn't really match what I'm expecting.

Do you see different freezing behavior for incremental vs. full scans?
Comment 21 Matt Whitlock 2009-10-17 17:07:26 UTC
(In reply to comment #20)
> Were your previous scans also incremental scans? This output doesn't really
> match what I'm expecting.
> 
> Do you see different freezing behavior for incremental vs. full scans?

I did exactly what I have been doing, which is to 'touch' my Music dir to update its mtime and then launch Amarok.  I don't know what constitutes an incremental scan versus a full scan.  It took about a minute for Amarok to dump all the data into the database at the completion of the disk scan (which took only a few seconds, since I suppose all the necessary disk blocks were already in memory).

Note, this git version still experiences the GUI hang while it looks through the directory structure on my removable media.  That occurs before the collection scan begins and takes about a minute, so in total there's about two minutes after I start Amarok before I can effectively use it, and that's when all my media files' tags are already in the block layer cache.
Comment 22 Egbert König 2009-10-18 13:18:27 UTC
Hello Jeff,

I will try to reproduce this issue with my installation.

Egbert
Comment 23 Egbert König 2009-10-18 15:03:09 UTC
Hello Jeff,

I see two problems:

1. The progress bar is not updated while inserting into urls_temp, which takes up most of the time for a huge directory like in this case.

2. Before inserting into or updating urls_temp, this table is SELECTed for each track. On my system (openSUSE 64-bit with 8 GB RAM and a 2.4 GHz quad core) this amounts to roughly 0.5 to 1 sec for selecting plus inserting/updating one track. That adds up to 1500 - 3000 secs for 3000 tracks (while the progress bar is not updated).

What do you think about using a hash map for uniqueid and another one for deviceid plus rpath? That way the SELECTs can be avoided. I guess the WHERE clause over rpath does not scale too well if many tracks share the same directory.

I will try to use a hash map during the next days.
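
For illustration, a minimal sketch of the per-track cache Egbert is describing, written against Qt as Amarok uses it. The member m_urlIdCache and the helpers querySingleValue() and escape() are made up for this example; this is not Amarok's actual code:

// Cache urls ids by (deviceid, rpath) so the per-track SELECT against
// urls_temp can usually be skipped.
QHash<QString, int> m_urlIdCache;

int cachedUrlId( int deviceId, const QString &rpath )
{
    const QString key = QString::number( deviceId ) + QLatin1Char( ':' ) + rpath;
    QHash<QString, int>::const_iterator it = m_urlIdCache.constFind( key );
    if( it != m_urlIdCache.constEnd() )
        return it.value();          // cache hit: no query needed

    // Cache miss: fall back to one SELECT, then remember the result.
    const int id = querySingleValue( QString(
        "SELECT id FROM urls_temp WHERE deviceid = %1 AND rpath = '%2';" )
        .arg( deviceId ).arg( escape( rpath ) ) );
    m_urlIdCache.insert( key, id );
    return id;
}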
Comment 24 Jeff Mitchell 2009-10-18 15:35:25 UTC
Egbert,

I assume you are talking about the urlId function. In order to use local hash maps we'd have to read the entire table out first, load into hashmaps, do the processing there, and then load back out to the table. I assume that will still be faster -- what do you think?

I should be able to work on it tonight, so don't spend your time on it for now (unless you *really* want to  :-)  )

Also -- what do you think about queueing up the REPLACE INTO statements -- we could concatenate statements in the VALUES part, AFAIK, although we'd probably want to flush to DB at points so that the buffer in memory doesn't get too big. But you can do VALUES( blah, blah2, blah3, blah4), (blah5, blah6, blah7, blah8), right?
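
The multi-row VALUES form is indeed standard MySQL syntax, and it works for REPLACE INTO as well as for INSERT. A purely illustrative example (runQuery() is a hypothetical helper and the column names are placeholders, not the real schema):

runQuery( "REPLACE INTO urls_temp (deviceid, rpath) "
          "VALUES (1, './a.ogg'), (1, './b.ogg'), (1, './c.ogg');" );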

One thing I'm really curious about, however, is why some people saw such massive speedups in 2.2, and others saw massive speed decreases. Overall, the number of SQL queries per track was reduced by 4 or 5 queries per track, so I'm surprised it could have gotten slower for anyone at all.
Comment 25 Matt Whitlock 2009-10-18 15:44:31 UTC
Shouldn't you folks be using prepared statements to save the DBMS the work of parsing thousands of identically structured queries?  In my brief glance at the code, I saw what appeared to be expressions constructing SQL queries with data values inlined right into them.  Isn't it better to parse the statement once as a prepared statement and then just fill in the blanks and execute the statement as many times as you need?
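For reference, a minimal sketch of what that could look like with the MySQL C API that Amarok links against (error checking omitted; the function, table, and column names are placeholders for this example, not the real schema):

#include <mysql/mysql.h>
#include <cstring>

void insertUrls( MYSQL *db, const char *const *rpaths, unsigned long count )
{
    MYSQL_STMT *stmt = mysql_stmt_init( db );
    const char *query = "INSERT INTO urls_temp (rpath) VALUES (?)";
    mysql_stmt_prepare( stmt, query, strlen( query ) );   /* parsed once */

    for( unsigned long i = 0; i < count; ++i )
    {
        MYSQL_BIND bind;
        memset( &bind, 0, sizeof( bind ) );
        unsigned long length = strlen( rpaths[i] );
        bind.buffer_type   = MYSQL_TYPE_STRING;
        bind.buffer        = const_cast<char *>( rpaths[i] );
        bind.buffer_length = length;
        bind.length        = &length;

        mysql_stmt_bind_param( stmt, &bind );   /* fill in the blank... */
        mysql_stmt_execute( stmt );             /* ...and execute, no re-parse */
    }
    mysql_stmt_close( stmt );
}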
Comment 26 Jeff Mitchell 2009-10-18 16:40:13 UTC
(In reply to comment #25)
> Shouldn't you folks be using prepared statements to save the DBMS the work of
> parsing thousands of identically structured queries?  In my brief glance at the
> code, I saw what appeared to be expressions constructing SQL queries with data
> values inlined right into them.  Isn't it better to parse the statement once as
> a prepared statement and then just fill in the blanks and execute the statement
> as many times as you need?

Maybe. You're probably making the mistake of thinking that we actually know anything about SQL.  :-P
Comment 27 Jeff Mitchell 2009-10-18 18:45:20 UTC
OK, I've looked into prepared statements, and it shouldn't be too hard to make use of them.

However.

Which do you (Egbert, Matt) think will be most beneficial: queueing up many values in the REPLACE INTO statements (thereby reducing the number of calls), or using a prepared statement (still the same number of calls, but without parsing)? Or both? My guess is that the former would largely overpower the latter, but I'm not sure.
Comment 28 Egbert König 2009-10-18 19:15:11 UTC
Jeff,

you're right, I'm talking about the urlId function. I would use the hash maps to replace the SELECTs from urls_temp. I would leave the UPDATE and INSERT statements unchanged, as I believe that the SELECT statements take most of the time. The reason is that in a setup with only a few directories containing thousands of tracks each, the key values are pretty similar, so long strings have to be compared before differences can be found.

I am not sure about the syntax for REPLACE INTO statements. As far as I know this is MySQL dialect. I'm not familiar with the MySQL database; I normally work with Ingres or PostgreSQL, and I do not know REPLACE INTO from those databases.

(In reply to comment #24)
> Egbert,
> 
> I assume you are talking about the urlId function. In order to use local hash
> maps we'd have to read the entire table out first, load into hashmaps, do the
> processing there, and then load back out to the table. I assume that will still
> be faster -- what do you think?
> 
> I should be able to work on it tonight, so don't spend your time on it for now
> (unless you *really* want to  :-)  )
> 
> Also -- what do you think about queueing up the REPLACE INTO statements -- we
> could concatenate statements in the VALUES part, AFAIK, although we'd probably
> want to flush to DB at points so that the buffer in memory doesn't get too big.
> But you can do VALUES( blah, blah2, blah3, blah4), (blah5, blah6, blah7,
> blah8), right?
> 
> One thing I'm really curious about, however, is why some people saw such
> massive speedups in 2.2, and others saw massive speed decreases. Overall, the
> number of SQL queries per track was reduced by 4 or 5 queries per track, so I'm
> surprised it could have gotten slower for anyone at all.
Comment 29 Egbert König 2009-10-18 19:20:54 UTC
Jeff,

I do believe that you will get the best effect from the combination of:

- leaving out the frequent SELECT statements in urlId by using hash maps there

- using prepared INSERT/REPLACE/UPDATE statements

I do not believe that concatenating the INSERT/REPLACE statements will give much speedup. The data still has to be transferred from Amarok to the server. Most of the database work for these statements will be reorganizing the index structures, which will be similar for concatenated or single statements.

Prepared statements will save compilation and query optimization time in the database server.

(In reply to comment #27)
> OK, I've looked into prepared statements, and it shouldn't be too hard to make
> use of them.
> 
> However.
> 
> Which do you (Egbert, Matt) think will be most beneficial: queueing up many
> values in the REPLACE INTO statements (thereby reducing the number of calls),
> or using a prepared statement (still same number of calls but without parsing).
> Or both. My guess is that the former would largely overpower the latter, but
> I'm not sure.
Comment 30 Egbert König 2009-10-18 19:27:36 UTC
(In reply to comment #24)
The speedup is great as long as you have a directory tree without too many tracks in a single directory. Having only a few directories with many tracks in each has a couple of disadvantages:

- the index operations in the database do not scale well (-> long time to work on 1 track)
- thousands of files have to be scanned whenever one track changes in a directory (-> long time per track * thousands of tracks == long waiting).

@mathesis: why don't you let Amarok organize your files? It will lay them out in a nicely structured directory tree, and you will get good collection scanning performance.

> Egbert,
> 
...
> 
> One thing I'm really curious about, however, is why some people saw such
> massive speedups in 2.2, and others saw massive speed decreases. Overall, the
> number of SQL queries per track was reduced by 4 or 5 queries per track, so I'm
> surprised it could have gotten slower for anyone at all.
Comment 31 Jeff Mitchell 2009-10-18 20:07:13 UTC
(In reply to comment #29)
> - using prepared INSERT/REPLACE/UPDATE statements
> 
> I do not believe that concatenating the INSERT/REPLACE statements will give
> much speedup. The data have still to be transferred from amarok to the server.
> Most database work with regard to these statements will be reorganizing the
> index structures, which will be similar for concatenated or single statements.
> 
> Prepared statements will save compilation and query optimization time in the
> database server.

You seem to contradict yourself here, so let me try to clarify.

If I do multiple INSERT (which is similar to REPLACE INTO) value sets in one statement, then it only has to compile and optimize the query once, similar to setting up a prepared statement.

You say that the largest amount of time is probably Amarok transferring the data over to the server -- so then do prepared statements vs. many values in one statement make a big difference? Either way it's only compiled once, right?

I do know that with prepared statements you can use an on-wire binary protocol through the C API. I'm playing with that now, but it'll be a pain with the long INSERT for each track -- do you know how much it would actually help?
Comment 32 Jeff Mitchell 2009-10-18 20:08:32 UTC
(In reply to comment #30)
> (In reply to comment #24)
> The speedup is great as long as you have a directory tree with not too much
> tracks in a single directory. Having only few directories with many tracks in
> each has a couple of disadvantages:
> 
> - the index operations in the database do not scale good (-> long time to work
> on 1 track)

Any idea how to scale these better?

> - thousands off files have to be scanned, whenever one track changes in a
> directory (-> long time per track * thousands off tracks == long waiting).

Yes. In 2.2.1 (current Git) there are some optimizations to at least prevent unnecessary scanning of subdirectories in such a structure.
Comment 33 Egbert König 2009-10-18 22:01:21 UTC
Hello Jeff,

I do believe that most of the time for the INSERT/REPLACE/UPDATE statements is spent on reorganizing the index. The time for data transfer between Amarok and the database server will be approximately the same for single statements and for concatenated statements. Therefore data transfer times probably do not matter for *this* question. I would not try to use special binary transfer methods. If this were significant, we would also have bad scanning times with other directory layouts.

Prepared statements will save compilation and query planning/optimization time in the database server. This *might* be significant. Therefore I would try this too.

If you concatenate the INSERT/REPLACE statements, you will also save the compilation time in the SQL server. But you might get huge SQL statements. There may be restrictions on the length of those statements, and you might get failures because of that. Those failures might be hard to handle. Therefore I would not try to concatenate the INSERT/REPLACE statements now.

(In reply to comment #31)
> 
> You seem to contradict yourself here, so let me try to clarify.
> 
> If I do multiple INSERT (which is similar to REPLACE INTO) value sets in one
> statement, then it only has to compile and optimize the query once, similar to
> setting up a prepared statement.
> 
> You say that the largest amount of time is probably Amarok transferring the
> data over to the server -- so then do prepared statements vs. many values in
> one statement make a big difference? Either way it's only compiled once, right?
> 
> I do know that with prepared statements you can use an on-wire binary protocol
> through the C API. I'm playing with that now, but it'll be a pain with the long
> INSERT for each track -- do you know how much it would actually help?
Comment 34 Jeff Mitchell 2009-10-18 23:33:18 UTC
I have a branch where I'm trying prepared statements for some of the inserts, but am hitting a segfault in the MySQL C API. In the meantime I've been looking at storing some values in hash maps. Things get a bit complicated because of the checkExistingAlbums function.

You can do prepared statements in straight SQL which isn't as efficient, but the downside is that you have to set variables to the values you want to execute, which means another round trip to the server. I'm not sure whether that round trip + prepared statement execution is worth it, compared to simply having the (simple) syntax for the INSERTs to be parsed each time.
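
For clarity, a sketch of that "straight SQL" flow (runQuery() is a hypothetical helper that just sends one statement to the server; note the extra SET round trip before every EXECUTE):

runQuery( "PREPARE insert_url FROM 'INSERT INTO urls_temp (rpath) VALUES (?)';" );
runQuery( "SET @rpath = './Music/track01.ogg';" );   // extra round trip per value
runQuery( "EXECUTE insert_url USING @rpath;" );
runQuery( "DEALLOCATE PREPARE insert_url;" );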

However, the hardest part is figuring out what/when hashmaps would be appropriate. If I transfer the contents of tables into hashmaps before scanning, you get a hit from transferring all that data from the SQL server, plus this could also result in huge memory usage if you're (say) doing an incremental scan and populating these with tens of thousands of string entries each. But mainly the time of doing that long lookup.

Given that, it seems to me like the two best things I can do right now are 1) getting rid of the SELECT in urlId by combining it with the statements in databaseIdFetch and 2) queueing up multiple inserts which would have a large impact (according to the mysql guys) but makes checkExistingAlbums quite complicated.
Comment 35 Jeff Mitchell 2009-10-19 00:05:23 UTC
Man, this is frustrating.

As it turns out (after I'd already coded it) you can't invoke the same temp table more than once in a union. So it worked in my CLI testing but fails in Amarok (http://bugs.mysql.com/bug.php?id=7742)

So I guess perhaps the answer is to push the URLs table into hashes (and just URLs) and do updating in there -- assuming that I can then make checkExistingAlbums work.
Comment 36 Matt Whitlock 2009-10-19 10:44:10 UTC
I'm not sure if you are locking the tables you are updating.  If you aren't, then Egbert has a point in comment #33 about the rewriting of the indexes being the overwhelming bottleneck.

From the MySQL manual:

«If you are going to run many operations on a set of MyISAM tables, it is much faster to lock the tables you are going to use. Locking MyISAM tables speeds up inserting, updating, or deleting on them because MySQL does not flush the key cache for the locked tables until UNLOCK TABLES is called. Normally, the key cache is flushed after each SQL statement.»

This does lend credence to the idea of inserting many rows in a compound INSERT statement (since the key cache will only be flushed after the entire statement executes), but I agree with Egbert's concerns that the statements will get really huge, and there are limits on query length in the server.
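
For illustration, the locking pattern the manual quote describes, again with a hypothetical runQuery() helper and example table names:

runQuery( "LOCK TABLES urls WRITE, tracks WRITE;" );
// ... run the whole batch of INSERT/UPDATE/REPLACE statements here; the key
// cache is not flushed after each statement while the lock is held ...
runQuery( "UNLOCK TABLES;" );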

Why is the binary protocol a big deal?  Using prepared statements should be as simple as this:

http://forge.mysql.com/wiki/Connector_C++#Using_Prepared_Statements
Comment 37 Jeff Mitchell 2009-10-19 13:00:08 UTC
(In reply to comment #36)
> I'm not sure if you are locking the tables you are updating.  If you aren't,

Locking tables is ignored for temporary tables, because they're unique to a particular session anyways, although I guess there could be a benefit to locking the permanent tables at the point we're copying from the temporary tables back into the permanent ones.

> Why is the binary protocol a big deal?  Using prepared statements should be as
> simple as this:
> 
> http://forge.mysql.com/wiki/Connector_C++#Using_Prepared_Statements

I didn't say it was a big deal, I said I'm hitting a segfault.

Also, we use the C API, not the C++ one.
Comment 38 Matt Whitlock 2009-10-19 13:05:37 UTC
(In reply to comment #37)
> Also, we use the C API, not the C++ one.

This is interesting, since Amarok is a C++ program, isn't it?  Is the C API significantly faster?  I'm interested in why the decision was made to go with the C API, since it is likely that I will some day write a C++ program that wants to connect to a MySQL server.  Any pointers you could give would be appreciated.
Comment 39 Jeff Mitchell 2009-10-19 13:22:46 UTC
I'm not really sure. At this point, probably for historical reasons (much of the mysql code hasn't changed since the early days of Amarok 1, other than minor bugfixes). The official MySQL connector was only released in late 2008, and switching our code to it is something that may be possible for e.g. 2.3, but that we'd want to be very careful about. Certainly there are some good reasons for such a cleaner, easier API, but we also don't want to break things :-)

Also, there are various C++ wrappers, at least in Gentoo portage. None of them are officially from MySQL, and I don't know which (if any) would be on other distributions. So it avoids a potentially troublesome dependency. We could include the connector code in our code, but distros *really* don't like that.
Comment 40 Jeff Mitchell 2009-10-20 02:11:21 UTC
I just committed some code related to this. It's not as extensive as I'd like, but it's a start.

Please, if you can, use a build from yesterday or so and scan twice (to make caches hot) and see how long it takes. Then do the same with current git (at least b563008) and see if there are speedups that are noticeable.

Thanks!
Comment 41 Matt Whitlock 2009-10-20 02:58:09 UTC
(In reply to comment #40)
> I just commited some code related to this. It's not as extensive as I'd like,
> but it's a start. 
> 
> Please, if you can, use a build from yesterday or so and scan twice (to make
> caches hot) and see how long it takes. Then do the same with current git (at
> least b563008) and see if there are speedups that are noticeable.
> 
> Thanks!

I switched to commit d4e384af4958ffc6885c941002a3b57a8f761dc5, which was the last commit before today.  However, the media-sound/amarok-utils-9999 ebuild now fails to build with the following error:

CMake Error at utilities/updatesigner/CMakeLists.txt:3 (find_package):
  Could not find module FindQCA2.cmake or a configuration file for package
  QCA2.

  Adjust CMAKE_MODULE_PATH to find FindQCA2.cmake or set QCA2_DIR to the
  directory containing a CMake configuration file for QCA2.  The file will
  have one of the following names:

    QCA2Config.cmake
    qca2-config.cmake

I do have app-crypt/qca-2.0.2-r2 installed, but there are no CMake files installed with it.

Do you suppose I can use amarok-utils-2.2.0 with amarok-9999?
Comment 42 Jeff Mitchell 2009-10-20 14:07:11 UTC
Try 11ff977b0731fd94dc5cde90d5d15d7a7251cc37 (before QCA was added) and current HEAD (disabled building of that module).

Unfortunately, it looks like I introduced a regression in that commit, so it's likely that you're not going to see as much speedup as I'd hoped. Going to tackle the next method soon...
Comment 43 Matt Whitlock 2009-10-20 15:22:32 UTC
(In reply to comment #40)
> Please, if you can, use a build from yesterday or so and scan twice (to make
> caches hot) and see how long it takes. Then do the same with current git (at
> least b563008) and see if there are speedups that are noticeable.

1. Dropped database amarokdb.
2. Started Amarok.
3. Allowed scan to complete.
4. Quit Amarok.

I performed the above four steps twice in succession (there was almost no disk activity on the second run) and measured the elapsed time on the second run between launching Amarok and seeing the progress bar disappear.

11ff977b0731fd94dc5cde90d5d15d7a7251cc37: 2 minutes 0 seconds
72c81f3a3670723b43509a64a8b49108046dcc18: 1 minute 59 seconds

No significant difference.
Comment 44 Jeff Mitchell 2009-10-20 16:17:22 UTC
Yep, that's what I feared. It was damn fast when I tried it because of a bug (oversight); fixing it caused the number of queries to rise again.

I have another angle to attack which should trade off memory usage for speed increases, but the memory usage shouldn't be too significant, depending on the size of one's collection (maybe ten MB for an *extremely* large collection; most probably more like 1 MB).

I have high hopes for this approach.
Comment 45 Jeff Mitchell 2009-10-23 04:40:09 UTC
Folks,

Please see http://mail.kde.org/pipermail/amarok/2009-October/009551.html and report back with benchmark results if you can.

Thanks,
Jeff
Comment 46 Matt Whitlock 2009-10-23 18:18:31 UTC
(In reply to comment #45)
> Folks,
> 
> Please see http://mail.kde.org/pipermail/amarok/2009-October/009551.html and
> report back with benchmark results if you can.
> 
> Thanks,
> Jeff

Even after restarting Amarok following a full scan, the "uidhash" version of Amarok shows only 2325 tracks in my collection, which is well short of the 5046 I should have.

On the upside it now takes only 1 minute 54 seconds to start up from a non-existent database, so you shaved a good 5-6 seconds off the time.  I'd like my whole collection present, though. :-/

Honestly, I don't think you saved too much by doing away with all those SELECTs.  What I think is really killing you is having to rewrite all the indexes after every INSERT.  A secondary drag may be that you're not pipelining queries to the server, so there's an awful lot of thread scheduling overhead.

Dangit, it's times like this where I wish I didn't need to work for a living so I could hop on the FOSS train and help you out with Amarok.
Comment 47 Jeff Mitchell 2009-10-23 18:52:13 UTC
> Even after restarting Amarok following a full scan, the "uidhash" version of
> Amarok shows only 2325 tracks in my collection, which is well short of the 5046
> I should have.

Yeah, working on fixing that. It wasn't a problem with my test collection because I have sensible tags. (With sensible meaning "not different cases of the same thing" in this instance :-) )

> Honestly, I don't think you saved too much by doing away with all those
> SELECTs.

It wasn't all about SELECTS. There are now no INSERTS (or SELECTS, or UPDATES) that take place except for the very beginning (loading the hashes, from about 8 SELECT statements) and the very end (writing back from the hashes).

The writing back is done using multiple values per INSERT. Essentially, I ask the database for the longest query length it can support, and then insert values in a single INSERT statement up until I hit that length, at which point I wrap around to a new INSERT. This means for most collections (even large ones), a single INSERT for 5 or 6 of the 8 or so tables that I made these hashes for, and only a couple INSERTS for the other larger ones (like tracks).
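
For illustration, a sketch of that chunking strategy (not the actual commit; runQuery(), rows, and maxQueryLength, e.g. taken from the server's reported maximum query length, are placeholders for this example):

QString query = "INSERT INTO tracks_temp (url, title, artist) VALUES ";
const QString header = query;
bool first = true;
foreach( const QString &valueTuple, rows )   // each entry like "(1,'Foo','Bar')"
{
    if( !first && query.length() + valueTuple.length() + 2 > maxQueryLength )
    {
        runQuery( query + ';' );   // flush this chunk as one INSERT
        query = header;
        first = true;
    }
    if( !first )
        query += ',';
    query += valueTuple;
    first = false;
}
if( !first )
    runQuery( query + ';' );       // flush the final chunk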

>  What I think is really killing you is having to rewrite all the
> indexes after every INSERT.

Which shouldn't be a problem anymore. At least, there are now minimal INSERTs.

> A secondary drag may be that you're not pipelining
> queries to the server, so there's an awful lot of thread scheduling overhead.

Can you explain this?

> 
> Dangit, it's times like this where I wish I didn't need to work for a living so
> I could hop on the FOSS train and help you out with Amarok.

I work for a living too...
Comment 48 Matt Whitlock 2009-10-23 19:09:40 UTC
(In reply to comment #47)
> > A secondary drag may be that you're not pipelining
> > queries to the server, so there's an awful lot of thread scheduling overhead.
> 
> Can you explain this?

This isn't going to be as big a deal now since you're issuing far fewer queries.  When you submit a query to the DBMS, the processing time isn't just how long it takes the database to parse, optimize, and execute your query and encode and transmit the results; you also have the time that the data of the query and the data of the response spend sitting in the kernel's socket buffers waiting for the thread that is going to read them out to become the active thread.  If you submit each query and wait to read the response before moving on to submitting the next query, you're incurring the overhead of two thread context switches per query, and that indeed may dwarf the overhead incurred in actually executing the queries.  If you were to pipeline the queries (see the Wikipedia article on "HTTP pipelining"), you could dramatically decrease your overhead.
Comment 49 Jeff Mitchell 2009-10-26 19:10:54 UTC
Hi there,

All currently known bugs in that branch have been fixed, and we're planning a merge into mainline as of tomorrow. I'd appreciate if you could give this another look and see if you run into problems.

Thanks,
Jeff
Comment 50 Myriam Schweingruber 2009-11-26 14:10:40 UTC
Closing for lack of feedback. Amarok 2.2.1 was released weeks ago, so unless somebody can reproduce this with a current 2.2-git build, there is no need to reopen this.