Bug 251762 - Amarok only scans partial collection
Summary: Amarok only scans partial collection
Status: RESOLVED FIXED
Alias: None
Product: amarok
Classification: Applications
Component: Collections/Local (show other bugs)
Version: 2.4-GIT
Platform: openSUSE Linux
: NOR normal
Target Milestone: 2.4.0
Assignee: Amarok Developers
URL:
Keywords:
: 252136 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-09-19 22:24 UTC by Kelv
Modified: 2010-10-26 20:20 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In: 2.4


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kelv 2010-09-19 22:24:00 UTC
Version:           2.3.2-GIT (using KDE 4.5.1) 
OS:                Linux

I have a music library of over 17,000 files. When I do a full scan of my music library, it manages to get to about 15% of the total before the gauge in the corner stops completely and disappears. However, Amarok does not crash. I am then left with a collection of 2,363 tracks.

The console output has been posted under additional information.

I built a git version of Amarok about a week or two ago and this did error did not happen, this current git build does.

An old bug report seemed to have documented this bug on an old version - http://bugs.kde.org/show_bug.cgi?id=210096

Reproducible: Always

Steps to Reproduce:
1. Go to Configure > Collection
2. Click on 'Fully Rescan Entire Collection'


Actual Results:  
Incomplete collection (15% of actual collection)

Expected Results:  
I wanted my whole library to be in the collection.

Using the "--debug" option on the command line, this is outputted:

TagLib: MPEG::Header::parse() -- Invalid sample rate.
TagLib: MPEG::Header::parse() -- Invalid sample rate.
TagLib: MPEG::Header::parse() -- Invalid sample rate.
TagLib: MPEG::Header::parse() -- Invalid sample rate.
TagLib: MPEG::Header::parse() -- Invalid sample rate.
TagLib: MPEG::Header::parse() -- Invalid sample rate.
amarok:   [ScanManager] [ERROR!] Collection scanner abort error:  3 
amarok:   [ScanManager] Trying to restart the QXmlStreamReader.. 
amarok:   [ScanManager] Success. Committing result to database. 
amarok:   BEGIN: void DatabaseUpdater::cleanPermanentTables() 
amarok:   END__: void DatabaseUpdater::cleanPermanentTables() - Took 0.0026s 
amarok:   BEGIN: void ScanResultProcessor::copyHashesToTempTables() 
amarok:     [ScanResultProcessor] obtained max_allowed_packet is  "1048576" 
amarok:     [ScanResultProcessor] urls key size is  2363 
amarok:     [ScanResultProcessor] tracks key size is  2363 
amarok:   END__: void ScanResultProcessor::copyHashesToTempTables() - Took 0.25s 
amarok:   [ScanResultProcessor] temp_tracks:  ("2363") 
amarok:   [ScanResultProcessor] tracks before commit:  ("0") 
amarok:   BEGIN: void DatabaseUpdater::copyToPermanentTables() 
amarok:   END__: void DatabaseUpdater::copyToPermanentTables() - Took 0.15s 
amarok:   BEGIN: virtual void Dynamic::BiasedPlaylist::invalidate() 
amarok:   END__: virtual void Dynamic::BiasedPlaylist::invalidate() - Took 8.9e-05s 
amarok:   [ScanResultProcessor] tracks after commit:  ("2363") 
amarok:   BEGIN: void DatabaseUpdater::removeTemporaryTables() 
amarok:      Initialized thread, count== 5 
amarok:     BEGIN: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) 
amarok:       [CollectionTreeItemModelBase] Received special data:  0 
amarok:     END__: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) - Took 0.00079s 
amarok:   END__: void DatabaseUpdater::removeTemporaryTables() - Took 0.031s 
amarok:   BEGIN: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) 
amarok:     [CollectionTreeItemModelBase] Received special data:  11 
amarok:     [ScanResultProcessor] Sending changed signal 
amarok:   END__: virtual void XmlParseJob::run() - DELAY Took (quite long) 49s 
amarok: END__: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) - Took 0.061s 
amarok: BEGIN: void Collections::SqlCollection::updateTrackUrlsUids(const ChangedTrackUrls&, const QHash<QString, QString>&) 
amarok: END__: void Collections::SqlCollection::updateTrackUrlsUids(const ChangedTrackUrls&, const QHash<QString, QString>&) - Took 0.046s 
amarok: BEGIN: virtual void Dynamic::BiasedPlaylist::invalidate() 
amarok: END__: virtual void Dynamic::BiasedPlaylist::invalidate() - Took 7.6e-05s 
amarok: BEGIN: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) 
amarok:   [CollectionTreeItemModelBase] Received special data:  0 
amarok: END__: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const Meta::DataList&) - Took 8.7e-05s 
amarok: BEGIN: virtual XmlParseJob::~XmlParseJob() 
amarok:   BEGIN: void StatusBar::hideProgress() 
amarok:   END__: void StatusBar::hideProgress() - Took 0.00043s 
amarok: END__: virtual XmlParseJob::~XmlParseJob() - Took 0.0012s 
amarok: BEGIN: void CollectionTreeItemModelBase::handleSpecialQueryResult(CollectionTreeItem::Type, Collections::QueryMaker*, const
Comment 1 Myriam Schweingruber 2010-09-20 01:16:00 UTC
Ralf or Jeff, could this be caused by some recently committed code?
Comment 2 Jeff Mitchell 2010-09-20 13:38:31 UTC
From the debug output, it looks like there is a problematic file in your collection that is causing the scanner to crash -- so after a few tries it gives up and only scans in what it has read up to that point. This is actually usually a problem with Taglib, not Amarok.

If you look in ~/.kde4/share/apps/amarok there should be a file called collection_scan.log. This will hold the last file considered by the scanner. If you move that file out of the containing folder it will probably continue (at least further -- if it stops again repeat that and see if it's another file, and so on).

Give that a try -- if it works, send a link (in private email) to where we can find the file(s) for testing.

Thanks!
Comment 3 Kelv 2010-09-20 15:16:57 UTC
(In reply to comment #2)
> From the debug output, it looks like there is a problematic file in your
> collection that is causing the scanner to crash -- so after a few tries it
> gives up and only scans in what it has read up to that point. This is actually
> usually a problem with Taglib, not Amarok.
> 
> If you look in ~/.kde4/share/apps/amarok there should be a file called
> collection_scan.log. This will hold the last file considered by the scanner. If
> you move that file out of the containing folder it will probably continue (at
> least further -- if it stops again repeat that and see if it's another file,
> and so on).
> 
> Give that a try -- if it works, send a link (in private email) to where we can
> find the file(s) for testing.
> 
> Thanks!

I forgot to add in the original post that my music collection is located on an external NTFS partition. Not sure how relevant that would be.

Jeff, I cannot seem to do this. When the gauge disappears at 15% and Amarok stops adding files to the collection, there are still TagLib messages being output to the console, with activity happening on my external hard drive. It appears to me that something is still scanning, but nothing is saved into the db. The collection.log file is changing with each file it scans. The problem is that I cannot find what file causes the stop at 15% because the TagLib messages pass so quickly and because they have no file name information.
 
For the hell of it I also tried changing the version of TagLib I compiled against from 1.6.3-0.pm.1.1 to 1.6.2-1.9 and the issue was still there.
Comment 4 Jeff Mitchell 2010-09-20 15:59:48 UTC
OK, let's attack this a different way.

amarokcollectionscanner is actually its own binary. Can you please try scanning with (this assumes you do recursive scanning):

amarokcollectionscanner -r [collection-folder]

Where [collection-folder] is the path to one of your collection folders. You can also specify more than one if you have more than one top-level folder, but this helps narrow it down a bit.

If a file is causing a problem, the output should stop without closing XML tags -- hopefully it'll be pretty apparent. 

See if you can figure out which file(s) are causing the problem this way...thanks!
Comment 5 Kelv 2010-09-21 01:53:32 UTC
(In reply to comment #4)
> OK, let's attack this a different way.
> 
> amarokcollectionscanner is actually its own binary. Can you please try scanning
> with (this assumes you do recursive scanning):
> 
> amarokcollectionscanner -r [collection-folder]
> 
> Where [collection-folder] is the path to one of your collection folders. You
> can also specify more than one if you have more than one top-level folder, but
> this helps narrow it down a bit.
> 
> If a file is causing a problem, the output should stop without closing XML tags
> -- hopefully it'll be pretty apparent. 
> 
> See if you can figure out which file(s) are causing the problem this
> way...thanks!

I managed to isolate some files that can reproduce the issue by modifying the source of ScanManager.cpp to output what file it was on before it gives "Collection scanner abort error: 3". I placed these files in a seperate folder (with their respective sub-directories) and ran the amarokcollectionscanner on these files, it listed all the files in the folder (about 28). When I ran Amarok and set the collection directory as this folder, it managed about 3.

I've tested this on the new Amarok 2.3.2 that was released yesterday and this still happens. :(

Jeff, I will send you a link via private email so you can test this against the same files/directory structure.
Comment 6 Jeff Mitchell 2010-09-21 20:38:40 UTC
Seems that Qt thinks that there is malformed XML from the scanner, which is too bad since Qt is the one creating the XML...

Still looking at it.
Comment 7 Jeff Mitchell 2010-09-21 23:28:44 UTC
Some updates...

There are two problems going on here. I've identified one of them; haven't identified the other yet.

The first problem is what appears to be a bug in Qt itself, which was exposed by some changes in Amarok's source before 2.3.2 to fix another bug. Many of the files you linked to have "<Unknown>" in the genre tag. When writing this out to XML, Qt should be being escaped like: "&lt;Unknown&gt;" However, what's actually happening is that it is being escaped as: "&lt;Unknown>", so when the processor at the other end is reading in the XML, it thinks that is the end of an entry and as such it dies saying that the entry is invalid.

We're trying to find a workaround for this problem; if we escape it before sending it off to be serialized into XML, it ends up double-escaping it. I fear that we will need to do a specialized sort of marking, replacing < with AMAROKXMLLT, > with AMAROKXMLGT, and so on when it's being written out; and then in the app replacing those back with the proper characters.

However, fixing just this still doesn't allow all tracks to appear, so there's another problem that I'm still figuring out.
Comment 8 Thomas Zander 2010-09-21 23:38:23 UTC
re c7, I think I found a thinko in your analisis. Both <foo bar="&lt;xx>"> and <foo>&lt;bar></foo> are valid XML and would be parsed properly in Qt AFAIK.
Comment 9 Jeff Mitchell 2010-09-22 01:06:52 UTC
I think your AFAIK is not correct. You're correct in the tag being "&lt;Unknown>" is fine -- Qt escaping half of it and not the other half isn't AFAICT problematic, but it *is* inconsistent.

But the QXmlStreamReader on the other end does error out when it sees this. I'm working on a test app showing the behavior.
Comment 10 Jeff Mitchell 2010-09-22 19:50:30 UTC
OK, I have answers.

The problem stems, at it root, from you having corrupted ID3v1 tags in your files. The comment field in those ID3v1 tags contains the track number, but not encoded as a digit -- encoded as the literal number. In other words, for one of your tracks, the digit 7 should have the ASCII decimal value 55, so the comment should contain a string made of the character with value 55; but instead, it is composed of a string made of the character with value 7, which is a control character (bell).

Amarok was seeing this, even though you have an ID3v2 tag, because by default TagLib will create a meta-tag that has information from both tags, favoring v2. Since there was no comment in your ID3v2 tag but there was in your ID3v1 tag, it was pulling in that comment field.

Previous to 2.3.2 this would have been thrown out because we were only allowing some types of characters though ("printable" characters), which excluded things like tabs and newlines. This was a bug in and of itself because some tags contained newlines and tabs (lyrics and comments, for instance), and also newlines are valid characters in filenames. So people with newlines in their comments were seeing comments not show up in their tag information, and this was something we needed to fix.

What happens next is that the tags are converted to XML. Unfortunately, although there is a way in Qt to check whether the value you are writing is valid or not, the specific method we are using has a bug where that check is bypassed. So the invalid character was now being coded into XML against the XML standard.

On the other side, when the XML was being read in, the character was causing the parser to die due to invalid XML being given to it.

So for the immediate future I've reinstated some checks. They allow nearly every character through except the surrogates and control characters that the XML spec forbids.

On the Amarok side, we need to decide whether to have TagLib only read ID3v2 tags. ID3v1 tags have so many problems that this is really a nice thing to do, but chances are there are still some people out there with only ID3v1 tags that will complain.

In the shorter term, you can either apply the patch that I'll commit (and send to this bug report), or you can strip ID3v1 tags from your files. I believe all of the problematic files are in the Nevermind directory, at least out of those you sent to me.
Comment 11 Silvian Cretu 2010-09-22 21:21:34 UTC
Hello guys,
I have the same problem. I have about 9000 songs in my collection. Amarok 2.3.1 scanned them all. Now, 2.3.2 (I've just upgraded; I use Kubuntu 10.04 with KDE 4.5.1) only "sees" about 5000 songs.
Comment 12 Jeff Mitchell 2010-09-23 00:14:25 UTC
Hi Silvian,

There will be a patch posted here that will fix this. If you read comment #10 you'll see that the cause is corrupted tags in files causing the parser to have problems...prior to 2.3.2 these were ignored, but valid characters were *also* ignored so people had problems (for instance) with files with lyrics or comments with newlines. As none of the testers had files with corrupt tags none of us encountered this problem prior to 2.3.2's release :-|

Once the patch is posted we'll ask packagers to rebuild the package with the patch; most of them probably will.
Comment 13 Jeff Mitchell 2010-09-23 00:20:18 UTC
commit 79d86829294ac54132c01153660e70e30c15c378
Author: Jeff Mitchell <mitchell@kde.org>
Date:   Wed Sep 22 18:15:17 2010 -0400

    Re-add some tests for unprintable but also invalid chars. Apparently Qt's XML classes don't properly check for invalid chars when writing XML, even if you tell them to.
    
    Also switch to QXmlStreamWriter, as apparently going forward it is the more supported class.
    
    BUG: 251762

diff --git a/utilities/collectionscanner/CollectionScanner.cpp b/utilities/collectionscanner/CollectionScanner.cpp
index 0a23a53..28c554b 100644
--- a/utilities/collectionscanner/CollectionScanner.cpp
+++ b/utilities/collectionscanner/CollectionScanner.cpp
@@ -37,13 +37,13 @@
 #include <QByteArray>
 #include <QDBusReply>
 #include <QDir>
-#include <QDomDocument>
 #include <QFile>
 #include <QtDebug>
 #include <QTextCodec>
 #include <QTextStream>
 #include <QTimer>
 #include <QThread>
+#include <QXmlStreamWriter>
 
 //Taglib:
 #include <apetag.h>
@@ -814,8 +814,10 @@ CollectionScanner::readTags( const QString &path, TagLib::AudioProperties::ReadS
 void
 CollectionScanner::writeElement( const QString &name, const AttributeHash &attributes )
 {
-    QDomDocument doc; // A dummy. We don't really use DOM, but SAX2
-    QDomElement element = doc.createElement( name );
+    QString text;
+    QXmlStreamWriter writer( &text );
+
+    writer.writeStartElement( name );
 
     QHashIterator<QString, QString> it( attributes );
     while( it.hasNext() )
@@ -829,7 +831,15 @@ CollectionScanner::writeElement( const QString &name, const AttributeHash &attri
         bool noCategory = false;
         for( unsigned i = 0; i < len; i++ )
         {
-            if( data[i].category() == QChar::NoCategory )
+            if( data[i].category() == QChar::NoCategory ||
+                data[i].category() == QChar::Other_Surrogate ||
+                ( 
+                    data[i].unicode() < 20 &&
+                    data[i].unicode() != 9 &&
+                    data[i].unicode() != 10 &&
+                    data[i].unicode() != 13
+                )
+            )
             {
                 noCategory = true;
                 break;
@@ -838,15 +848,12 @@ CollectionScanner::writeElement( const QString &name, const AttributeHash &attri
 
         if( noCategory )
             continue;
-
-        element.setAttribute( it.key(), it.value() );
+        writer.writeAttribute( it.key(), it.value() );
     }
 
-    QString text;
-    QTextStream stream( &text, QIODevice::WriteOnly );
-    element.save( stream, 0 );
+    writer.writeEndElement();
 
-    std::cout << text.toUtf8().data() << std::endl;
+    std::cout << text.toUtf8().data() << std::endl << std::endl;
 }
 
 // taken verbatim from Qt's sources, since it's stupidly in the QtGui module
Comment 14 Jeff Mitchell 2010-09-23 00:25:50 UTC
Sorry, I forgot it doesn't add the diff as an attachment. Here it is:

http://gitweb.kde.org/amarok/amarok.git/commitdiff_plain/79d86829294ac54132c01153660e70e30c15c378?hp=fd2a40d970c57fa2102e95de1a60c59e37892638

If some of those on this report could try this patch out and report back as to if it fixes your problem, that'd be great.

Thanks!
Comment 15 Kelv 2010-09-23 14:39:50 UTC
Yes it seems all is fixed now on the latest git build and I am able to scan my whole collection in. Thanks a lot!
Comment 16 Myriam Schweingruber 2010-09-23 16:21:50 UTC
*** Bug 252136 has been marked as a duplicate of this bug. ***
Comment 17 Marian Kyral 2010-09-23 17:13:46 UTC
(In reply to comment #14)
> Sorry, I forgot it doesn't add the diff as an attachment. Here it is:
> 
> http://gitweb.kde.org/amarok/amarok.git/commitdiff_plain/79d86829294ac54132c01153660e70e30c15c378?hp=fd2a40d970c57fa2102e95de1a60c59e37892638
> 
> If some of those on this report could try this patch out and report back as to
> if it fixes your problem, that'd be great.
> 
> Thanks!

Hi,
I can confirm that patch above applied to 2.3.2 fix this issue for me. Thanks.
Comment 18 Silvian Cretu 2010-09-24 13:15:07 UTC
An update has been pushed in Kubuntu's repositories as well. The issue is fixed on my side. Thank you!
Comment 19 wonderz 2010-10-25 09:51:30 UTC
I'm using Kubuntu 10.10 and can confirm that its repositories do not have a fixed version. Full collection scan only detects 1/5 of my tracks. Makes Amarok almost unusable for me. Is there a workaround?
Comment 20 Jeff Mitchell 2010-10-26 16:45:02 UTC
Apply the patch posted here or ask the Kubuntu maintainers to push out a version with the patch.
Comment 21 Jeff Mitchell 2010-10-26 18:14:22 UTC
Whoops -- this is the wrong bug report. There's a second patch (which I thought was the above-referenced patch) that will probably fix your problem -- it fixes a problem in the patch above. See 
http://gitweb.kde.org/amarok.git/blobdiff_plain/0b8577293698f88f2c6e5a1661ba570d0dd264b5..182f39c15575c3c1ce2b0425f9e3ec5997f3a9fd:/utilities/collectionscanner/CollectionScanner.cpp

That should probably fix your problem -- I'd be glad if you'd ask the Kubuntu maintainers to apply it.

Thanks,
Jeff
Comment 22 wonderz 2010-10-26 20:20:16 UTC
Thanks for your suggestions, me myself don't like to mess with source code. I solved the problem though, by removing all comments and suspicious album names from mp3 tags. Rescanned - and I have a full collection. Maybe this will help someone until the patch is applied in repositories.