Bug 286516

Summary: Nepomuk generating statements Virtuoso cannot complete
Product: [Unmaintained] nepomuk
Reporter: Ryan Rix <ry>
Component: storage
Assignee: Sebastian Trueg <sebastian>
Status: RESOLVED FIXED
Severity: normal
CC: ismail, trueg, wstephenson
Priority: NOR
Version: git master
Target Milestone: ---
Platform: Compiled Sources
OS: Linux
Latest Commit:
Version Fixed In: 4.7.4
Sentry Crash Report:
Bug Depends on:
Bug Blocks: 293421

Description Ryan Rix 2011-11-13 18:38:04 UTC
Version:           git master (using Devel) 
OS:                Linux

Nepomuk generates queries that Virtuoso cannot finish processing, and thus uses 100% of all available cores.

|       727944 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|       687050 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|      3413120 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|      5501689 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|      1295661 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|           52 status() |
|      2805864 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |
|      3190447 sparql select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<ht |

From isql

[/home/rrix/dev/install/usr/bin/nepomukservicestub] nepomukstorage(30280) Nepomuk::Sync::ResourceIdentifier::runIdentification: "select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#emailAddress>) ). ?r a <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#EmailAddress> .  optional { ?r <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#emailAddress> ?o0 . } . filter(!bound(?o0) || ?o0="mailman-bounces@lists.fedoraproject.org"^^<http://www.w3.org/2001/XMLSchema#string>). filter(  bound(?o0) ) . } order by desc(?cnt)"

From konsole output matching the time every one of those runaway queries starts

http://lxr.kde.org/source/kde/kdelibs/nepomuk-core/services/backupsync/lib/resourceidentifier.cpp#169 is the method that generates that query. 

Reproducible: Always

Steps to Reproduce:
Start nepomuk, wait.


Actual Results:  
virtuoso-t begins to spin all cores at 100%

Expected Results:  
Not... doing that :)
Comment 1 Ryan Rix 2011-11-13 18:47:48 UTC
Also, virtuoso versions: 

virtuoso-opensource-6.1.4-2.fc16.i686
virtuoso-opensource-apps-6.1.4-2.fc16.i686
virtuoso-opensource-utils-6.1.4-2.fc16.i686
virtuoso-opensource-doc-6.1.4-2.fc16.noarch
virtuoso-opensource-conductor-6.1.4-2.fc16.noarch

From Fedora 16 i686
Comment 2 Sebastian Trueg 2011-11-16 10:08:17 UTC
Very nice report. Thanks a lot. It was already on my todo list. :)
Comment 3 Ryan Rix 2011-11-16 10:34:34 UTC
Great to hear :) I spent a few hours with PvK trying to get things narrowed down, so I'm glad to see it's already being taken care of. Thanks Sebas, you rock hard :)
Comment 4 Sebastian Trueg 2011-11-24 17:10:13 UTC
Git commit 41ecd6d72b242c856153c93dd4a2efaec3c2d8e2 by Sebastian Trueg.
Committed on 24/11/2011 at 18:02.
Pushed by trueg into branch 'master'.

Performance optimization in resource identification.

In case only a single identifying property exists we get a much much
much much faster query when avoiding all the optional and filter
terms.
Since this is the case for email identification this optimization
does actually make a difference in email indexing.

BUG: 286516

M  +31   -13   services/backupsync/lib/resourceidentifier.cpp

http://commits.kde.org/nepomuk-core/41ecd6d72b242c856153c93dd4a2efaec3c2d8e2
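
To illustrate the optimization described in the commit message, here is a minimal Python sketch of the two query shapes seen in this report. It is not the code from resourceidentifier.cpp: the function name build_identification_query and its parameters are hypothetical, and each literal is assumed to arrive already serialized as a SPARQL term.

```python
def build_identification_query(rdf_type, properties):
    """Build an identification query (illustrative sketch only).

    rdf_type   -- full IRI of the resource type
    properties -- list of (property_iri, serialized_literal) pairs
    """
    if not properties:
        raise ValueError("at least one identifying property is required")

    prop_list = ",".join(f"<{p}>" for p, _ in properties)
    head = (
        "select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. "
        f"filter( ?p in ({prop_list}) ). ?r a <{rdf_type}> . "
    )

    if len(properties) == 1:
        # Fast path: a single identifying property becomes a plain
        # triple pattern, with no optional or filter terms.
        prop, value = properties[0]
        body = f"?r <{prop}> {value} . "
    else:
        # General path: one optional block plus filter per property,
        # and a final filter requiring at least one of them to be bound.
        parts = []
        for i, (prop, value) in enumerate(properties):
            parts.append(
                f"optional {{ ?r <{prop}> ?o{i} . }} . "
                f"filter(!bound(?o{i}) || ?o{i}={value}). "
            )
        bound = " || ".join(f"bound(?o{i})" for i in range(len(properties)))
        parts.append(f"filter( {bound} ) . ")
        body = "".join(parts)

    return head + body + "} order by desc(?cnt)"
```

With one identifying property the sketch emits a direct triple pattern, which is the shape of the email address query quoted in comment 12. With two or more it falls back to the optional/filter form, as in the PersonContact query in comment 10.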
Comment 5 Sebastian Trueg 2011-11-24 17:11:43 UTC
Git commit 08683854eab048ff76b188233b7285f3e6234810 by Sebastian Trueg.
Committed on 24/11/2011 at 18:11.
Pushed by trueg into branch 'master'.

Performance optimization in resource identification.

In case only a single identifying property exists we get a much much
much much faster query when avoiding all the optional and filter
terms.
Since this is the case for email identification this optimization
does actually make a difference in email indexing.

CCBUG: 286516

M  +31   -13   nepomuk/services/backupsync/lib/resourceidentifier.cpp

http://commits.kde.org/kde-runtime/08683854eab048ff76b188233b7285f3e6234810
Comment 6 Sebastian Trueg 2011-11-24 17:16:39 UTC
Git commit e8fa5d5cee2070cccab5286cd0859baee07a618e by Sebastian Trueg.
Committed on 24/11/2011 at 18:11.
Pushed by trueg into branch 'KDE/4.7'.

Performance optimization in resource identification.

In case only a single identifying property exists we get a much much
much much faster query when avoiding all the optional and filter
terms.
Since this is the case for email identification this optimization
does actually make a difference in email indexing.

CCBUG: 286516

M  +31   -13   nepomuk/services/backupsync/lib/resourceidentifier.cpp

http://commits.kde.org/kde-runtime/e8fa5d5cee2070cccab5286cd0859baee07a618e
Comment 7 Will Stephenson 2011-11-25 11:56:16 UTC
This commit is to the backupsync service in kde-runtime - how does it affect email indexing, happening in kdepim-runtime and talking to the storage service?
Comment 8 Sebastian Trueg 2011-11-25 12:02:23 UTC
(In reply to comment #7)
> This commit is to the backupsync service in kde-runtime - how does it affect
> email indexing, happening in kdepim-runtime and talking to the storage service?

There is a weird dependency between storage and backup service.
Comment 9 Ryan Rix 2011-11-28 08:35:20 UTC
I am still having this issue, perhaps with a different filter query this time around. I will reopen this when I can find which query is causing it.
Comment 10 Ryan Rix 2011-12-05 08:10:11 UTC
nepomukservices A628AB40 ENTER SQLExecDirect
                SQLHSTMT          0x9fcae70
                SQLCHAR         * 0xa0d7f30
                                  | sparql select distinct ?r count(?p) as ? |
                                  | cnt where { ?r ?p ?o. filter( ?p in (<ht |
                                  | tp://www.semanticdesktop.org/ontologies/ |
                                  | 2007/08/15/nao#prefLabel>,<http://www.se |
                                  | manticdesktop.org/ontologies/2007/03/22/ |
                                  | nco#fullname>) ). ?r a <http://www.seman |
                                  | ticdesktop.org/ontologies/2007/03/22/nco |
                                  | #PersonContact> .  optional { ?r <http:/ |
                                  | /www.semanticdesktop.org/ontologies/2007 |
                                  | /08/15/nao#prefLabel> ?o0 . } . filter(! |
                                  | bound(?o0) || ?o0="Alugue Temporada SP") |
                                  | .  optional { ?r <http://www.semanticdes |
                                  | ktop.org/ontologies/2007/03/22/nco#fulln |
                                  | ame> ?o1 . } . filter(!bound(?o1) || ?o1 |
                                  | ="Alugue Temporada SP"^^<http://www.w3.o |
                                  | rg/2001/XMLSchema#string>). filter(  bou |
                                  | nd(?o0) ||  bound(?o1) ) . } order by de |
                                  | sc(?cnt)                                 |

Looks like it's another query that is causing this issue for me. I hate to re-open bugs on you, but this one is missing a SQL_SUCCESS ;)
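
The trace above wraps the statement into fixed-width chunks between `| ` and ` |`, which makes it awkward to paste into a query tool. A small Python sketch for joining the chunks back into one string follows. The helper name join_trace_chunks is hypothetical, and it assumes every chunk line has the `| chunk |` shape shown here.

```python
def join_trace_chunks(lines):
    """Rebuild a statement from the '| chunk |' lines of an ODBC trace."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        # Only consider lines framed by pipes, e.g. "| sparql select ... |".
        if len(line) >= 4 and line.startswith("| ") and line.endswith(" |"):
            # Drop the "| " and " |" frame but keep the inner spacing,
            # since words are split across chunk boundaries.
            chunks.append(line[2:-2])
    # The last chunk is padded with spaces up to the column width.
    return "".join(chunks).rstrip()
```

Lines that are not framed by pipes, such as the SQLHSTMT and SQLCHAR lines, are skipped, so the whole trace entry can be passed in unchanged.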
Comment 11 Ryan Rix 2011-12-05 08:12:06 UTC
Oh wait, that one does have a SQL_SUCCESS >.< Let's see if I can find the right one while it's still in my trace.
Comment 12 Ryan Rix 2011-12-05 08:22:19 UTC
                                  | sparql select distinct ?r count(?p) as ? |
                                  | cnt where { ?r ?p ?o. filter( ?p in (<ht |
                                  | tp://www.semanticdesktop.org/ontologies/ |
                                  | 2007/03/22/nco#emailAddress>) ). ?r a <h |
                                  | ttp://www.semanticdesktop.org/ontologies |
                                  | /2007/03/22/nco#EmailAddress> . ?r <http |
                                  | ://www.semanticdesktop.org/ontologies/20 |
                                  | 07/03/22/nco#emailAddress> "ry@n.rix.si" |
                                  | ^^<http://www.w3.org/2001/XMLSchema#stri |
                                  | ng> . } order by desc(?cnt)              |

This one is not completing for some reason; I'm not exactly sure why or how to look into it further. Any debugging help would be greatly appreciated :)
Comment 13 Sebastian Trueg 2011-12-05 09:34:01 UTC
(In reply to comment #12)
>                                   | sparql select distinct ?r count(?p) as ? |
>                                   | cnt where { ?r ?p ?o. filter( ?p in (<ht |
>                                   | tp://www.semanticdesktop.org/ontologies/ |
>                                   | 2007/03/22/nco#emailAddress>) ). ?r a <h |
>                                   | ttp://www.semanticdesktop.org/ontologies |
>                                   | /2007/03/22/nco#EmailAddress> . ?r <http |
>                                   | ://www.semanticdesktop.org/ontologies/20 |
>                                   | 07/03/22/nco#emailAddress> "ry@n.rix.si" |
>                                   | ^^<http://www.w3.org/2001/XMLSchema#stri |
>                                   | ng> . } order by desc(?cnt)              |
> 
> Is not completing, for some reason, not exactly sure why or how to look in to
> it further. Any debugging help would be greatly appreciated :)

Are you sure this is it? This one is lightning fast here. Can you maybe try it in nepomukshell? For simplicity, here is the cleaned-up query:

select distinct ?r count(?p) as ?cnt where { ?r ?p ?o. filter( ?p in (nco:emailAddress) ). ?r a nco:EmailAddress . ?r nco:emailAddress "ry@n.rix.si"^^xsd:string . } order by desc(?cnt)
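
For anyone who wants to clean up other traced queries the same way, here is a rough Python sketch that abbreviates full IRIs to prefixed names. The helper compact_iris and the PREFIXES table are hypothetical, and the result is only valid where the tool running it declares the same prefixes, which is assumed here for nepomukshell.

```python
import re

# Namespace IRIs as they appear in the queries in this report.
PREFIXES = {
    "nco": "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#",
    "nao": "http://www.semanticdesktop.org/ontologies/2007/08/15/nao#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

def compact_iris(query, prefixes=PREFIXES):
    """Replace <full-iri> terms with prefix:local where a prefix matches."""
    def shorten(match):
        iri = match.group(1)
        for name, namespace in prefixes.items():
            if iri.startswith(namespace):
                local = iri[len(namespace):]
                # Only abbreviate simple local names; leave anything else alone.
                if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", local):
                    return f"{name}:{local}"
        return match.group(0)

    return re.sub(r"<([^<>\s]+)>", shorten, query)
```

IRIs from namespaces that are not in the table, such as the aneo one, are left in their full form.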
Comment 14 Ryan Rix 2011-12-06 08:56:12 UTC
I ran that query in nepsak three times, I now have these in isql:


|       522069 sparql prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> p |
|       596383 sparql prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> p |
|       505492 sparql prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> p |

Does nepsak rewrite or translate these queries at all when they are run?


                                  | sparql prefix nco:<http://www.semanticde |
                                  | sktop.org/ontologies/2007/03/22/nco#>SEL |
                                  | ECT DISTINCT ?person WHERE {   graph ?g  |
                                  | {     ?person <http://akonadi-project.or |
                                  | g/ontologies/aneo#akonadiItemId> ?itemId |
                                  |  .     ?person a nco:PersonContact ;     |
                                  |          nco:hasEmailAddress ?email .    |
                                  |   ?email nco:emailAddress "ry@n.rix.si"^ |
                                  | ^<http://www.w3.org/2001/XMLSchema#strin |
                                  | g> .   } }                               |

Does that make sense at all, or am I chasing my tail in the wrong direction because it's 2:00 and I've been working since 10? :)
Comment 15 Sebastian Trueg 2011-12-06 09:35:33 UTC
This is rather confusing since these are pretty simple queries that complete in no time for me. Did you finally see any results in nepomukshell?
Comment 16 Sebastian Trueg 2011-12-06 16:05:30 UTC
Git commit 2936c781f01614a2e5c01f558e2e0f36affc0739 by Sebastian Trueg.
Committed on 24/11/2011 at 18:02.
Pushed by trueg into branch 'symlinkHandling'.

Performance optimization in resource identification.

In case only a single identifying property exists we get a much much
much much faster query when avoiding all the optional and filter
terms.
Since this is the case for email identification this optimization
does actually make a difference in email indexing.

BUG: 286516

M  +31   -13   services/backupsync/lib/resourceidentifier.cpp

http://commits.kde.org/nepomuk-core/2936c781f01614a2e5c01f558e2e0f36affc0739
Comment 17 Ryan Rix 2011-12-19 08:49:46 UTC
After about 600000 ms (10 minutes) it still has not completed :(

|       674988 sparql prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> p
Comment 18 Ryan Rix 2011-12-24 18:16:51 UTC
So I backed up what I could and nerfed my nepomuk DB earlier this week, to see if this exists on a 'fresh' database, or if I'd had some graphs that crept in that virtuoso couldn't digest. I don't have this problem any more, so I guess RESOLVED FIXED :) but I do have one or two new ones that I'll report separately after searching.

happy holidays, btw, nepomukhackers :)
Comment 19 Ryan Rix 2012-01-04 00:23:52 UTC
I've found another one that's doing this, this time for nepomukqueryservice:

select distinct ?r where { { ?r a <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#Tag> . ?v2 <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#hasTag> ?r . ?v3 <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#hasTag> ?r . } . ?r <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#userVisible> ?v1 . FILTER(?v1>0) . } ORDER BY DESC ( count(?v3) ) LIMIT 6