Bug 224857

Summary: Konqueror goes 100% CPU on some pages (possibly plug-in related)
Product: [Applications] konqueror Reporter: P. Varet <p.varet>
Component: generalAssignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: adawit, johannes.hirte, maksim, squan, stefan.fleckenstein
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: Traceback from a kill -11 during the infinite loop.
Scheduler bug that can potentically cause 100% CPU usage
Proposed fix for scheduler infinite loop...

Description P. Varet 2010-01-30 11:55:42 UTC
Version:           4.4rc2 (using Devel)
OS:                Linux
Installed from:    Compiled sources

Konqueror sometimes goes in an endless loops and eats 100% CPU until killed with the 'kill' command.

The issue seems to be related to communications with the Flash plug-in.

IMPORTANT NOTE: I browse with plug-ins turned OFF in the configuration. Yet Konqueror loads the plug-ins regardless; apparently it just doesn't display them in the page.


REPRODUCIBILITY: Difficult. It happens to me now and then, but beside the common element of an undisplayed Flash block in the page there seems to be no surefire way to trigger the bug.


BUG DETAILS:

Here's what an strace -p reports:

clock_gettime(CLOCK_MONOTONIC, {497664, 589660685}) = 0
poll([{fd=3, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=49, events=POLLIN}], 6, 0) = 0 (Timeout)
read(7, 0x1466574, 4096)                = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {497664, 589741142}) = 0
clock_gettime(CLOCK_MONOTONIC, {497664, 589779415}) = 0
clock_gettime(CLOCK_MONOTONIC, {497664, 589804279}) = 0
clock_gettime(CLOCK_MONOTONIC, {497664, 589841434}) = 0
clock_gettime(CLOCK_MONOTONIC, {497664, 589866577}) = 0
clock_gettime(CLOCK_MONOTONIC, {497664, 589903733}) = 0
read(7, 0x1466574, 4096)                = -1 EAGAIN (Resource temporarily unavailable)


More information about that file descriptor 7:
$ ls -l /proc/29883/fd/7
lrwx------ 1 xxx xxx 64 2010-01-30 11:43 /proc/29883/fd/7 -> socket:[419928]

$ cat /proc/29883/fdinfo/7
pos:    0
flags:  04002


Konqueror seems to fail to communicate with its nspluginviewer, but instead of aborting, it keeps trying in an endless loop.

Next time it happens I'll try to kill -11 the process to generate a traceback and try to see where the endless loop is happening.
Comment 1 P. Varet 2010-01-30 12:06:29 UTC
Created attachment 40379 [details]
Traceback from a kill -11 during the infinite loop.

Whew, didn't take long.
Apparently the issue may have to do with KIO scheduling.
Comment 2 Maksim Orlovich 2010-01-30 19:54:40 UTC
That code was introduced in r1071917. adawit?
Comment 3 Tommi Tervo 2010-01-30 20:11:05 UTC
*** Bug 224827 has been marked as a duplicate of this bug. ***
Comment 4 squan 2010-01-31 01:06:22 UTC
Tommi suggested that bug 224827 is a duplicate of this.
Since there I reported 100% reproducability for visiting imdb.com I straced that. This turns out that konqueror is frantically reading fd 7 in alternation with a poll on a set of several other descriptors. Here the part which gets endlessly repeated:
read(7, 0x8072198, 4096)                = -1 EAGAIN (Resource temporarily unavailable)
read(7, 0x8072198, 4096)                = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=20, events=POLLIN}, {fd=13, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=19, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=38, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 19, 0) = 0 (Timeout)

If I can believe my strace output then file descriptor is a connection to the X-server:
socket(PF_FILE, SOCK_STREAM, 0)         = 7
connect(7, {sa_family=AF_FILE, path=@"/tmp/.X11-unix/X0"}, 20) = 0
getpeername(7, {sa_family=AF_FILE, path=@"/tmp/.X11-unix/X0"...}, [20]) = 0

Also konqueror stdout (luckily KDE betas are verbose on that) has something interesting:
nspluginviewer(30985)/nspluginviewer (Qt/Xt) main: 4 - create XtEvents and GlibEvents
nspluginviewer(30985)/nspluginviewer (Qt/Xt) main: 5 - dbus requestName
nspluginviewer(30985)/nspluginviewer (Qt/Xt) main: 6 - new NSPluginViewer
Connecting to deprecated signal QDBusConnectionInterface::serviceOwnerChanged(QString,QString,QString)
Comment 5 Tommi Tervo 2010-02-02 12:40:01 UTC
*** Bug 225256 has been marked as a duplicate of this bug. ***
Comment 6 Maksim Orlovich 2010-02-03 17:38:50 UTC
SVN commit 1084737 by orlovich:

The connection limit is clearly broken --- once we hit it, 
the scheduler spins at 100% CPU (startTimer(0)?) and doesn't
appear to actually make any progress, either. So, disable 
this for 4.4.0 (if I made it) as a workaround.

Might want to consider using the scheduler from trunk for 4.4.1...

CCMAIL: adawit@kde.org
BUG: 224857


 M  +2 -1      scheduler.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1084737
Comment 7 squan 2010-02-03 21:36:55 UTC
> So, disable this for 4.4.0 (if I made it) as a workaround
Dear users, this is not meant as an instruction for you on how to make a workaround. Instead this means that a change to kdelibs has been applied as a quick fix for the 4.4 release.
Comment 8 Dawit Alemayehu 2010-02-04 00:00:23 UTC
Re-opened... The problem needs to be fixed correctly. If not for 4.4.0 for 4.4.1... Breaking one workaround for another one is not a solution. 

Can anyone one of you who run into this problem please tell share the spec of your machine ? I believe the issue stems from the problems identified above by several people but I cannot directly duplicate the issue at all. Perhaps on faster or slower machine the problem becomes apparent ?

I have already identified the culprit in this code that can potentially be the cause this issue. However, I am unable to duplicate the problems as described in this ticket or the ones that were marked duplicate of this ticket.
Comment 9 Maksim Orlovich 2010-02-04 00:42:39 UTC
It may make more sense to backport the trunk scheduler once proven stable than to spend more time on this one.
Comment 10 Dawit Alemayehu 2010-02-04 01:51:07 UTC
(In reply to comment #9)
> It may make more sense to backport the trunk scheduler once proven stable than
> to spend more time on this one.

That may be true for the long term. However, the fix you provided is not a solution for anyone using kdewebkit. It causes an explosion of slaves to be launched for simple requests. And I mean in the hundereds or possible thousands of ioslaves being launched for visiting few sites in several tabs.  As such this problem has to be addressed properly right away.

Anyhow, I already have a patch for the problem and I will post it here for people to test. Hopefully this time someone can provide a feedback to see if this problem is fixed with this patch...
Comment 11 Dawit Alemayehu 2010-02-04 01:55:47 UTC
Created attachment 40519 [details]
Scheduler bug that can potentically cause 100% CPU usage

This patch should address the high CPU usage reported in this bug. Need feedback from those that run into this issue to see if this proposed patch fixes your issues.

Thanks...
Comment 12 Maksim Orlovich 2010-02-04 03:53:30 UTC
That's not a fix, and it doesn't address the jobs not actually being dispatched.
Comment 13 Dawit Alemayehu 2010-02-04 04:50:02 UTC
(In reply to comment #12)
> That's not a fix, and it doesn't address the jobs not actually being
> dispatched.

It is... just not the whole or even the most important fix! I only found that out after I posted the patch by testing with the limits in my http(s) protocol files set to very very low values (4 max and 2 max per host). I made the timer change because firing the timer as soon as events are done (slaveTimer.start(0)) means recursively executing startStep without any pause. Image what happens if you have lots of queued up jobs. Makes no sense to add a simple delay of 1 sec. there. Specially since that point will never be reach if a single job is scheduled properly.

Anyhow that is obviously not the cause of the actual bug in the first place... The cause is as you mentioned the queued up jobs not being dispatched at all and the scheduler ending up in an infinite loop. That happens because jobs handled through startJobDirect are also incorrectly taken into consideration when the function checkLimits is called to enforce the limits. Since startJobDirect does not care or obey the limits from the protocols files, the number of active slaves can easily exceed the maximum amount specified there. 

Wait it gets even worse... When and how you encounter the bug can get even more complicated... Depending on what sites you visit in what sequence and how long those sites will keep connections alive as well as the timeout used to reclaim an ioslave all play a role in whether or not you will see this bug!

Anyhow, I already have a fix for the above problem as well and I will post it once I am satisfied with the results...
Comment 14 Dawit Alemayehu 2010-02-04 06:01:40 UTC
Comment on attachment 40519 [details]
Scheduler bug that can potentically cause 100% CPU usage

Obseleted by the fix iteration #2...
Comment 15 Dawit Alemayehu 2010-02-04 06:09:57 UTC
Created attachment 40520 [details]
Proposed fix for scheduler infinite loop...

Tested with both konqueror+khtml and konqueror+kwebkitpart plus very very small values for the maximum limits of ioslaves. The infinite loop should be addressed with this patch. Please test and provide feedback...
Comment 16 Johannes Hirte 2010-02-06 17:11:24 UTC
(In reply to comment #15)
> Created an attachment (id=40520) [details]
> Proposed fix for scheduler infinite loop...
> 
> Tested with both konqueror+khtml and konqueror+kwebkitpart plus very very small
> values for the maximum limits of ioslaves. The infinite loop should be
> addressed with this patch. Please test and provide feedback...

Seems that this patch has fixed it for me.
Comment 17 squan 2010-02-08 21:31:40 UTC
> please tell share the spec of your machine ?
I experience that on two out of two machines with 4.4 RC2 from opensuse 11.2 KDE repositories: an elder notebook and a dual core PC.
As already said with high reproducibility for imdb.com (if not on the home page then by following any page link).

> very very small values for the maximum limits of ioslaves
How can this be configured?
Comment 18 Dawit Alemayehu 2010-02-10 08:04:37 UTC
(In reply to comment #17)
> > please tell share the spec of your machine ?
> I experience that on two out of two machines with 4.4 RC2 from opensuse 11.2
> KDE repositories: an elder notebook and a dual core PC.
> As already said with high reproducibility for imdb.com (if not on the home page
> then by following any page link).
> 
> > very very small values for the maximum limits of ioslaves
> How can this be configured?

You can change the "maxSlaveInstances" and "maxSlaveInstancesPerHost" parameters in http{s}.protocol files. Simply search your system for those files and change the values for these two parameters.

Having said that my suggestion to you is to update to the final 4.4.0 release. It contains a temporary workaround for 4.4.0 and the next bug fix release, 4.4.1, will have correct fix for that will fix this issue...
Comment 19 Dawit Alemayehu 2010-02-10 08:07:18 UTC
SVN commit 1088091 by adawit:

Fixed bug# 224857 correctly and removed the temporary workaround for
4.4.1.

BUG: 224857


 M  +29 -18    scheduler.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1088091