94703 – random resolver failures "Unknown host ...."

Status:	RESOLVED FIXED

Alias:	None

Product:	kdelibs
Classification:	Unmaintained
Component:	general
Version:	3.3.1
Platform:	Debian testing Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Thiago Macieira

Duplicates (2):	89613 99254

Reported:	2004-12-08 22:06 UTC by Gregory Stark
Modified:	2005-04-14 14:53 UTC
CC List:	4 users

Description Gregory Stark 2004-12-08 22:06:43 UTC

Version:           3.3.1 (using KDE 3.3.1)
Installed from:    Debian testing/unstable Packages
OS:                Linux

For example, I just had to reload five times to get this page. I kept getting "Unknown host bugs.kde.org". On the fifth reload it worked fine.

It seems to happen in bursts. Once it starts happening, it happens for a whole bunch of images on a page, or for a page repeatedly every time I reload, until the problem subsides for a while.

I suspect the problem is related to how konqueror spams the name servers with a whole ton of queries for every image on the page. It doesn't cache results even for a single page rendering. I suspect the name server is rate-limiting responses to a single requestor either because of a bug in bind or as a defense against DDOS.

But I'm having trouble debugging this. I attached to konqueror with gdb and set breakpoints on gethostbyname and getaddrinfo expecting to be able to catch the failure to see what happened. But the breakpoint never triggered. How is konqueror managing to use the resolver without calling any resolver functions?

Comment 1 Roman Fietze 2005-01-25 12:58:44 UTC

Having a similar problem even for pages with not so many images (e.g. www.google.de). Trying the same page with Firefox retrieves the page without any problem every time, and doing a "host" or "nslookup" gives no errors at all, but trying the same with konqueror gives me "An error occurred while loading http://www.google.com/: Unknown host www.google.de" roughly every second to fourth attempt. Same machine, same user, same time, same everything.

This is with 3.3.2 Level "a" from the SuSE RPMs.

Comment 2 Thiago Macieira 2005-01-25 14:49:51 UTC

I can't reproduce.

Comment 3 Thiago Macieira 2005-02-13 02:46:04 UTC

Please reopen if you can reproduce with KDE 3.4 beta 2.

Comment 4 Thiago Macieira 2005-02-13 20:57:24 UTC

*** Bug 99254 has been marked as a duplicate of this bug. ***

Comment 5 Klaus S. Madsen 2005-03-22 11:19:39 UTC

I see the same thing both in KDE 3.3.2 and KDE 3.4. It especially happens when a page reloads itself (we have a network monitoring system, which refreshes the page every 30 seconds). If you create a page which reloads itself every 30 seconds you should see the error after a while. It seems random how quickly the bug shows up, but it usually happens a couple of times a day.

Comment 6 Thiago Macieira 2005-03-22 11:29:25 UTC

Is it possible to tell me what the DNS traffic was at the time the page failed to reload?

I think the problem is in your DNS server and/or Internet connection.

Comment 7 Klaus S. Madsen 2005-03-22 17:42:58 UTC

I can set something up, but I won't have time for it until next week. Also, it can take more than an hour for the problem to arise. I can gather the data, but I can't see any way to correlate it with the time at which the browser fails. Any ideas? By the way, I know that there is no AAAA record for the domain. (I have read some reports where the cause was a missing IPv6 network; that isn't the case here.)

However, normally when I encounter it, it happens with an internal DNS server on the LAN, which isn't loaded. When it happens, I can press reload and the page appears, so it doesn't seem to come in bursts as it does for the original poster (I missed that when I wrote the first time).

The problem started around version 3.3 (I think), and we have seen it on two different Gentoo installations, installed through emerge, and one Debian Woody installation, created with konstruct. It happens for different pages, but having a page which refreshes every 30 seconds seems to make it happen more often. A page like the one it usually happens on can be seen here:

http://www.emdrupborg.dk/sysorb/index.cgi?path=1.1&tld=Connectivity&username=viewer&passwd=viewer&server=localhost:3241

But it also happens for other pages, even some without images.

Comment 8 Thiago Macieira 2005-03-22 18:46:19 UTC

I don't need a traffic dump (tcpdump -w). A copy & paste of tcpdump's normal output will suffice (tcpdump -pn port 53). So, leave it running in the background. When the problem occurs, copy & paste the last screenful or so; that should be enough to indicate why the resolution failed.

If I am right, you will see unresponded queries.

There was a bug in glibc that caused DNS failures, but it should only affect people using DNS servers reached by IPv6.

Comment 9 Volker Kuhlmann 2005-03-22 21:23:24 UTC

Let me give a bit more info too. This problem is difficult because of
its transient nature. The replies so far indicate it's not
distro-specific (for me, all of SuSE 8.2, 9.1, 9.2). I decided to
concentrate on DNS resolving issues, although I am also pointing a
finger at KDE because mozilla has never for me shown this total failure
on these (I assume) resolver problems.

In my case I seriously doubt a problem with the net connection - I have
an extremely reliable cable connection. It is possible that the ISP's
name server was overloaded and randomly responded with "no data". That
there was a response is clear, if there hadn't been a response konqueror
would have sat idling until timeout but the no-domain error came as soon
as I clicked the link. It's also clear that the error "domain doesn't
exist" itself is utter bollocks. I tried 3 different ISPs' name servers,
and observed konqueror failures with all of them. Sorry, but I find that
hard to believe. I then set up a caching DNS on the local workstation
(bind 9, 60 seconds work in yast). No difference, regardless of which
ISP's name server I forwarded to. I then deleted all name servers in the
bind9 config, forcing resolution through root servers. No difference.
Problems with external name servers? Uhhhhm, I don't think so.

Summary: konqueror dies with bogus name resolution failures. The KDE
developers must understand that THIS PROBLEM IS CAUSED ENTIRELY ON THE
LOCAL WORKSTATION. STOP BLAMING OUTSIDE NAME SERVERS.

I then disabled ipv6 in the kernel (not so easy, as most instructions
are wrong for kernel 2.6, and SuSEfirewall2 in its default setting
forces loading of ipv6 modules, which are impossible to unload once
loaded). Much better, but I still got errors. Back to the locally
caching bind9 forwarding to ISP name server. I've seen no problems
since.

My conclusion: Either KDE/konqueror doesn't work with ipv6 (the claimed
fix is bogus) and it's still not working properly with ipv4 either, or
else the problem is something else but for some reason it shows up
more often (but not only) in ipv6. It also seems to be restricted to
KDE.

Is there any debugging I could do, given the above situation?

HTH,

Volker

Comment 10 Thiago Macieira 2005-03-22 22:00:31 UTC

Reopening the bug report. I am now convinced it's a local error. But please understand the situation: we use the standard name-resolution calls, the very ones Mozilla uses. So, in theory, either both should work, or both should fail. 

The only difference is that we do two calls at once, simultaneously (in threads), while Mozilla sends the two queries in series, one after the other. So, again, in theory, we should even be faster by tens to hundreds of milliseconds under normal circumstances.

Just to be sure: in your /etc/resolv.conf, have you ever had an IPv6 nameserver (i.e., a "nameserver ::1" or similar line)? However, if that were the problem, you'd be having issues in Mozilla as well.

Are you using KDE 3.4.0?
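
A minimal C++ sketch of the difference described in this comment, for illustration only; it is not KDE or Mozilla code. `Family`, `Lookup`, `systemLookup`, `resolveSerial` and `resolveParallel` are hypothetical names, and the assumption that the two calls are one IPv4 and one IPv6 lookup is not stated in the comment, which only says that two calls are made. `getaddrinfo` and `getnameinfo` are the standard POSIX calls.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

enum class Family { IPv4, IPv6 };

// A lookup takes a host name and an address family and returns textual addresses.
using Lookup = std::function<std::vector<std::string>(const std::string&, Family)>;

// Hypothetical helper: one blocking lookup through the standard resolver.
std::vector<std::string> systemLookup(const std::string& host, Family family) {
    addrinfo hints{};
    hints.ai_family = (family == Family::IPv4) ? AF_INET : AF_INET6;
    hints.ai_socktype = SOCK_STREAM;

    addrinfo* list = nullptr;
    std::vector<std::string> out;
    if (getaddrinfo(host.c_str(), nullptr, &hints, &list) != 0)
        return out;  // lookup failed: empty result

    for (addrinfo* ai = list; ai != nullptr; ai = ai->ai_next) {
        char text[NI_MAXHOST];
        if (getnameinfo(ai->ai_addr, ai->ai_addrlen, text, sizeof text,
                        nullptr, 0, NI_NUMERICHOST) == 0)
            out.push_back(text);
    }
    freeaddrinfo(list);
    return out;
}

// Series: the second query starts only after the first has returned.
std::vector<std::string> resolveSerial(const Lookup& lookup, const std::string& host) {
    assert(lookup != nullptr);
    std::vector<std::string> all = lookup(host, Family::IPv4);
    std::vector<std::string> v6 = lookup(host, Family::IPv6);
    all.insert(all.end(), v6.begin(), v6.end());
    return all;
}

// Parallel: the IPv6 query runs in a helper thread while the IPv4 query
// runs in the calling thread; the results are merged once both are done.
std::vector<std::string> resolveParallel(const Lookup& lookup, const std::string& host) {
    assert(lookup != nullptr);
    std::future<std::vector<std::string>> pending =
        std::async(std::launch::async, lookup, host, Family::IPv6);
    std::vector<std::string> all = lookup(host, Family::IPv4);
    std::vector<std::string> v6 = pending.get();  // blocks until the helper is done
    all.insert(all.end(), v6.begin(), v6.end());
    return all;
}
```

With `systemLookup` passed as the lookup, `resolveParallel(systemLookup, "bugs.kde.org")` would issue both queries at once. Both variants return the same list; the parallel one only saves the latency of the faster query, and it adds the need to coordinate the threads correctly.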

Comment 11 Volker Kuhlmann 2005-03-22 23:53:59 UTC

Sorry for not saying, thought I'd started this report.

I have the current updates for SuSE 9.2, which are KDE 3.3.0. I don't use the KDE packages from supplementary. The problem has been the same with earlier versions of SuSE and KDE, I think going back to 8.2 / KDE 3.1.1.

My /etc/resolv.conf:
nameserver 127.0.0.1
search site some.other.nz

I never had ::1 in there - perhaps I should have had. No ISP in New Zealand offers ipv6 so it's not much use to anyone here and I tend to ignore it.

Comment 12 Thiago Macieira 2005-03-23 00:57:15 UTC

Can anyone reproduce this at will? Or at least, after some trying, can get it to happen?

I cannot solve the problem if I can't find its source. An strace could help me.

Comment 13 sts 2005-03-31 13:28:55 UTC

I have the same problem here on different systems with different DNS servers. After a reload the site works fine, but it's poor usability and I consider it a serious bug. Other browsers work fine.

Comment 14 Thiago Macieira 2005-03-31 17:42:10 UTC

I know it is a big problem, but I can't solve it if I can't find it. I've said it already.

KDE resolves all hosts properly for me.

Comment 15 Stephan Kulow 2005-04-05 12:09:00 UTC

*** Bug 89613 has been marked as a duplicate of this bug. ***

Comment 16 Stephan Kulow 2005-04-05 12:22:17 UTC

please everyone: try to stop nscd and see if it stays reproducible

Comment 17 Volker Kuhlmann 2005-04-05 13:41:55 UTC

> please everyone: try to stop nscd and see if it stays reproducible


Been there, tried that. Still getting the same resolver errors. Wouldn't
other browsers go through nscd too? If so, those other browsers don't
show resolver problems. I doubt it has to do with nscd.

Volker

Comment 18 Thiago Macieira 2005-04-05 13:47:48 UTC

Can someone who can reproduce this problem try this:

killall kio_http
strace -o /tmp/kdeinit.trace -f -p <kdeinit's PID>

Then make the problem show and send us the trace file.

Just for the heck of it: can you also try to run "kdeinit" and see if the problem disappears?

Comment 19 Thiago Macieira 2005-04-08 13:59:15 UTC

CVS commit by thiago: 

Fixing the random resolver failures in the code. It was a local error
after all, so I apologise for being hard on the bug reporters. You
know how developers are protective of their own code :-)

Many thanks to the patient bug reporters and to Coolo for his analysis
of the problem.
BUG:94703

The reason this bug happened was quite insidious. It was related to
some events occurring in a very particular order in different threads,
that's why it appeared to be random. 
- the lookups are started (KResolver::start())
- KResolver::wait() is called on the master thread
- the lookups finish on the auxiliary threads
- the resolver code detects the auxiliary lookups being done and
  processes the results (KResolverManager::doNotifying()), thereby
  waking up all threads on KResolver::wait()
- the master thread is woken up now
- here's the catch: while the master thread is waking up, the manager
  thread has started processing the final results
  (KResolverManager::handleFinishedItem()) and sets status to
  KResolver::Success
- the master thread thinks the resolving is done and emits the
  finished(...) signal with an empty KResolverResult list!
- after that, the manager thread collects the auxiliary results,
  builds the main results and emits the signal again, but it's too late,
  since an error will have already been reported

After understanding the error, I am actually surprised it hasn't
happened more often, least of all with me. I am betting it's the
different threading implementations that cause the different
behaviour, or the fact that people were using dual-processor or
dual-core systems (which can do threading better than my single-core
CPU).


  M +3 -10     kresolvermanager.cpp   1.35


--- kdelibs/kdecore/network/kresolvermanager.cpp  #1.34:1.35
@@ -413,11 +413,5 @@ void KResolverManager::releaseData(KReso
   if (data->obj)
     {
-      if (data->nRequests > 0)
-        // PostProcessing means "we're done with our blocking stuff, but we're waiting
-        // for some child request to finish"
         data->obj->status = KResolver::PostProcessing;  
-      else
-        // this may change after post-processing
-        data->obj->status = data->worker->results.isEmpty() ? KResolver::Failed : KResolver::Success;
     }
       
@@ -484,5 +478,5 @@ bool KResolverManager::handleFinishedIte
       // this one has finished
       if (curr->obj)
-        curr->obj->status = KResolver::Success; // this may change after the post-processing
+        curr->obj->status = KResolver::PostProcessing; // post-processing is run in doNotifying()
 
       if (curr->requestor)
@@ -531,6 +525,5 @@ KResolverWorkerBase* KResolverManager::f
           // good, this one says it can process
           if (worker->m_finished)          
-            p->status = !worker->results.isEmpty() ?
-              KResolver::Success : KResolver::Failed;
+            p->status = KResolver::PostProcessing;
           else
             p->status = KResolver::Queued;
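
The interleaving listed in the commit message can be replayed deterministically in a single thread. The sketch below is a toy model, not the kresolvermanager.cpp code: `Request`, `Report`, `markFinished`, `postProcess`, `onWake` and `replayReportsSpuriousError` are hypothetical names, and only the status names `Queued`, `PostProcessing`, `Success` and `Failed` are borrowed from the diff. It assumes, as a simplification, that a waiter which sees a final status together with an empty result list reports an error.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Status { Queued, PostProcessing, Success, Failed };

struct Request {
    Status status = Status::Queued;
    std::vector<std::string> childResults;  // found by the auxiliary lookups
    std::vector<std::string> results;       // merged list, filled by postProcess()
};

struct Report {
    bool emitted;  // the waiter considered the request finished
    bool error;    // ...and what it saw was an empty result list
};

// Step 1 of finishing (manager thread): mark the request.
// buggy == true models the old code, which set Success right away.
void markFinished(Request& r, bool buggy) {
    r.status = buggy ? Status::Success : Status::PostProcessing;
}

// Step 2 of finishing (manager thread): merge the child results and
// only then decide on the final status.
void postProcess(Request& r) {
    r.results = r.childResults;
    r.status = r.results.empty() ? Status::Failed : Status::Success;
}

// What the waiting (master) thread does each time it is woken up.
Report onWake(const Request& r) {
    if (r.status == Status::Success || r.status == Status::Failed)
        return {true, r.results.empty()};
    return {false, false};  // not final yet: go back to waiting
}

// Replays the unlucky ordering: the waiter wakes up between step 1 and step 2.
// Returns true if the waiter reported an error for a lookup that succeeded.
bool replayReportsSpuriousError(bool buggy) {
    Request r;
    r.childResults = {"192.0.2.1"};  // documentation-range address, made up

    markFinished(r, buggy);
    Report early = onWake(r);  // waiter runs before post-processing
    postProcess(r);
    Report late = onWake(r);   // waiter runs after post-processing

    assert(late.emitted && !late.error);  // the late view is always correct
    return early.emitted && early.error;
}
```

`replayReportsSpuriousError(true)` models the code before the patch and `replayReportsSpuriousError(false)` the patched code, where the early waiter sees `PostProcessing` and goes back to waiting instead of emitting an empty result.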

Comment 20 Thiago Macieira 2005-04-08 14:00:39 UTC

CVS commit by thiago: 

Backporting the fix for the "random resolver failure" problem to KDE 3.4.x.
BACKPORT:1.34:1.35
CCBUG:94703


  M +3 -10     kresolvermanager.cpp   1.34.2.1


--- kdelibs/kdecore/network/kresolvermanager.cpp  #1.34:1.34.2.1
@@ -413,11 +413,5 @@ void KResolverManager::releaseData(KReso
   if (data->obj)
     {
-      if (data->nRequests > 0)
-        // PostProcessing means "we're done with our blocking stuff, but we're waiting
-        // for some child request to finish"
         data->obj->status = KResolver::PostProcessing;  
-      else
-        // this may change after post-processing
-        data->obj->status = data->worker->results.isEmpty() ? KResolver::Failed : KResolver::Success;
     }
       
@@ -484,5 +478,5 @@ bool KResolverManager::handleFinishedIte
       // this one has finished
       if (curr->obj)
-        curr->obj->status = KResolver::Success; // this may change after the post-processing
+        curr->obj->status = KResolver::PostProcessing; // post-processing is run in doNotifying()
 
       if (curr->requestor)
@@ -531,6 +525,5 @@ KResolverWorkerBase* KResolverManager::f
           // good, this one says it can process
           if (worker->m_finished)          
-            p->status = !worker->results.isEmpty() ?
-              KResolver::Success : KResolver::Failed;
+            p->status = KResolver::PostProcessing;
           else
             p->status = KResolver::Queued;