Version: 1.3.1 (using KDE KDE 3.2.3)
Installed from: Debian testing/unstable Packages
Compiler: gcc 3.3.4
On apparently random occasions, artsd hangs during or immediately after
playing a sound. After a minute or so, the hanging process is terminated
and a warning message shows up in a message box ("cpu overload"). If
artsd is configured to run with highest priority, the whole system hangs
for that minute.
Below I have tried to include as much potentially useful information as
I could think of, in no specific order ;-)
All files I could reproduce the problem with were desktop sounds, i.e.
they should be WAV files, I assume.
The machine is a 21164A.
The sound system is ALSA.
artsd is running with the following command line (according to ps):
/usr/bin/artsd -F 18 -S 4096 -s 2 -m artsmessage -c drkonqi -l 3 -f
A snippet from the strace output of a hanging artsd is appended at the end.
Even when artsd "sleeps" and frees the sound device, I have two artsd
processes running. Only one of them hangs; perhaps only one process is
left at that point, but I'm not sure about that.
No relevant output in .xsession-errors. No relevant output in
The previous release of artsd in Debian/testing didn't show these
-- System Information:
Debian Release: 3.1
APT prefers testing
APT policy: (990, 'testing')
Kernel: Linux 2.4.26
Locale: LANG=de_DE@euro, LC_CTYPE=de_DE@euro
Versions of packages libarts1 depends on:
ii libartsc0 1.3.0-1 aRts Sound system C support librar
ii libasound2 1.0.6-2 Advanced Linux Sound Architecture
ii libaudio2 1.6d-2 The Network Audio System (NAS). (s
ii libaudiofile0 0.2.6-4 Open-source version of SGI's audio
ii libc6.1 2.3.2.ds1-16 GNU C Library: Shared libraries an
ii libesd0 0.2.29-1 Enlightened Sound Daemon - Shared
ii libgcc1 1:3.4.1-4sarge1 GCC support library
ii libglib2.0-0 2.4.6-3 The GLib library of C routines
ii libice6 4.3.0.dfsg.1-8 Inter-Client Exchange library
ii libjack0.80.0-0 0.98.1-5 JACK Audio Connection Kit (librari
ii libmad0 0.15.1b-1 MPEG audio decoder library
ii libogg0 1.1.0-1 Ogg Bitstream Library
ii libpng12-0 184.108.40.206-7 PNG library - runtime
ii libqt3c102-mt 3:3.3.3-4.1 Qt GUI Library (Threaded runtime v
ii libsm6 4.3.0.dfsg.1-8 X Window System Session Management
ii libstdc++5 1:3.3.4-13 The GNU Standard C++ Library v3
ii libvorbis0a 1.0.1-1 The Vorbis General Audio Compressi
ii libvorbisenc2 1.0.1-1 The Vorbis General Audio Compressi
ii libvorbisfile3 1.0.1-1 The Vorbis General Audio Compressi
ii libx11-6 4.3.0.dfsg.1-8 X Window System protocol client li
ii libxext6 4.3.0.dfsg.1-8 X Window System miscellaneous exte
ii libxt6 4.3.0.dfsg.1-8 X Toolkit Intrinsics
ii xlibs 4.3.0.dfsg.1-8 X Window System client libraries m
ii zlib1g 1:220.127.116.11-7 compression library - runtime
Created attachment 8728
Patch to cure this bug
I found out that the bug is caused by a broken pipe error on the ALSA output
channel. This error condition is tested for in several methods of the class
AudioIOALSA, and there are methods available to recover the channel from this
condition. I therefore take the broken condition of the output channel as
"given" and didn't try to find out why it occurs.
Such a test is missing in at least one method, and, unfortunately, it's exactly
the place where a broken pipe would be detected after the modification of event
dispatching between versions 1.2 and 1.3:
After artsd wakes up from a select on the active file descriptors, the first
ALSA snd_* function called is snd_pcm_avail_update, via AudioIOALSA::getParam.
Its return value was not checked at all but passed directly through
snd_pcm_frames_to_bytes. In the error case, this returns a negative number (a
corresponding multiple of the true error code). The event processor only checks
whether the returned value is > 0 and otherwise returns from the event handling
method immediately. So none of the methods that handle the broken pipe
condition in AudioIOALSA ever gets called.
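
To illustrate the unchecked path described above, here is a minimal sketch. It
does not use the real ALSA API: fake_avail_update and fake_frames_to_bytes are
hypothetical stand-ins for snd_pcm_avail_update and snd_pcm_frames_to_bytes,
and the frame size of 4 bytes is an assumption chosen only for the example.

```cpp
#include <cassert>
#include <cerrno>

// Assumed frame size for this sketch only (e.g. 16-bit stereo).
constexpr long kBytesPerFrame = 4;

// Hypothetical stand-in for snd_pcm_avail_update: returns a frame count,
// or a negative errno value when the channel is in the broken pipe state.
long fake_avail_update(bool pipeBroken) {
    return pipeBroken ? -EPIPE : 1024;
}

// Hypothetical stand-in for snd_pcm_frames_to_bytes: a plain scaling,
// which scales a negative error code just like a frame count.
long fake_frames_to_bytes(long frames) {
    return frames * kBytesPerFrame;
}

// The unchecked variant: the error code is converted as if it were frames.
long canWriteUnchecked(bool pipeBroken) {
    return fake_frames_to_bytes(fake_avail_update(pipeBroken));
}

// The event handler's test as described: only a positive value leads to
// any further work, so a scaled error code is silently ignored.
bool handlerWouldWrite(long canWrite) {
    return canWrite > 0;
}
```

With the pipe broken, canWriteUnchecked yields a negative multiple of EPIPE and
handlerWouldWrite is false, so no recovery code is ever reached.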
As a result, the select call of the event dispatcher returns immediately
because bytes could be written to output, and the event dispatcher calls the
corresponding event handler, which returns immediately because it assumes that
it can't put any bytes to output. This results in the described busy hang of
artsd, or even a machine hang if artsd is run with "real time priority" settings.
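
The resulting loop can be sketched as a bounded simulation. This is not the
aRts dispatcher; simulateDispatcher and LoopStats are hypothetical names, and
the iteration limit exists only so that the example terminates (the real loop
has no such limit).

```cpp
#include <cassert>

struct LoopStats {
    int wakeups;        // how often the simulated select returned
    long bytesWritten;  // how much output was actually produced
};

// canWrite is the value the handler sees (negative in the broken pipe case).
LoopStats simulateDispatcher(long canWrite, int maxWakeups) {
    assert(maxWakeups >= 0);
    LoopStats stats{0, 0};
    for (int i = 0; i < maxWakeups; ++i) {
        ++stats.wakeups;       // select returns at once: fd reported writable
        if (canWrite <= 0)
            continue;          // handler bails out, channel state is unchanged
        stats.bytesWritten += canWrite;
    }
    return stats;
}
```

With a negative canWrite every wakeup is wasted: the wakeup count grows while
bytesWritten stays at zero, which is the busy hang.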
The patch I attached copies the usual failure recovery from the broken pipe
condition, found e.g. in the playback methods of AudioIOALSA, to the getParam
method. In practice, this works well, i.e. the audio channel has always been
usable after returning from the method in all my tests.
Additionally, I put a test into audiosubsys to abort artsd in the case that the
event handling function for the "ready to write" event finds that no bytes are
free on the output channel. This appears to be an obvious inconsistency in
the sound system, no matter what the underlying implementation is. I did not,
however, test this with implementations other than AudioIOALSA.
Nice, but you cannot assume you can write just because ALSA has restarted you. Far too many drivers are buggy and just restart the client every now and then.
Created attachment 8734
Restart pcm channels with PIPE error condition also from getParam method
I have now put the error checking/restart/xrun code into a separate method to
be used from all methods where it may be relevant. The idea is that the
handleError method checks a supplied return code for error conditions it can
handle. If it finds one, it will restart or xrun the relevant pcm channel and
return -EAGAIN to the caller. So getParam wraps handleError around the called
function, and if the result is -EAGAIN, calls that function again.
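
The calling convention described above could look roughly like the following
sketch. It is not the attached patch: FakeChannel and availWithRetry are
hypothetical, the channel is simulated instead of using an snd_pcm_t handle,
and recover() merely stands in for whatever restart/xrun handling the real
code performs.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical simulated channel that starts out in the broken pipe state.
struct FakeChannel {
    bool broken = true;
    int recoveries = 0;

    // Stand-in for the wrapped ALSA call: frames available or negative errno.
    long availUpdate() const { return broken ? -EPIPE : 512; }

    // Stand-in for the restart/xrun handling of the real channel.
    void recover() { broken = false; ++recoveries; }
};

// Checks a return code. If it is an error this function can handle, the
// channel is recovered and -EAGAIN is returned; otherwise rc passes through.
long handleError(FakeChannel& ch, long rc) {
    if (rc == -EPIPE) {
        ch.recover();
        return -EAGAIN;
    }
    return rc;
}

// The caller side: wrap the call in handleError and retry once on -EAGAIN.
long availWithRetry(FakeChannel& ch) {
    long rc = handleError(ch, ch.availUpdate());
    if (rc == -EAGAIN)
        rc = handleError(ch, ch.availUpdate());
    return rc;
}
```

The sketch retries only once; whether a single retry is enough for real
drivers is exactly the kind of assumption the previous comment warns about.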
It already saves a lot of code duplication, though the solution is still not
too elegant, as the caller still has to check for a certain error condition and
re-call the failed function. To cure this design flaw, I can only think of a
hierarchy of sound action objects that each describe a specific ALSA call and
are supplied to an issueCall method or so, which handles error checking and
re-calling itself. But this again seems to be stupidly over-designed.
The problem is still there in version 1.4.3.
It has been gone for a while now, probably since version 1.5.