Version:           1.3.1 (using KDE 3.2.3)
Installed from:    Debian testing/unstable Packages
Compiler:          gcc 3.3.4
OS:                Linux

On apparently random occasions, artsd hangs during or immediately after playing a sound. After a minute or so, the hanging process is terminated and a warning appears in a message box ("cpu overload"). If artsd is configured to run with highest priority, the whole system hangs for that minute.

Below is all the information I could think of that might be useful, in no particular order ;-)

All files I could reproduce the problem with were desktop sounds, i.e. they should be wav, I assume.
The machine is a 21164A.
The sound system is alsa.
artsd is running with the following command line (according to ps):
/usr/bin/artsd -F 18 -S 4096 -s 2 -m artsmessage -c drkonqi -l 3 -f
A snippet from the strace output of a hanging artsd is appended at the end.
Even when artsd "sleeps" and frees the sound device, I have two artsd processes running. Only one of them hangs; perhaps there's only one process left, but I'm not sure about that.
No relevant output in .xsession-errors.
No relevant output in /var/log/messages.
The previous release of artsd in Debian/testing didn't show these problems.

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: alpha
Kernel: Linux 2.4.26
Locale: LANG=de_DE@euro, LC_CTYPE=de_DE@euro

Versions of packages libarts1 depends on:
ii  libartsc0        1.3.0-1          aRts Sound system C support librar
ii  libasound2       1.0.6-2          Advanced Linux Sound Architecture
ii  libaudio2        1.6d-2           The Network Audio System (NAS). (s
ii  libaudiofile0    0.2.6-4          Open-source version of SGI's audio
ii  libc6.1          2.3.2.ds1-16     GNU C Library: Shared libraries an
ii  libesd0          0.2.29-1         Enlightened Sound Daemon - Shared
ii  libgcc1          1:3.4.1-4sarge1  GCC support library
ii  libglib2.0-0     2.4.6-3          The GLib library of C routines
ii  libice6          4.3.0.dfsg.1-8   Inter-Client Exchange library
ii  libjack0.80.0-0  0.98.1-5         JACK Audio Connection Kit (librari
ii  libmad0          0.15.1b-1        MPEG audio decoder library
ii  libogg0          1.1.0-1          Ogg Bitstream Library
ii  libpng12-0       1.2.5.0-7        PNG library - runtime
ii  libqt3c102-mt    3:3.3.3-4.1      Qt GUI Library (Threaded runtime v
ii  libsm6           4.3.0.dfsg.1-8   X Window System Session Management
ii  libstdc++5       1:3.3.4-13       The GNU Standard C++ Library v3
ii  libvorbis0a      1.0.1-1          The Vorbis General Audio Compressi
ii  libvorbisenc2    1.0.1-1          The Vorbis General Audio Compressi
ii  libvorbisfile3   1.0.1-1          The Vorbis General Audio Compressi
ii  libx11-6         4.3.0.dfsg.1-8   X Window System protocol client li
ii  libxext6         4.3.0.dfsg.1-8   X Window System miscellaneous exte
ii  libxt6           4.3.0.dfsg.1-8   X Toolkit Intrinsics
ii  xlibs            4.3.0.dfsg.1-8   X Window System client libraries m
ii  zlib1g           1:1.2.1.1-7      compression library - runtime
Created attachment 8728 [details]
Patch to cure this bug

I found out that the bug is caused by a broken pipe error on the alsa output channel. This error condition is tested in several methods of the class AudioIOALSA, and there are also methods available to recover the channel from it. I therefore take the broken condition of the output channel as something "given" and didn't try to find out why it occurs.

Such a test is missing in at least one method, and, unfortunately, it's exactly the place where a broken pipe would be detected after the modification of event dispatching between versions 1.2 and 1.3: after artsd wakes up from a select on the active file descriptors, the first alsa snd_* function called is, via AudioIOALSA::getParam, snd_pcm_avail_update. Its return value was not checked at all but passed directly through snd_pcm_frames_to_bytes. In the error case, this returns a negative number (a corresponding multiple of the true error code). The event processor only checks whether the returned value is > 0 and otherwise returns from the event handling method immediately. So none of the methods that handle the broken pipe condition in AudioIOALSA ever gets called.

As a result, the select call of the event dispatcher immediately returns because bytes could be written to output, and the event dispatcher calls the corresponding event handler, which immediately returns because it assumes that it can't put any bytes to output. This results in the described busy hang of artsd, or even a machine hang if artsd is run with "real time priority" settings.

The patch I appended copies the usual failure recovery from the broken pipe condition, found e.g. in the playback methods of AudioIOALSA, to the getParam method. In practice, this works well, i.e. the audio channel has always been usable after returning from the method in all my tests.
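To illustrate the failure mode described above, here is a minimal sketch that does not touch ALSA at all. FakePcm, canWriteUnchecked, canWriteChecked and idleWakeups are hypothetical names made up for this illustration; they only mimic the behaviour attributed above to snd_pcm_avail_update, snd_pcm_frames_to_bytes and the event handler, and are not the code from the attachment.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for an ALSA playback channel; not the real API.
struct FakePcm {
    bool broken = true;     // channel starts in the broken-pipe state
    long freeFrames = 1024; // frames available once the channel works
    int bytesPerFrame = 4;
    int prepareCalls = 0;

    // Mimics snd_pcm_avail_update: free frames, or a negative error code.
    long availUpdate() const { return broken ? -EPIPE : freeFrames; }
    // Mimics a channel restart: clears the broken-pipe state.
    int prepare() { broken = false; ++prepareCalls; return 0; }
    // Mimics snd_pcm_frames_to_bytes: a plain scaling, so a negative
    // error code turns into a negative multiple of itself.
    long framesToBytes(long frames) const { return frames * bytesPerFrame; }
};

// Unchecked variant: the error code is scaled like a frame count.
long canWriteUnchecked(FakePcm& pcm) {
    return pcm.framesToBytes(pcm.availUpdate());
}

// Checked variant: recover from the broken pipe, then ask again.
long canWriteChecked(FakePcm& pcm) {
    long frames = pcm.availUpdate();
    if (frames == -EPIPE) {
        if (pcm.prepare() < 0) return 0;
        frames = pcm.availUpdate();
    }
    if (frames < 0) return 0; // any other error: report nothing writable
    return pcm.framesToBytes(frames);
}

// Simulates the dispatcher: every wakeup calls the handler, which only
// writes when the reported byte count is positive. Returns the number
// of wakeups that did no work.
int idleWakeups(FakePcm& pcm, long (*canWrite)(FakePcm&), int wakeups) {
    int idle = 0;
    for (int i = 0; i < wakeups; ++i)
        if (canWrite(pcm) <= 0) ++idle;
    return idle;
}
```

With the unchecked variant every simulated wakeup is wasted, which corresponds to the busy hang; with the checked variant the channel is restarted on the first wakeup and the handler sees free bytes from then on.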
Additionally, I put a test into audiosubsys to abort artsd if the event handling function for the "ready to write" event finds that no bytes are free on the output channel. This appears to be an obvious inconsistency in the sound system, no matter what the underlying implementation is. I did not test this, however, with implementations other than AudioIOALSA.
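The consistency test could look roughly like the following sketch. onWriteReady is a hypothetical name, and the sketch throws an exception where the patch aborts artsd, purely so that the behaviour can be exercised in a test; it is not the code from the attachment.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>

// Hypothetical handler for the "ready to write" event. Returns how many
// of the pending bytes would be written. A write-ready event with no
// free bytes is treated as an inconsistency (the patch aborts here;
// this sketch throws instead).
long onWriteReady(long bytesFree, long bytesPending) {
    if (bytesFree <= 0)
        throw std::logic_error("write-ready event, but no bytes free");
    return std::min(bytesFree, bytesPending);
}
```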
Nice, but you cannot assume you can write just because ALSA has restarted you. Far too many drivers are buggy and just restart the client every now and then.
Created attachment 8734 [details]
Restart pcm channels with PIPE error condition also from getParam method

I have now put the error checking/restart/xrun code into a separate method to be used from all methods where it may be relevant. The idea is that the handleError method checks a supplied return code for error conditions it can handle. If it finds one, it will restart or xrun the relevant pcm channel and return -EAGAIN to the caller. So getParam wraps handleError around the called function and, if the result is -EAGAIN, calls that function again.

This already saves a lot of code duplication, though the solution is still not too elegant, as the caller still has to check for a certain error condition and re-call the failed function. To cure this design flaw, I can only think of a hierarchy of sound action objects that each describe a specific alsa call and are supplied to an issueCall method or so, which handles error checking and re-calling itself. But this again seems to be stupidly over-designed.
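The calling convention described above can be sketched as follows. FakeChannel and availWithRetry are hypothetical names, and handleError here only mirrors the described contract (return -EAGAIN after a successful restart, otherwise pass the code through); it is not the code from the attachment. The issueCall template at the end is one assumed, lighter-weight take on the "issueCall" idea that uses a callable instead of a hierarchy of action objects.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for a pcm channel; real code would call snd_*.
struct FakeChannel {
    bool broken = true;
    int restarts = 0;
    long avail() const { return broken ? -EPIPE : 512; }
    int restart() { broken = false; ++restarts; return 0; }
};

// Checks a return code for a condition it can handle. If the channel
// was restarted, returns -EAGAIN so the caller repeats the call;
// otherwise the code is passed through unchanged.
long handleError(FakeChannel& ch, long rc) {
    if (rc == -EPIPE && ch.restart() == 0)
        return -EAGAIN;
    return rc;
}

// Caller-side pattern: wrap the call, repeat it once on -EAGAIN.
long availWithRetry(FakeChannel& ch) {
    long rc = handleError(ch, ch.avail());
    if (rc == -EAGAIN)
        rc = handleError(ch, ch.avail());
    return rc;
}

// Assumed alternative: pass the call itself as a callable, so the
// check-and-repeat logic lives in exactly one place.
template <typename Call>
long issueCall(FakeChannel& ch, Call call) {
    long rc = handleError(ch, call(ch));
    if (rc == -EAGAIN)
        rc = handleError(ch, call(ch));
    return rc;
}
```

A caller would then write something like issueCall(ch, [](FakeChannel& c) { return c.avail(); }) and never see the -EAGAIN case itself. Both variants repeat the call only once, so a channel that keeps breaking does not turn the retry into a loop of its own.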
The problem is still there in Version 1.4.3
It's gone for a while. Probably with version 1.5.