SUMMARY
Was having some stability problems (see also bug #406638), so I decided to run kstars under Valgrind on amd64. Valgrind reported a large number of troubling uninitialized-data errors, but the most critical one I saw after a bit of testing was a clear use-after-free.

STEPS TO REPRODUCE
1. Start kstars under valgrind with some appropriate suppressions.
2. Open Ekos and start INDI with simulators. I used a guider as well as a CCD.
3. This error appeared around the time I clicked "Capture" in the Guide window, though it seemed somewhat random.

OBSERVED RESULT
org.kde.kstars.ekos.align: "Capturing image..."
org.kde.kstars.fits: Loading FITS file "/tmp/fitsu28517.fits"
Found one coordinate representation.
org.kde.kstars.ekos.align: "Image received."
org.kde.kstars.ekos.align: "Capturing dark frame..."
org.kde.kstars.fits: Loading FITS file "/tmp/fitsN28517.fits"
org.kde.kstars.ekos.align: "Dark frame received."
org.kde.kstars.fits: Loading FITS file "/tmp/fitsN28517.fits"
org.kde.kstars.fits: Saved FITS file: "/home/elladan/.local/share/kstars/darks/darkframe_2019-04-26T18-42-24.fits"
org.kde.kstars.ekos.align: "Dark frame saved to /home/elladan/.local/share/kstars/darks/darkframe_2019-04-26T18-42-24.fits"
org.kde.kstars.ekos.align: "Starting solver..."

==28517== Thread 8 Thread (pooled):
==28517== Invalid write of size 1
==28517==    at 0x35114C: operator() (fitsview.cpp:603)
==28517==    by 0x35114C: QtConcurrent::StoredFunctorCall0<void, bool FITSView::rescale<unsigned short>(FITSZoom)::{lambda()#1}>::runFunctor() (qtconcurrentstoredfunctioncall.h:70)
==28517==    by 0x35017A: QtConcurrent::RunFunctionTask<void>::run() (qtconcurrentrunbase.h:136)
==28517==    by 0xA4392B1: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xA43C17C: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xB5526DA: start_thread (pthread_create.c:463)
==28517==    by 0xC7EE88E: clone (clone.S:95)
==28517==  Address 0x7b56f418 is 730,072 bytes inside a block of size 24,000,000 free'd
==28517==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28517==    by 0x9886A44: QImageData::~QImageData() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9886AB6: QImage::~QImage() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9887C98: QImage::detach() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9887D26: QImage::scanLine(int) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x3510F7: operator() (fitsview.cpp:599)
==28517==    by 0x3510F7: QtConcurrent::StoredFunctorCall0<void, bool FITSView::rescale<unsigned short>(FITSZoom)::{lambda()#1}>::runFunctor() (qtconcurrentstoredfunctioncall.h:70)
==28517==    by 0x35017A: QtConcurrent::RunFunctionTask<void>::run() (qtconcurrentrunbase.h:136)
==28517==    by 0xA4392B1: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xA43C17C: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xB5526DA: start_thread (pthread_create.c:463)
==28517==    by 0xC7EE88E: clone (clone.S:95)
==28517==  Block was alloc'd at
==28517==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28517==    by 0x9886D1E: QImageData::create(QSize const&, QImage::Format) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9886E9A: QImage::QImage(QSize const&, QImage::Format) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9886ED4: QImage::QImage(int, int, QImage::Format) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x98874B2: QImage::copy(QRect const&) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9887C7E: QImage::detach() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x9887D26: QImage::scanLine(int) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.5)
==28517==    by 0x3510F7: operator() (fitsview.cpp:599)
==28517==    by 0x3510F7: QtConcurrent::StoredFunctorCall0<void, bool FITSView::rescale<unsigned short>(FITSZoom)::{lambda()#1}>::runFunctor() (qtconcurrentstoredfunctioncall.h:70)
==28517==    by 0x35017A: QtConcurrent::RunFunctionTask<void>::run() (qtconcurrentrunbase.h:136)
==28517==    by 0xA4392B1: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xA43C17C: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==28517==    by 0xB5526DA: start_thread (pthread_create.c:463)
==28517==

EXPECTED RESULT
Use before free. :-)

SOFTWARE/OS VERSIONS
KStars: Build: 2019-04-14T19:19:24Z
Linux Mint 19.1 (Tessa)
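For context, the two stacks above describe a write into a QImage pixel buffer from a QtConcurrent pool thread (fitsview.cpp:603) landing in memory that had already been freed when the image was detached and destroyed. A minimal, hypothetical sketch of that pattern follows; the class and member names are invented for illustration and this is not the actual fitsview.cpp code:

// Hypothetical sketch (not the kstars source): a background task writes into
// a QImage through scanLine() while another thread reassigns the same QImage,
// so the pixel buffer held by the worker is freed underneath it.
#include <QImage>
#include <QtConcurrent/QtConcurrent>

class Viewer
{
  public:
    void startRescale()
    {
        // The worker captures `this` and touches `display` on a pool thread.
        future = QtConcurrent::run([this]()
        {
            for (int y = 0; y < display.height(); ++y)
            {
                // scanLine() may detach, i.e. copy the image and free the old buffer.
                uchar *line = display.scanLine(y);
                line[0] = 0; // ... fill the line with stretched pixel data ...
            }
        });
    }

    void loadNewImage(int w, int h)
    {
        // If a previous rescale is still running, this reassignment destroys
        // the old QImageData while the worker is still writing into it.
        display = QImage(w, h, QImage::Format_Indexed8);
    }

  private:
    QImage display;
    QFuture<void> future;
};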
Thanks for looking into this. You're right, the switch to threading caused some regressions. I'll look into it. Since you probably already know more about this than I do: any idea what sequence of events leads to this? How does the FITSView image get freed before rescale() finishes?
Bug #407074 is related.

Any idea what sequence of events leads to the use-after-free? I'm still trying to pin down the issue, so any help would be appreciated.
(In reply to Jasem Mutlaq from comment #2)
> Bug #407074 is related.
>
> Any idea what sequence of events leads to the use-after-free? I'm still
> trying to pin down the issue, so any help would be appreciated.

I'll attempt a more specific repro as soon as I get a chance.
Git commit 161d65aa7d10bd7f17b6a7006e8f3bde24f2227f by Jasem Mutlaq.
Committed on 01/05/2019 at 08:25.
Pushed by mutlaqja into branch 'master'.

If previous image is not done loading, wait until the future is completed.
This could fix:
Related: bug 407088, bug 407089, bug 407074
FIXED-IN: 3.2.2

M  +2    -0    kstars/fitsviewer/fitsview.cpp

https://commits.kde.org/kstars/161d65aa7d10bd7f17b6a7006e8f3bde24f2227f
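In terms of the hypothetical Viewer sketch earlier in this report, the shape of the fix is roughly the following; this is an illustrative guard only, not the actual two-line change in kstars/fitsviewer/fitsview.cpp:

// Illustrative only: before replacing the image that a previous rescale task
// may still be writing to, block until that task has finished.
void Viewer::loadNewImage(int w, int h)
{
    if (future.isRunning())
        future.waitForFinished(); // let the pending rescale drain first

    display = QImage(w, h, QImage::Format_Indexed8);
}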
I built kstars with your bug fix and then attempted to repro under valgrind. I did not see any use-after-free errors. Great!

Note that running kstars under valgrind is somewhat frustrating. You need a release build or it's simply too slow, and I needed a number of suppressions to deal with the 15 million errors generated in various system libraries (especially DRM). In addition, you need to be very patient with a number of the GUI actions.

I found that kstars would often deadlock when doing things like starting or stopping the indi server, and I was forced to kill it (and crashed my window manager) several times. Clicking around while the indi server connection (slowly) processed would certainly put kstars in a bad state. Here's an example where I had to manually terminate the process (the SEGV is artificial):

==29697== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==29697==    at 0xB333D2D: __pthread_timedjoin_ex (pthread_join_common.c:89)
==29697==    by 0xBC2B8C2: std::thread::join() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==29697==    by 0x8D122E: INDI::BaseClient::disconnectServer() (in /home/elladan/astro/dev/kstars/build/kstars/kstars/kstars)
==29697==    by 0x42D331: DriverManager::stopDevices(QList<DriverInfo*> const&) (drivermanager.cpp:464)
==29697==    by 0x4A40B8: Ekos::Manager::cleanDevices(bool) (manager.cpp:984)
==29697==    by 0x4A4141: Ekos::Manager::stop() (manager.cpp:492)
==29697==    by 0xA42977E: QMetaObject::activate(QObject*, int, int, void**) (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.5)
==29697==    by 0x8CFEBA1: QAbstractButton::clicked(bool) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.9.5)
==29697==    by 0x8CFEDB9: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.9.5)
==29697==    by 0x8D00199: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.9.5)
==29697==    by 0x8D0038C: QAbstractButton::mouseReleaseEvent(QMouseEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.9.5)
==29697==    by 0x8C4C047: QWidget::event(QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.9.5)
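For anyone else trying to reproduce this under valgrind: the suppressions mentioned above are ordinary Memcheck suppression entries. The file below is illustrative only (names, error kinds, and object patterns are made up for the example); the easiest way to get real entries is to run once with --gen-suppressions=all and paste the blocks valgrind prints:

# Illustrative kstars.supp; generate real entries with --gen-suppressions=all.
# Quiet the DRM/Mesa driver noise:
{
   mesa-dri-uninitialised
   Memcheck:Cond
   ...
   obj:*/dri/*_dri.so
}
# Ignore "possibly lost" reports from Qt's own allocations:
{
   qt5core-possible-leaks
   Memcheck:Leak
   match-leak-kinds: possible
   ...
   obj:*/libQt5Core.so.5*
}
# Invoked roughly as:
#   valgrind --suppressions=kstars.supp --num-callers=30 ./kstars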
How about running just the release build (without valgrind)? Any stability issues encountered?
Yes, that is a known issue you reported; it sometimes happens in the INDI client library thread. Sometimes it gets stuck on pthread_join because the thread does not exit cleanly. I haven't found what leads to this situation; it's rare, but it happens.
(In reply to Jasem Mutlaq from comment #6)
> How about running just the release build (without valgrind)? Any stability
> issues encountered?

It's difficult for me to quantify that: the nature of memory corruption is that you experience uncorrelated crashes. More importantly, I just started using kstars, so I can't compare to old versions (which apparently had fewer threads?), but for me 3.x is extremely unstable. I'm not sure I've ever had a session which did not crash.

For example, I attempted to repro Bug #406638 with this bug fix on my Raspberry Pi to see if it was related, and I just gave up for now. indi_simulator_guide crashes 100% of the time when I run through my repro sequence on ARM today (it crashes when I capture in the Guide tab). I also experienced kstars crashing but wasn't able to get any debug info from the core. I also saw at least one crash with this patch on amd64 but didn't acquire any usable debug info there either. I think the patch improves things, at least.

(In reply to Jasem Mutlaq from comment #7)
> the thread does not exit cleanly. I haven't found what leads to this
> situation; it's rare, but it happens.

Have I got a nice solution for you! Run it under valgrind, and all sorts of fun race conditions appear. :-)

More generally, I'd suggest adding random time-jitter debugging to make these sorts of bugs appear more often. For example, build with a random sleep around lock/unlock calls and then test.
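To make the time-jitter suggestion concrete, here is a minimal, hypothetical debug helper (not something kstars ships); the idea is just to perturb thread interleavings so latent races trigger more often:

// Hypothetical debug-only wrapper: sleep a random few milliseconds around
// lock/unlock so thread interleavings differ from run to run.
#include <QMutex>
#include <QThread>
#include <random>

class JitterMutex
{
  public:
    void lock()
    {
        jitter();
        mutex.lock();
    }

    void unlock()
    {
        mutex.unlock();
        jitter();
    }

  private:
    static void jitter()
    {
        thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<int> ms(0, 5);
        QThread::msleep(ms(rng));
    }

    QMutex mutex;
};

Swapping something like this in for a few suspect mutexes in a debug build (or sleeping at other synchronization points) can turn a rare hang into one that shows up within minutes.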
We made heavy use of threading in 3.x, and that is probably why it became less stable (though more responsive, I'd say). Hopefully we can tackle the rest of the issues; any pointers would be appreciated if you find any patterns of interest.