433778 – Valgrind crashes when analyzing glusterfs client

Bug 433778 - Valgrind crashes when analyzing glusterfs client

Summary: Valgrind crashes when analyzing glusterfs client

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	memcheck (show other bugs)
Version:	unspecified
Platform:	FreeBSD Ports FreeBSD

Importance:	NOR crash
Target Milestone:	---
Assignee:	Paul Floyd

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-03-01 08:09 UTC by CES
Modified:	2021-03-03 21:12 UTC (History)
CC List:	1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description CES 2021-03-01 08:09:36 UTC

SUMMARY

Glusterfs client on freebsd is suffering a memory leak and gets OOM killed, 
even with most recent glusterfs 8.3 on most recent freebsd 12.2/13.

Statedump of glusterfs doesn't show anything helpfull, no over-sized objects are shown. 
To further analyze this issue, I wanted to start the glusterfs client with valgrind.

Starting glusterfs client with valgrind results in some warings, an error message from valgrind and then shuts down.
(see logs below)

STEPS TO REPRODUCE
1. Install valgrind 3.17.0 from freebsd pkg manager/ports
2. Install glusterfs from pkg manager/ports (there is still the older 8.0, 8.3 can be compiled from ports by switching 8.0 with the newer 8.3)
3. Start glusterfs under valgrind
(valgrind /usr/local/sbin/glusterfs --process-name fuse --no-daemon --volfile-server=gluster2 --volfile-id=/volume1 /mnt/glusterfs)

OBSERVED RESULT
Warnings about unhandlet amd64-freebsd syscalls are shown (570)
The following message is shown:
valgrind: m_syswrap/syswrap-main.c:2172 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'VG_(iseqsigset)(&tst->sig_mask, &tst->tmp_sig_mask)' failed.
valgrind/glusterfs shuts down => filesystem is not mounted and the memory leak can not be tested.

EXPECTED RESULT
Correctly start glusterfs under valgrind without crash should mount the glusterfs filesystem, 
filesystem load tests can be started, output of memory usage of glusterfs should be shown after unmounting the filesystem after the loadtest.

SOFTWARE/OS VERSIONS
FreeBSD: FreeBSD Webserver5 12.2-RELEASE-p3 FreeBSD 12.2-RELEASE-p3 GENERIC  amd64
GlusterFS: glusterfs-8.3
Valgrind: valgrind-devel-3.17.0.g20200723,1
(same happens with the forked valgrind here: https://github.com/paulfloyd/freebsd_valgrind)


ADDITIONAL INFORMATION

Full log of test run:
==20181== Memcheck, a memory error detector
==20181== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==20181== Using Valgrind-3.17.0.GIT and LibVEX; rerun with -h for copyright info
==20181== Command: /usr/local/sbin/glusterfs --process-name fuse --no-daemon --volfile-server=gluster2 --volfile-id=/volume1 /mnt/glusterfs
==20181==
--20181-- WARNING: unhandled amd64-freebsd syscall: 570
--20181-- You may be able to write your own handler.
--20181-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--20181-- Nevertheless we consider this a bug.  Please report
--20181-- it at http://valgrind.org/support/bug_reports.html.
--20181-- WARNING: unhandled amd64-freebsd syscall: 570
--20181-- You may be able to write your own handler.
--20181-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--20181-- Nevertheless we consider this a bug.  Please report
--20181-- it at http://valgrind.org/support/bug_reports.html.
==20592==
==20592== HEAP SUMMARY:
==20592==     in use at exit: 668,125 bytes in 90 blocks
==20592==   total heap usage: 130 allocs, 40 frees, 880,421 bytes allocated
==20592==
==20673==
==20673== HEAP SUMMARY:
==20673==     in use at exit: 668,399 bytes in 98 blocks
==20673==   total heap usage: 146 allocs, 48 frees, 881,543 bytes allocated
==20673==
==20592== LEAK SUMMARY:
==20592==    definitely lost: 0 bytes in 0 blocks
==20592==    indirectly lost: 0 bytes in 0 blocks
==20592==      possibly lost: 620,589 bytes in 81 blocks
==20592==    still reachable: 14,768 bytes in 8 blocks
==20592==         suppressed: 32,768 bytes in 1 blocks
==20592== Rerun with --leak-check=full to see details of leaked memory
==20592==
==20592== For lists of detected and suppressed errors, rerun with: -s
==20592== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==20673== LEAK SUMMARY:
==20673==    definitely lost: 224 bytes in 1 blocks
==20673==    indirectly lost: 50 bytes in 7 blocks
==20673==      possibly lost: 620,589 bytes in 81 blocks
==20673==    still reachable: 14,768 bytes in 8 blocks
==20673==         suppressed: 32,768 bytes in 1 blocks
==20673== Rerun with --leak-check=full to see details of leaked memory
==20673==
==20673== For lists of detected and suppressed errors, rerun with: -s
==20673== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

valgrind: m_syswrap/syswrap-main.c:2172 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'VG_(iseqsigset)(&tst->sig_mask, &tst->tmp_sig_mask)' failed.

host stacktrace:
==20181==    at 0x38102002: ??? (in /usr/local/lib/valgrind/memcheck-amd64-freebsd)
==20181==    by 0x40536CFDF: ???
==20181==    by 0x380FAB38: ??? (in /usr/local/lib/valgrind/memcheck-amd64-freebsd)
==20181==    by 0x38102001: ??? (in /usr/local/lib/valgrind/memcheck-amd64-freebsd)
==20181==    by 0x40536C83F: ???
==20181==    by 0x402010F6F: ???

sched status:
  running_tid=6

Thread 1: status = VgTs_WaitSys syscall 209 (lwpid 100118)
==20181==    at 0x4EB728A: _poll (in /lib/libc.so.7)
==20181==    by 0x4A4CA25: ??? (in /lib/libthr.so.3)
==20181==    by 0x49080D1: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x20EDDD: main (in /usr/local/sbin/glusterfsd)
client stack range: [0x7FBFFC000 0x7FC000FFF] client SP: 0x7FC000658
valgrind stack range: [0x4029AE000 0x402AADFFF] top usage: 7152 of 1048576

Thread 2: status = VgTs_WaitSys syscall 454 (lwpid 100370)
==20181==    at 0x4A5868C: ??? (in /lib/libthr.so.3)
==20181==    by 0x4A4BEAF: ??? (in /lib/libthr.so.3)
==20181==    by 0x4A55CAA: ??? (in /lib/libthr.so.3)
==20181==    by 0x48BEC2D: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x4A49FAB: ??? (in /lib/libthr.so.3)
client stack range: [0x7FFFDFDFE000 0x7FFFDFFFDFFF] client SP: 0x7FFFDFFFDEC8
valgrind stack range: [0x404E49000 0x404F48FFF] top usage: 3088 of 1048576

Thread 3: status = VgTs_WaitSys syscall 429 (lwpid 100371)
==20181==    at 0x4EA4C4A: _sigwait (in /lib/libc.so.7)
==20181==    by 0x4A4F97A: ??? (in /lib/libthr.so.3)
==20181==    by 0x20D9EA: glusterfs_sigwaiter (in /usr/local/sbin/glusterfsd)
==20181==    by 0x4A49FAB: ??? (in /lib/libthr.so.3)
client stack range: [0x7FFFDFBFD000 0x7FFFDFDFCFFF] client SP: 0x7FFFDFDFCF38
valgrind stack range: [0x404F4D000 0x40504CFFF] top usage: 2928 of 1048576

Thread 4: status = VgTs_WaitSys syscall 240 (lwpid 100372)
==20181==    at 0x4EA4E2A: _nanosleep (in /lib/libc.so.7)
==20181==    by 0x4A4C8FB: ??? (in /lib/libthr.so.3)
==20181==    by 0x4E0D58A: sleep (in /lib/libc.so.7)
==20181==    by 0x48D4277: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x4A49FAB: ??? (in /lib/libthr.so.3)
client stack range: [0x7FFFDF9FC000 0x7FFFDFBFBFFF] client SP: 0x7FFFDFBF9F08
valgrind stack range: [0x405055000 0x405154FFF] top usage: 3024 of 1048576

Thread 5: status = VgTs_WaitSys syscall 454 (lwpid 100373)
==20181==    at 0x4A5868C: ??? (in /lib/libthr.so.3)
==20181==    by 0x4A4BEAF: ??? (in /lib/libthr.so.3)
==20181==    by 0x4A55CAA: ??? (in /lib/libthr.so.3)
==20181==    by 0x48E7337: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x48E7944: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x4A49FAB: ??? (in /lib/libthr.so.3)
client stack range: [0x7FFFDF7FB000 0x7FFFDF9FAFFF] client SP: 0x7FFFDF9FAEA8
valgrind stack range: [0x405161000 0x405260FFF] top usage: 3024 of 1048576

Thread 6: status = VgTs_Runnable syscall 232 (lwpid 100374)
==20181==    at 0x4EC5C4A: _clock_gettime (in /lib/libc.so.7)
==20181==    by 0x48DD8D8: timespec_now (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x6A9AB85: ??? (in /usr/local/lib/glusterfs/8.3/xlator/cluster/afr.so)
==20181==    by 0x6ABA33A: ??? (in /usr/local/lib/glusterfs/8.3/xlator/cluster/afr.so)
==20181==    by 0x48E6695: ??? (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x4E04448: ??? (in /lib/libc.so.7)
client stack range: ??????? client SP: 0x773DED8
valgrind stack range: [0x40526D000 0x40536CFFF] top usage: 4312 of 1048576

Thread 7: status = VgTs_WaitSys syscall  93 (lwpid 100375)
==20181==    at 0x4F138DA: _select (in /lib/libc.so.7)
==20181==    by 0x4A4CCB1: ??? (in /lib/libthr.so.3)
==20181==    by 0x49237C2: runner (in /usr/local/lib/libglusterfs.so.0.0.1)
==20181==    by 0x4A49FAB: ??? (in /lib/libthr.so.3)
client stack range: [0x7FFFDF3F9000 0x7FFFDF5F8FFF] client SP: 0x7FFFDF5F8F18
valgrind stack range: [0x405379000 0x405478FFF] top usage: 3184 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Comment 1 CES 2021-03-01 19:43:33 UTC

Paul Floyd (porter of valgrind to freebsd) helped me to resolve this issue (simply comment out line 2172 (should always be that easy!)
https://github.com/paulfloyd/freebsd_valgrind/issues/155

I'm closing this issue here...

Comment 2 Paul Floyd 2021-03-03 16:01:01 UTC

No, there really is a problem, which I've now identified.

On FreeBSD swapcontext is a syscall. This can alter the registers and the signal mask. This means that the assert that the signal flags are the same at the end of VG_(client_syscall) as follows



   /* Make sure the tmp signal mask matches the real signal mask;
      sigsuspend may change this. */
   vg_assert(VG_(iseqsigset)(&tst->sig_mask, &tst->tmp_sig_mask));


needs to be changed to something like

   vg_assert(VG_(isswapcontext)(sysno) || VG_(iseqsigset)(&tst->sig_mask, &tst->tmp_sig_mask));

where the 'isswapcontext' function 
a) always returns false on platforms where swapcontext is not a syscall
b) returns true on FreeBSD if sysno is NR__swapcontext

Comment 3 Paul Floyd 2021-03-03 21:12:19 UTC

It was a bit easier than that. sig_mask was being set but not tmp_sig_mask. Now both get updated.