I have a program which fork/exec's ssh. With valgrind 3.0.0 I could run that program under valgrind control and ssh would be executed just fine. With 3.1.0 I get a permission denied error for ssh. I know that valgrind does not support suid programs, but I do not want to debug the suid ssh itself, only a normal program invoking it. I tried with --trace-children on and off, did not make a difference. Here's a test case and system information: $ uname -a Linux knscsl010 2.4.9-e.57smp #1 SMP Thu Dec 2 20:51:12 EST 2004 i686 unknown $ cat /etc/redhat-release Red Hat Linux Advanced Server release 2.1AS (Pensacola) $ cat do_ssh #!/bin/sh ssh "$@" $ valgrind --version valgrind-3.1.0 $ valgrind ./do_ssh ==27744== Memcheck, a memory error detector. ==27744== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al. ==27744== Using LibVEX rev 1471, a library for dynamic binary translation. ==27744== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP. ==27744== Using valgrind-3.1.0, a dynamic binary instrumentation framework. ==27744== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al. ==27744== For more details, rerun with: -v ==27744== ./do_ssh: /usr/bin/ssh: Permission denied ==27746== ==27746== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 2) ==27746== malloc/free: in use at exit: 0 bytes in 0 blocks. ==27746== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. ==27746== For counts of detected errors, rerun with: -v ==27746== No malloc'd blocks -- no leaks are possible. ==27744== ==27744== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 2) ==27744== malloc/free: in use at exit: 0 bytes in 0 blocks. ==27744== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. ==27744== For counts of detected errors, rerun with: -v ==27744== No malloc'd blocks -- no leaks are possible. When I look at the strace output, I see: ... 27862 stat64("/bin/ssh", 0xbefebb80) = -1 ENOENT (No such file or directory) 27862 stat64("/usr/bin/ssh", {st_mode=S_IFREG|S_ISUID|0755, st_size=220580, ...}) = 0 27862 getgroups32(32, [33500, 33531, 33528]) = 3 27862 stat64("/usr/bin/ssh", {st_mode=S_IFREG|S_ISUID|0755, st_size=220580, ...}) = 0 27862 mmap2(0x80da000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x80da000 27862 mmap2(0x80db000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x80db000 27862 rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0 27862 rt_sigtimedwait(~[INT CHLD], 0x62662dd4) = -1 EAGAIN (Resource temporarily unavailable) 27862 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], NULL, 8) = 0 27862 rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0 27862 fork() = 27864 27864 --- SIGSTOP (Stopped (signal)) @ 0 (0) --- 27862 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], <unfinished ...> 27864 gettid( <unfinished ...> ... 27864 write(2, "./do_ssh: /usr/bin/ssh: Permissi"..., 42) = 42 I would post the whole strace, but it is kind of large and might contain private information. Please let me know if you need further information.
There was recently a bug in VG_(getgroups) which caused process starts to fail. Can you check out the trunk and try that? (easy; http://www.valgrind.org/downloads/repository.html has the details)
I could reproduce this problem on Debian 3.0, but only by manually setting the suid bit on /usr/bin/ssh (normally it doesn't have that set). I also found a workaround: when executing a wrapper script instead of ssh directly in the program under valgrind control, then everything works unless I add --trace-children=yes.
I tried the trunk and it still fails.
As you have a workaround I'd like to close this. Ok ?
I'd prefer to get this fixed properly because the workaround would have to be installed for all future releases of valgrind, and other people might also run into it. After looking at the source it seems that check_executable in coregrind/m_libcfile.c is overly pessimistic about its ability to run the executable: it checks for SUID/SGID and flatly refuses to run such binaries even if they are not meant to be run under valgrind control, which in turn causes pre_exec_check and thus sys_execve to fail. I inserted a "return 0" at the top of check_executable and verified that this hacked valgrind executes my ssh with suid bit without any problems. How valid is the concern that SUID/SGID programs cannot be executed? With the hack above I can run ssh under valgrind control (I get some uninitialized memory reads; the ssh login into the remote machine works). Can the check be changed so that it allows them at least when they are not to be traced? The information whether the child is to be traced is available in sys_execve, but I am not sure whether this is the only relevant place and whether adding another parameter to pre_exec_check and sys_execve would be the right thing to do.
I'm not entirely sure why the exec wrapper is checking this - if the SUID bit is set and valgrind is controlling the child then the new valgrind instance that gets execed will refuse to load the child program anyway.
I'm not really qualified to answer Tom's question, but for the sake of moving this discussion along I'd like to point out that a comment in execve says that the check is there because valgrind would get into trouble recovering from the failure if loading the child failed. Whether that still is a valid concern I don't know. Some more thoughts: does valgrind still use LD_PRELOAD techniques? That definitely would fail with suid binaries, but as I said, with current valgrind I could analyze a suid binary just fine. On the other hand, further analysis shows that although a suid binary can be analyzed, that takes place without the suid bit having any effect (copy /usr/bin/id, set suid -> euid root without valgrind, no root with valgrind). So here's how I suggest to handle this: - let exec run suid binaries - if that means that they do not run under the intended user and/or group, print a warning refering the user to documentation which helps him sort out the problem The solution were running a suid binary simply fails with EACCESS is a lot harder to understand (okay, perhaps printing a warning isn't possible there - I don't know).
I encountered the same problem of having exec return that strange error. So I would also like to request a warning about this. An older version of valgrind may have printed such a warning: http://valgrind.org/docs/manual/dist.news.html says that valgrind 2.0.0 had this change: > - Don't fail silently if the executable is statically linked, or is > setuid/setgid. Print an error message instead. Was that error message accidentally removed from valgrind? Also, I suggest mentioning the setuid restriction in the manual where it describes --trace-children.
Note that the suid/sgid checks don't take into account whether the current effective user id/group id match the file to be execed. I.e. the check, as it is, is unduly restrictive: valgrind refuses to run an executable that is SUID, even if the SUID would have no effect on privileges. For example, the Oracle OCI libraries exec the $ORACLE_HOME/bin/oracle application (which is SUID) in order to communicate to a local database. Even if the client program is run as the same user (e.g. oracle) as the owner of the executable, valgrind will refuse to run it. Note also that valgrind will also refuse to run executables that are not readable (i.e. executables that are 'execute-only'). Obviously, if it must trace the to-be-execed executable, it needs read access to the executable. But if it does not need to trace the executable, then the only reason that it needs read access is so it can better guess as to whether execve will succeed (it checks the type of the file -- if it's neither ELF nor script, it knows execve will fail). There are several possible solutions to the problem. The most flexible (it seems to me) would be to not do any pre-exec checking if trace-children is not in effect, and to properly restore the process state if execve fails. Comments in the code suggest that this might be difficult. The pre-exec processing changes process state in three ways prior to invoking the system call: * it kills all threads * it resets some signal handling * it restores some limits (via setrlimit) The only non-recoverable thing in the above list is the killing of threads. I'm not 100% sure why valgrind does this in user code anyway -- the system call itself, as I understand it, will properly terminate all the threads anyway. The second solution would be to remove improve the pre-exec checking beforehand (i.e. don't worry about SUID if trace-children is not in effect. If it is, allow the exec if the effective user id matches the executables effective user id, and only check the format of the executable *if* it is readable). The third solution would be to allow the user to specify a 'whitelist' of executables that will be execed without checking, so that users can workaround the sometimes-overly-restrictive pre-exec checking. Additionally, in all cases, if an executable is 'execable' but not traceable (e.g. because it is suid, or is not readable), it should be an option to exec that particular child normally (without valgrind tracing). It may be the case that it would be useful to the user to trace those children which are traceable, and simply allow non-traceable children to execute normally.
> There are several possible solutions to the problem. The most flexible (it seems to me) would be to not do any pre-exec checking if trace-children is not in effect, and to properly restore the process state if execve fails. Comments in the code suggest that this might be difficult. The pre-exec processing changes process state in three ways prior to invoking the system call: > > * it kills all threads > * it resets some signal handling > * it restores some limits (via setrlimit) > > The only non-recoverable thing in the above list is the killing of threads. I'm not 100% sure why valgrind does this in user code anyway -- the system call itself, as I understand it, will properly terminate all the threads anyway. > Valgrind needs to report all the program termination information at exec time. If it does all the exit-time work to generate that report and the exec fails, it would probably be hard to recover all the state needed to continue (I don't know if its reasonable to require all tools to have non-destructive or reversable exit-time reporting). And if it simply does the exec without reporting, then lots of useful information may be lost. There are also some subtle signal races around exec which are hard to deal with. I seem to remember that when passing control from Valgrind's signal handling to the kernel's there are some tricky issues in making sure that pending signals aren't improperly delivered or dropped on the ground. Oh, yeah, the transition from having a signal handler to SIG_DFL is tricky; if there's a pending signal at exec time, then it needs to remain pending over the exec and then be delivered to the new process with the default handler. I don't think Valgrind gets this quite right now, but it ends up being OK if it can guarantee that exec will succeed once it does all the pulldown. > The second solution would be to remove improve the pre-exec checking beforehand (i.e. don't worry about SUID if trace-children is not in effect. If it is, allow the exec if the effective user id matches the executables effective user id, and only check the format of the executable *if* it is readable). > Allowing exec on SUID for no-trace would be OK, I think. But I'd be a bit worried about simply matching uid/gid[s] for exec - even if they match there are still changes to the process credentials (something subtle related to the handling of effective uid and saved effective uid). And these days, systems using selinux/credentials/etc need much more complex permissions computations. An alternative would be to get the kernel to do the test by running the execve in a controlled environment (fork off a process, run it under ptrace, kill it before it starts running anything), but that's pretty complex. And even then, Valgrind wouldn't know what the security environment of the new process should be. > The third solution would be to allow the user to specify a 'whitelist' of executables that will be execed without checking, so that users can workaround the sometimes-overly-restrictive pre-exec checking. > Possible, but not very pretty. How would you identify the whitelist executables? An equivalent but more generally useful thing would be the ability to list which children you want to trace. > Additionally, in all cases, if an executable is 'execable' but not traceable (e.g. because it is suid, or is not readable), it should be an option to exec that particular child normally (without valgrind tracing). It may be the case that it would be useful to the user to trace those children which are traceable, and simply allow non-traceable children to execute normally. > Yes, doable. J
>Possible, but not very pretty. How would you identify the whitelist >executables? An equivalent but more generally useful thing would be the >ability to list which children you want to trace. I would imagine that you could specify the whitelist in the same way as you could specify the list of children you want to trace. But the approaches aren't *quite* equivalent. The whitelist approach assumes that the precheck is desired even if we don't want to trace the child. This may be true, in some circumstances (e.g. we don't know beforehand what files the application will try to exec, such as when the application is an interactive shell.). The approach where the children to be traced are specified means, I assume, that other children (i.e. other unspecified files that get execed) will not be prechecked (except to see if they have execute permission, which I can't see a reason to not do in all circumstances). In either case (allowing children to be traced to be specified, or allowing certain executables to be 'whitelisted' for execution), the mechanism with which to specify the child could be similar, e.g. valgrind --trace-child=/usr/local/bin/foo ... or valgrind --allow-exec=/usr/local/bin/bar ... or allow the trace-list or whitelist to be read from a file: valgrind --trace-these-children=<file> --------------- With respect to SUID/SGID, is it not the case that if the current real-uid, effective-uid, and saved-uid are the same, and the file to-be-execed is owned by the same uid, then executing the file will have no effect on process authorizations? That's my current understanding. With respect to selinux, I don't *think* that SUID/SGID apply, do they?
> valgrind --trace-child=/usr/local/bin/foo ... > > or > > valgrind --allow-exec=/usr/local/bin/bar ... > > or allow the trace-list or whitelist to be read from a file: > > valgrind --trace-these-children=<file> > The trouble with pathnames is specifying what namespace context the name is evaluated in (cwd for relative paths; chroot and local namespaces for absolute). On the other hand, Valgrind doesn't deal with chroot and local namespaces terribly well anyway. Also, how should this work with fexecve? > With respect to SUID/SGID, is it not the case that if the current real-uid, > effective-uid, and saved-uid are the same, and the file to-be-execed is owned > by the same uid, then executing the file will have no effect on process > authorizations? That's my current understanding. Linux has uid, euid, saved uid, and fsuid. If they all match the suid executable uid, then I think nothing will happen. Simiarly, there's the gid, egid sand saved gid; there's no fsgid, but there is the aux groups. I'm not sure how they fit into the picture. So its all pretty subtle. Alternatively, you could just throw your hands up and add a --sim-hints=ignore-suid-exec to ignore the whole problem and go ahead anyway. > With respect to selinux, I don't *think* that SUID/SGID apply, do they? > I'm not sure, but I think the basic rule that there are no privilege changes (except perhaps narrowing of privilege) without sguid being set is still preserved. Otherwise very unexpected things could happen.
With respect to path issues, you could be as restrictive (the path provided on the command line has to EXACTLY match the path provided in the execve call) to as flexible as possible (e.g. use 'stat' on the original path provided, and 'stat' on the path passed into execve, and thereby determine if the file referenced in both cases is the same). Either way, the facility would probably flexible enough for most purposes. If you did it the 2nd way, I think fexecve would fall out naturally. But the simple option of having an 'ignore-suid-exec' or 'ignore-suid-or-execute-only-exec' option would probably answer for an overwhelming majority of cases.
Situation is improved somewhat by r7175, whose commit message is as follows. Make handling of setuid executables marginally more sensible, as suggested in #119404. Prior to this commit, if the current traced process attempted to execve a setuid executable, an error was always returned. The revised behaviour is: If the current (traced) process attempts to execve a setuid executable: * If --trace-children=yes is not in effect, the execve is allowed. * If --trace-children=yes is in effect, the execve is disallowed (as at present), but an error message is printed (unless in XML mode), so at least the execve does not fail silently any more. As per discussion on #119404 we could probably do a lot better, but these changes are at least simple, useful and uncontroversial.
Greater control of what is traced (whitelists/blacklists) is covered in bug 148932. Given that Julian improved things as comment 14 describes, I'm declaring victory on this one.