The following code in record_startup_wd():

    /* Simple: just ask the kernel */
    SysRes res;
    SizeT szB = 0;
    do {
       szB += 500;
       startup_wd = VG_(realloc)("startup_wd", startup_wd, szB);
       VG_(memset)(startup_wd, 0, szB);
       res = VG_(do_syscall2)(__NR_getcwd, (UWord)startup_wd, szB-1);
    } while (sr_isError(res));

will loop forever when getcwd() produces any error, because it assumes every failure means the buffer was too small. It should check that the error is ERANGE. Even with that fixed, returning an error (False) makes valgrind refuse to run the program, which isn't really necessary, because the startup_wd often isn't actually used or needed.

Reproducible: Always
Created attachment 101237 [details]
Don't require the current working directory to exist.

At startup valgrind fetches the current working directory and stashes it away to be used later (in debug messages, to read config files, or to create log files). But if the current working directory didn't exist (or there was some other error getting its path), valgrind would go into an endless loop. This was caused by assuming that any error meant a larger buffer was needed to store the cwd path (ERANGE). However, getcwd can fail for other reasons. Fix this by only looping and resizing the buffer when the error is ERANGE; any other error just means we cannot fetch and store the current working directory.

Fix all callers to check whether get_startup_wd() returns NULL. Only abort startup if a relative path needs to be resolved for a user-supplied relative log file. Debug messages will just show "<NO CWD>", and reading config files from the startup_wd is skipped if it doesn't exist. Also add a new testcase that executes valgrind in a non-existing directory (none/tests/nocwd.vgtest).
Ivo and Rhys, could you double-check that my patch does the right thing and that the new nocwd.vgtest passes on Solaris and Darwin?
Will do.
Hi Mark,

I get this compilation warning on Solaris:

    nocwd.c: In function ‘main’:
    nocwd.c:28:3: warning: missing sentinel in function call [-Wformat=]
       execlp ("echo", "echo", "Hello", "World", NULL);

And indeed, the man page for execlp (Linux & Solaris) says it is supposed to be used this way:

    int execlp(const char *file, const char *arg, ... /* (char *) NULL */);

So this simple change fixes the warning:

    - execlp ("echo", "echo", "Hello", "World", NULL);
    + execlp ("echo", "echo", "Hello", "World", (char *) NULL);

As for the test itself, it cannot simulate an invalid cwd on Solaris, because rmdir returns EINVAL:

    EINVAL  The directory to be removed is the current directory,
            or the final component of path is ".".

Is there any other way to simulate an invalid cwd?
(In reply to Ivo Raisr from comment #4)

> So this simple change fixes the warning:
> - execlp ("echo", "echo", "Hello", "World", NULL);
> + execlp ("echo", "echo", "Hello", "World", (char *) NULL);

Thanks, fixed.

> As for the test itself, it cannot simulate an invalid cwd on Solaris,
> because rmdir returns EINVAL:
>     EINVAL  The directory to be removed is the current directory,
>             or the final component of path is ".".
>
> Is there any other way to simulate an invalid cwd?

Maybe the test doesn't make sense for Solaris if you cannot remove the cwd; then perhaps we should move it under none/tests/linux. Can you remove the cwd of another process? Then we could fork/exec another process in the test, move the cwd, and then remove the cwd of its parent process. Execution of echo (with --trace-children) would then get a process without a cwd, as we want. Or is it possible to trigger any other getcwd failure? Maybe chmod the cwd (or its parent dir) to 000? Or create a directory hierarchy deeper than PATH_MAX?
Yes, the cwd of another process can be removed, but usually the process holds a reference to it:

    "If the directory's link count becomes zero and no process has the
    directory open, the space occupied by the directory is freed and the
    directory is no longer accessible. If one or more processes have the
    directory open when the last link is removed, the "." and ".."
    entries, if present, are removed before rmdir() returns and no new
    entries may be created in the directory, but the directory is not
    removed until all references to the directory have been closed."

Can you sketch the other approaches in a bit more detail?
Created attachment 101309 [details]
new nocwd testcase

How about this variant of the test? It tries a couple of things to make sure getcwd will fail: create a path longer than PATH_MAX, make one component inaccessible, and remove the current dir. Although all of these things can be done on a Linux system, there is no error checking in case a different kernel doesn't allow some of them. The testcase should fail (loop indefinitely) with current valgrind, but succeed with the patch. (You can check by running with ./vg-in-place -d --trace-children=yes none/tests/nocwd and searching for "Getting the working directory at startup" - with the patch it will probably say "main ... <NO CWD>".)
Hi Mark,

The new test case works OK (I am able to simulate the problem on Solaris) with 'make one component inaccessible'. Please include "string[s].h" for the proper strlen declaration.

My apologies that it took so long to verify it. I'm a bit overwhelmed these days...
(In reply to Ivo Raisr from comment #8)

> the new test case works ok (I am able to simulate the problem on Solaris)
> with 'make one component inaccessible'.

Thanks for testing. Let's go with this then.

> Please include "string[s].h" for proper strlen declaration.

Added.

> My apologies it took so long to verify it. I'm a bit overwhelmed these
> days...

No need to apologize. We are all busy all the time :)
valgrind svn r15989