Bug 369209 - valgrind loops and eats up all memory (very slowly) when the current working directory doesn't exist.
Summary: valgrind loops and eats up all memory (very slowly) when the current working ...
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.12 SVN
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-22 21:43 UTC by Mark Wielaard
Modified: 2016-10-01 12:02 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Don't require the current working directory to exist. (10.73 KB, patch)
2016-09-22 21:51 UTC, Mark Wielaard
Details
new nocwd testcase (920 bytes, text/plain)
2016-09-26 22:23 UTC, Mark Wielaard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Wielaard 2016-09-22 21:43:25 UTC
The following code in record_startup_wd():

    /* Simple: just ask the kernel */
    SysRes res;
    SizeT szB = 0;
    do {
       szB += 500;
       startup_wd = VG_(realloc)("startup_wd", startup_wd, szB);
       VG_(memset)(startup_wd, 0, szB);
       res = VG_(do_syscall2)(__NR_getcwd, (UWord)startup_wd, szB-1);
    } while (sr_isError(res));

Will loop forever when getcwd() produces any error. It should check that the error is ERANGE.
Even when fixing that and returning an error (False) will make valgrind refuse to run the program.
This isn't really necessary because often the startup_wd isn't really used or needed.

Reproducible: Always
Comment 1 Mark Wielaard 2016-09-22 21:51:09 UTC
Created attachment 101237 [details]
Don't require the current working directory to exist.

At startup valgrind fetches the current working directory and stashes
it way to be used later (in debug messages, read config files or create
log files). But if the current working directory didn't exist (or there
was some other error getting its path) then valgrind would go in a
endless loop. This was caused by assuming that any error meant a larger
buffer needed to be created to store the cwd path (ERANGE). However
there could be other reasons calling getcwd failed.

Fix this by only looping and resizing the buffer when the error is
ERANGE. Any other error just means we cannot fetch and store the current
working directory. Fix all callers to check get_startup_wd() returns
NULL. Only abort startup if a relative path needs to be used for
user supplied relative log files. Debug messages will just show
"<NO CWD>". And skip reading any config files from the startup_wd
if it doesn't exist.

Also add a new testcase that tests executing valgrind in a non-existing
directory (none/tests/nocwd.vgtest).
Comment 2 Mark Wielaard 2016-09-22 21:56:17 UTC
Ivo and Rhys could you double check that my patch does the right thing and the new nocwd.vgtest passes on Solaris and Darwin?
Comment 3 Ivo Raisr 2016-09-23 06:02:57 UTC
Will do.
Comment 4 Ivo Raisr 2016-09-23 10:07:37 UTC
Hi Mark,
I get this compilation warning on Solaris:

nocwd.c: In function ‘main’:
nocwd.c:28:3: warning: missing sentinel in function call [-Wformat=]
   execlp ("echo", "echo", "Hello", "World", NULL);

And indeed, man page for execlp (Linux & Solaris) says it is supposed to be used this way:
int execlp(const char *file, const char *arg, ...
                       /* (char  *) NULL */);

So this simple change fixes the warning:
-   execlp ("echo", "echo", "Hello", "World", NULL);
+  execlp ("echo", "echo", "Hello", "World", (char *) NULL);


As for the test itself, it cannot simulate invalid cwd on Solaris, because rmdir returns EINVAL:
       EINVAL          The  directory  to be removed is the current directory,
                       or the final component of path is ".".

Is there any other way how to simulate invalid cwd?
Comment 5 Mark Wielaard 2016-09-23 19:43:41 UTC
(In reply to Ivo Raisr from comment #4)
> So this simple change fixes the warning:
> -   execlp ("echo", "echo", "Hello", "World", NULL);
> +  execlp ("echo", "echo", "Hello", "World", (char *) NULL);

Thanks, fixed. 

> As for the test itself, it cannot simulate invalid cwd on Solaris, because
> rmdir returns EINVAL:
>        EINVAL          The  directory  to be removed is the current
> directory,
>                        or the final component of path is ".".
> 
> Is there any other way how to simulate invalid cwd?

Maybe the test doesn't make sense for Solaris if you cannot remove the cwd.
Then maybe we should move it under none/tests/linux.

Can you remove the cwd of another process?  Then we could fork/exec another process
in the test, move the cwd and then remove the cwd of its parents process.
Then execution of echo (with --trace-children) will get a process without cwd as we want.

Or is it possible to trigger any other getcwd failure?
Maybe chmod the cwd (or its parent dir) to 000?
Or create a directory hierarchy deeper than PATH_MAX?
Comment 6 Ivo Raisr 2016-09-24 22:27:37 UTC
Yes, cwd of another process can be removed; but as usually the process holds a reference on it:
       "If  the  directory's  link  count  becomes  zero and no process has the
       directory open, the space occupied by the directory is  freed  and  the
       directory  is  no  longer accessible. If one or more processes have the
       directory open when the last link is removed, the "." and ".." entries,
       if  present,  are removed before rmdir() returns and no new entries may
       be created in the directory, but the directory is not removed until all
       references to the directory have been closed."

Can you sketch in a bit more detail other approaches?
Comment 7 Mark Wielaard 2016-09-26 22:23:19 UTC
Created attachment 101309 [details]
new nocwd testcase

How about this variant of the test.
It tries a couple of things to make sure getcwd will fail.
Create a path that is > PATH_MAX, make one component inaccessible and remove the current dir.
Although all these things can be done on a linux system there is no error checking in case some different kernel doesn't allow any of these things.

The testcase should fail (loop indefinitely) with current valgrind, but succeed with the patch.
(You can check by running with ./vg-in-place -d --trace-children=yes none/test/nocwd and searching for "Getting the working directory at startup" - with the patch it will probably say "main ... <NO CWD>".)
Comment 8 Ivo Raisr 2016-09-29 21:03:48 UTC
Hi Mark,
the new test case works ok (I am able to simulate the problem on Solaris) with
'make one component inaccessible'.
Please include "string[s].h" for proper strlen declaration.
My apologies it took so long to verify it. I'm a bit overwhelmed these days...
Comment 9 Mark Wielaard 2016-10-01 11:34:29 UTC
(In reply to Ivo Raisr from comment #8)
> the new test case works ok (I am able to simulate the problem on Solaris)
> with 'make one component inaccessible'.

Thanks for testing. Lets go with this then.

> Please include "string[s].h" for proper strlen declaration.

Added.

> My apologies it took so long to verify it. I'm a bit overwhelmed these
> days...

No need to apologize. We are all busy all the time :)
Comment 10 Mark Wielaard 2016-10-01 12:02:00 UTC
valgrind svn r15989