Bug 109861 - valgrind freezes on start/exit of valgrind itself
Summary: valgrind freezes on start/exit of valgrind itself
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: 3.0 SVN
Platform: Compiled Sources Linux
Importance: NOR major
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Duplicates: 110301
Depends on:
Blocks:
 
Reported: 2005-07-30 00:46 UTC by Christian Parpart
Modified: 2005-11-08 20:34 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In:


Attachments
$(strace -T valgrind -d) (9.81 KB, text/plain)
2005-07-31 02:24 UTC, Christian Parpart
Details
Patch to restrict client address space to 16Gb (1.34 KB, patch)
2005-07-31 12:46 UTC, Tom Hughes
Details
$(strace -T valgrind-with-16G-patch -d) (9.81 KB, text/plain)
2005-07-31 22:45 UTC, Christian Parpart
Details
$(strace -T valgrind -d) syscall timing results (9.21 KB, text/plain)
2005-08-02 13:08 UTC, Christian Parpart
Details

Description Christian Parpart 2005-07-30 00:46:04 UTC
Hi, just as reported on the list before, this problem still exists.

I consider it critical anyway, as it makes working with Valgrind awful: I just can't wait two minutes until I see any reaction, and then another two minutes until the app exits.

However, we already discovered (on the list) that this problem *might* be caused by the mmap/munmap calls issued by the address space manager.
Comment 1 Christian Parpart 2005-07-30 00:47:34 UTC
I just forgot to mention this :)

$ uname -a
Linux battousai 2.6.12-gentoo-r6 #1 Sat Jul 23 00:48:19 CEST 2005 x86_64 AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD GNU/Linux

What other information do you need?
Comment 2 Tom Hughes 2005-07-30 01:14:46 UTC
Well, an strace with -T should show how long it is spending in each system call, which will confirm the suspected diagnosis.

If the problem is down to the kernel taking a long time to do large mmaps, then there probably isn't much we can do until the address space manager is reworked, which isn't going to happen for the 3.0.0 release, I'm afraid.

It would be interesting to know why your Gentoo 2.6.12 kernel behaves differently to my FC4 2.6.12 kernel on amd64, though.
Comment 3 Christian Parpart 2005-07-31 02:24:50 UTC
Created attachment 12006 [details]
$(strace -T valgrind -d)

this is the output of: strace -T valgrind -d
Comment 4 Tom Hughes 2005-07-31 10:37:50 UTC
Ah, I hadn't realised you were talking about a PIE build... There is a reason we disabled that by default, namely that it doesn't work very well ;-)

The problem comes when valgrind does the big bang shadow memory allocation - it tries to map the entire shadow memory area up front and then later on it just changes the protection on the pages it wants to use.  

It is clearly the kernel causing the problem - on your machine it takes 109 seconds to map the shadow memory. On my FC4 box (running kernel 2.6.12-1.1398_FC4smp) it takes just 0.7 seconds to do the same mapping in a PIE build.

The mapping is large (about 67Tb if I've done my maths right).
Comment 5 Tom Hughes 2005-07-31 11:39:29 UTC
Hmm. I see that the munmap that removes the padding put in place by stage1 also takes 94 seconds on your machine but only about 0.6 seconds on my machine.

The strange thing is that when stage1 pads the address space, which is a mapping about twice the size of the shadow memory, it only takes 0.000005 seconds on your machine. That is a file-backed mapping rather than an anonymous one, though, which might explain it.
Comment 6 Tom Hughes 2005-07-31 12:46:40 UTC
Created attachment 12010 [details]
Patch to restrict client address space to 16Gb

Try this patch - it will restrict the client address space to 16Gb (hopefully
that is enough for you?) which will make the problem maps and unmaps much
smaller and hopefully speed things up.

Because the first 16Gb of memory is the bit which memcheck can handle
efficiently this ought to mean that a PIE build won't be quite as slow either.

It doesn't seem to have had as much effect on the memcheck slowdown as I hoped
though - the test suite takes 2m02 in non-PIE mode and 6m16 in PIE mode with my
patch (versus 8m49 in PIE mode without the patch).
Comment 7 Nicholas Nethercote 2005-07-31 16:41:49 UTC
> It doesn't seem to have had as much effect on the memcheck slowdown as I hoped
> though - the test suite takes 2m02 in non-PIE mode and 6m16 in PIE mode with my
> patch (versus 8m49 in PIE mode without the patch).


Wow, that's a terrible slowdown.  I realise that these are all very 
short-running programs but having about 1.3s worth of start-up time 
devoted just to mapping/unmapping the address space is bad.  All the more 
reason to switch to the chunk-based scheme.
Comment 8 Tom Hughes 2005-07-31 17:16:37 UTC
Well, with my patch to restrict the client address space, the startup time on my machine is much better - the unmap for the client hole takes 0.000342 seconds and the map of the shadow space 0.000225 seconds, so only about 0.5ms in total.

That makes the regtest slowdown all the more weird. I had always assumed that it was much slower in PIE mode due to the auxmaps being used, but with my patch those should not be kicking in at all.
Comment 9 Christian Parpart 2005-07-31 22:45:52 UTC
Created attachment 12025 [details]
$(strace -T valgrind-with-16G-patch -d)

I just tested your proposed patch.

Interestingly, the startup really seems quite fast now (OMG, I like it *g*), but the exit still takes forever.

I patched and tested against vex r1306 and valgrind r4297.
Comment 10 Christian Parpart 2005-07-31 22:49:21 UTC
Arrr, and personally, regarding that "shadow space" topic, I like the idea of not m[un]mapping the whole space up front and instead doing so on demand; however, I'm really far from being a geek regarding valgrind ;-)

Regards,
Christian Parpart.
Comment 11 Julian Seward 2005-08-01 03:08:26 UTC
> Wow, that's a terrible slowdown.  I realise that these are all very
> short-running programs but having about 1.3s worth of start-up time
> devoted just to mapping/unmapping the address space is bad.  All the more
> reason to switch to the chunk-based scheme.


Yes .. getting the address space management sorted out properly is
becoming high priority.

J
Comment 12 Christian Parpart 2005-08-02 13:08:46 UTC
Created attachment 12046 [details]
$(strace -T valgrind -d) syscall timing results

Hi all,

the attachment is a log of strace -T valgrind -d with a recent vex/valgrind
SVN build including the nobigbang patch.

Interestingly, the startup (again) seems quite fast, but the VG exit still
hangs around, playing dead.

Looking at line 120 of the logfile, I still see a munmap(0, $bigbang) = 0 with a timing of 278.4 seconds.

I believe this patch makes the mmap()s somewhat on-demand, but still munmap()s everything at once (in a big bang - even areas that were never mmap()ed).
Comment 13 Tom Hughes 2005-08-02 13:22:29 UTC
That munmap is not the shadow memory; it's making the client hole (line 263 of m_main.c), which is what I talked about before. Strangely, it only took 94s in the previous trace you posted.

Try adding the other patch from this bug as well - that will reduce the size of that unmap to 16Gb.

All of that is startup time though. I suspect the shutdown time is not actually in the strace output and is implicit in the kernel unmapping the process after exit is called.

The real underlying problem here is that your kernel is really bad at doing these large maps/unmaps for some reason when other 2.6.12 kernels cope fine.
Comment 14 Tom Hughes 2005-08-10 15:40:35 UTC
*** Bug 110301 has been marked as a duplicate of this bug. ***
Comment 15 Mihai RUSU 2005-08-13 16:30:53 UTC
Regarding comments #2 and #4, please note that the problem shows up on vanilla kernels too (2.6.12.3 at least). This is not a Gentoo-kernel-only issue. Rather, it seems that the FC kernel has something special that works around this.
Comment 16 Christian Parpart 2005-08-13 22:56:32 UTC
Circling around a little bit: https://bugs.gentoo.org/show_bug.cgi?id=102157

Well, just for the record ;)
Comment 17 Tom Hughes 2005-10-08 15:57:25 UTC
Can you try this with the current SVN code please - the address space management has been completely rewritten and it shouldn't be doing these big mmap/munmap operations anymore.
Comment 18 Julian Seward 2005-10-26 13:35:06 UTC
Christian, can you confirm the status of this?


Comment 19 Nicholas Nethercote 2005-11-08 20:34:47 UTC
I'm going to close this due to lack of response.  Christian, if it's still a problem, please reopen.  If it's fixed, it would be nice if you could confirm for us.  Thanks.