Bug 333051 - mmap of huge pages fails due to incorrect alignment
Summary: mmap of huge pages fails due to incorrect alignment
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: unspecified Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 339163 (view as bug list)
Depends on:
Blocks:
 
Reported: 2014-04-03 22:18 UTC by Tommy Janjusic
Modified: 2020-01-23 13:09 UTC (History)
5 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
adding another layer on mmap, ignoring the MAP_FIXED argument (804 bytes, patch)
2014-04-03 22:25 UTC, Tommy Janjusic
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tommy Janjusic 2014-04-03 22:18:19 UTC
When an application requests a mmap() to a huge page it seems the generic mmap() handler fails.

The reason it fails is that (at least on my system) the mmap() to a huge page fd, cannot be MAP_FIXED if the huge page starting address is not hugepage_size aligned, and the generic-linux sys wrap always uses MAP_FIXED.

<$uname -a Linux titan 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64 x86_64 x86_64 GNU/Linux>)





Reproducible: Always

Steps to Reproduce:
1.) Compile this simple code
#include <stdio.h>
#include <sys/mman.h>

int main(int argc, char** argv)
{
  int fd;
  char *m;

  fd = open("<path to huge page file; e.g. /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352>",  66, 493 );
  printf ("open result : %d\n", fd);
  unlink ("<path to file>/var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352");
  m = (char*) mmap (0x0, 4194304, 3, 1, fd, 0);
  printf ("mmap result %p\n", m);
  return 0;
}
2. run it under valgrind (with/out strace), you'll see the mmap failing
Actual Results:  
janjust@titan-batch4:~/janjust_proj/tmp$ aprun -n 1 -N 1 ../valgrind-build-titan/bin/valgrind --tool=none ./a.out
==21223== Nulgrind, the minimal Valgrind tool
==21223== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote.
==21223== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==21223== Command: ./a.out
==21223==
open result : 3
mmap result 0xffffffffffffffff
==21223==
Application 4781929 resources: utime ~0s, stime ~0s, Rss ~22540, inblocks ~3278, outblocks ~114
janjust@titan-batch4:~/janjust_proj/tmp$

Expected Results:  
janjust@titan-batch4:~/janjust_proj/tmp$ aprun -n 1 -N 1 ./a.out
open result : 3
mmap result 0x2aaaaae00000 (or some other address)
Application 4781926 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~2532, outblocks ~6606
Comment 1 Tommy Janjusic 2014-04-03 22:25:52 UTC
Created attachment 85950 [details]
adding another layer on mmap, ignoring the MAP_FIXED argument

there could be a better way of handling this by giving another hint to space_manager that we are looking for a particular alignment. for huge pages and MAP_FIXED 4k alignment is not sufficient.

eg.
if(size>8MB) {give me 8MB aligned address}
elseif(size>4MB{give me 4MB aligned address}
elseif(size>2MB{give me 2MB aligned address}
elseif(size>1MB{give me 1MB aligned address}
else just give me 4k aligned….
Comment 2 Tom Hughes 2014-09-18 08:36:17 UTC
*** Bug 339163 has been marked as a duplicate of this bug. ***
Comment 3 Nathan Kurz 2014-09-18 17:33:07 UTC
I applied Tommy's patch to 3.10.0 and it allowed the mmap() of both 2MB and 1GB hugepages to work correctly.  Post-patch results  are below.  See https://bugs.kde.org/show_bug.cgi?id=339163 for the pre-patch results and the testcase used.

nate@haswell:~$ valgrind valgrind_hugepage_mmap_bug 4KB
==2910== Memcheck, a memory error detector
==2910== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==2910== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==2910== Command: valgrind_hugepage_mmap_bug 4KB
==2910==
Opening /dev/zero
Trying mmap() for 4KB pages
Writing 100000 bytes.
Success
==2910==
==2910== HEAP SUMMARY:
==2910==     in use at exit: 0 bytes in 0 blocks
==2910==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==2910==
==2910== All heap blocks were freed -- no leaks are possible
==2910==
==2910== For counts of detected and suppressed errors, rerun with: -v
==2910== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

nate@haswell:~$ valgrind valgrind_hugepage_mmap_bug 2MB
==2911== Memcheck, a memory error detector
==2911== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==2911== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==2911== Command: valgrind_hugepage_mmap_bug 2MB
==2911==
Opening /var/lib/hugetlbfs/global/pagesize-2097152/page
Trying mmap() for 2MB pages
Writing 100000 bytes.
Success
==2911==
==2911== HEAP SUMMARY:
==2911==     in use at exit: 0 bytes in 0 blocks
==2911==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==2911==
==2911== All heap blocks were freed -- no leaks are possible
==2911==
==2911== For counts of detected and suppressed errors, rerun with: -v
==2911== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

nate@haswell:~$ valgrind valgrind_hugepage_mmap_bug 1GB
==2913== Memcheck, a memory error detector
==2913== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==2913== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==2913== Command: valgrind_hugepage_mmap_bug 1GB
==2913==
Opening /var/lib/hugetlbfs/global/pagesize-1073741824/page
Trying mmap() for 1GB pages
Writing 100000 bytes.
Success
==2913==
==2913== HEAP SUMMARY:
==2913==     in use at exit: 0 bytes in 0 blocks
==2913==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==2913==
==2913== All heap blocks were freed -- no leaks are possible
==2913==
==2913== For counts of detected and suppressed errors, rerun with: -v
==2913== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

nate@haswell:~$ strace valgrind valgrind_hugepage_mmap_bug 4KB
...
mmap(0x4026000, 100000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x4026000

nate@haswell:~$ strace valgrind valgrind_hugepage_mmap_bug 2MB
...
mmap(0x4026000, 100000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EINVAL (Invalid argument)
mmap(0x4026000, 100000, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x2aaaaac00000

nate@haswell:~$ strace valgrind valgrind_hugepage_mmap_bug 1GB
...
mmap(0x4026000, 100000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EINVAL (Invalid argument)
mmap(0x4026000, 100000, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x40000000
Comment 4 Tom Hughes 2014-09-18 17:48:05 UTC
Well of course it works. it's just completely wrong because it removes valgrind's control over where things are mapped - the MAP_FIXED is added on purpose, to ensure that things are mapped where valgrind wants them.

The correct fix is for aspacemgr to be told to select an address with the correct alignment, but that requires us being able to tell that we are dealing with huge pages.
Comment 5 Tommy Janjusic 2014-09-18 17:55:43 UTC
(In reply to Tom Hughes from comment #4)


> Well of course it works. it's just completely wrong because it removes
> valgrind's control over where things are mapped - the MAP_FIXED is added on
> purpose, to ensure that things are mapped where valgrind wants them.
> 
> The correct fix is for aspacemgr to be told to select an address with the
> correct alignment, but that requires us being able to tell that we are
> dealing with huge pages.

I've noted this in my original comment I thought...

>adding another layer on mmap, ignoring the MAP_FIXED argument
>there could be a better way of handling this by giving another hint to space_manager that we are >looking for a particular alignment. for huge pages and MAP_FIXED 4k alignment is not sufficient.

I was told people are reluctant to fiddle with aspacemgr :), so this worked for me, thought maybe others may find it useful as well.
Speaking of, are there any immediate consequences that you can foresee of taking away control over huge_pages from Valgrind?
Comment 6 Nathan Kurz 2014-09-18 18:15:19 UTC
*Well of course it works.*

From a user's point of view, it's not clear that this is always a negative.

If the issue is that Valgrind is unable to correctly monitor the page,
perhaps there could be a warning to this effect?  If the issue is that
performance is worse, I'd suggest this is better than not being able
to run at all.  Using MAP_HUGETLB if available to decide the correct
alignment would be fine, but it would be preferable to also support
allocations from hugetlbfs, since on older Linux kernels this is the
only way to get access to 1GB pages. If relying on MAP_HUGETLB, this
might be a useful testcase:
http://lxr.free-electrons.com/source/tools/testing/selftests/vm/thuge-gen.c,
as it tests both 2MB and 1GB hugepages.

On Thu, Sep 18, 2014 at 10:48 AM, Tom Hughes <tom@compton.nu> wrote:
> https://bugs.kde.org/show_bug.cgi?id=333051
>
> --- Comment #4 from Tom Hughes <tom@compton.nu> ---
> Well of course it works. it's just completely wrong because it removes
> valgrind's control over where things are mapped - the MAP_FIXED is added on
> purpose, to ensure that things are mapped where valgrind wants them.
>
> The correct fix is for aspacemgr to be told to select an address with the
> correct alignment, but that requires us being able to tell that we are dealing
> with huge pages.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment 7 Philippe Waroquiers 2014-09-19 08:53:02 UTC
(In reply to Tom Hughes from comment #4)
> Well of course it works. it's just completely wrong because it removes
> valgrind's control over where things are mapped - the MAP_FIXED is added on
> purpose, to ensure that things are mapped where valgrind wants them.
I do not think it is completely wrong, even if not ideal.
As I understand, the aspacemgr tries to control where things are mapped
mostly for efficiency reasons on 64 bits platforms, due to the memcheck address
maps, which becomes a lot less efficient once some 'high addresses' are mapped.

With the patch, if the kernel did a mapping at a certain address, then normally, the
aspacemgr data structures should indicate that this zone is available for the client.
So, to as a last resort let the kernel decide of the mapping is somewhat similar
to have a good advice about where to do the mapping.
If then it is checked that the mapping is in fact acceptable (ie. is effectively at
a free place, usable by the client), then that will solve the problem an easy way.
Note that a similar trick (retry without the MMAP_FIXED) is already used
in aspacemgr for darwin and for the outer/inner setup.
Note that the aspacemgr trick is really a kludge, as this in fact bypasses an incorrect
knowledge of the mapping by the inner valgrind (due to the outer valgrind having
done a mmap not observed by the inner valgrind).
In the case of syswrap-generic.c, I however think that the aspacemgr state will be
fully correct after a successful (and verified as ok for the client) mapping done by the kernel.
What the attached patch misses is a verification that the resulting mmap is effectively ok.
See also the discussion in http://sourceforge.net/p/valgrind/mailman/message/32167895/
and following posts.

> 
> The correct fix is for aspacemgr to be told to select an address with the
> correct alignment, but that requires us being able to tell that we are
> dealing with huge pages.
That would effectively be better, but implies to learn to valgrind the
concept of hugepages and/or huge page file systems and/or ... (whatever the way
this is working).
The trick to "ask the kernel an advice and do the mmap at the same time" seems 
a lot simpler/portable, and looks good enough (assuming I have not missed a problem
with the approach of retrying without MMAP_FIXED)
Comment 8 Philippe Waroquiers 2014-11-01 21:03:10 UTC
Fixed in revision 14682