Bug 275673

Summary: valgrind: m_libcfile.c:73 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed when valgrinding Wine
Product: [Developer tools] valgrind Reporter: Austin English <austinenglish>
Component: generalAssignee: Julian Seward <jseward>
Status: RESOLVED DUPLICATE    
Severity: crash CC: kde-bugs, philippe.waroquiers, tom
Priority: NOR    
Version: 3.7 SVN   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: patch to help investigate safe_fd exhaustion
valgrind output with patch
readlink output
valgrind output with both patches
strace -f valgrind
Improve support for prlimit64

Description Austin English 2011-06-15 00:21:16 UTC
Version:           3.7 SVN
OS:                Linux

valgrind: m_libcfile.c:73 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
==11126==    at 0x3802988D: report_and_quit (m_libcassert.c:210)
==11126==    by 0x380299E5: vgPlain_assert_fail (m_libcassert.c:284)
==11126==    by 0x3802A7A8: vgPlain_safe_fd (m_libcfile.c:73)
==11126==    by 0x38073428: vgModuleLocal_sema_init (sema.c:60)
==11126==    by 0x38071371: sched_fork_cleanup (scheduler.c:526)
==11126==    by 0x3802D56A: vgPlain_do_atfork_child (m_libcproc.c:749)
==11126==    by 0x38084DA2: vgModuleLocal_do_fork_clone (syswrap-linux.c:369)
==11126==    by 0x380A72C9: vgSysWrap_x86_linux_sys_clone_before (syswrap-x86-linux.c:945)
==11126==    by 0x38074AF8: vgPlain_client_syscall (syswrap-main.c:1498)
==11126==    by 0x38071022: handle_syscall (scheduler.c:952)
==11126==    by 0x38072600: vgPlain_scheduler (scheduler.c:1196)
==11126==    by 0x38084774: run_a_thread_NORETURN (syswrap-linux.c:95)

Reproducible: Always

Steps to Reproduce:
I run the wine conformance tests under wine on Debian Linux (32-bit), with wine from git/valgrind from svn. I updated valgrind a couple days ago (last update was a few weeks ago, though I don't have the svn revision off hand). Now, running the tests crashes with the above assertion.

If a regression test is needed, please let me know. I'll attach the full log from valgrinding wine.

Actual Results:  
Valgrind fails an assertion.

Expected Results:  
Assertion doesn't fail, only wine errors are shown :).

The wine errors, while not ideal, don't seem to be the problem here. I ran the tests against wine git on Monday afternoon, which worked. Monday night I updated valgrind, which started the errors.
Comment 1 Christian Borntraeger 2011-06-15 09:24:27 UTC
On 15/06/11 00:21, Austin English wrote:
> https://bugs.kde.org/show_bug.cgi?id=275673
> 
>            Summary: valgrind: m_libcfile.c:73 (vgPlain_safe_fd): Assertion
>                     'newfd >= VG_(fd_hard_limit)' failed when valgrinding
>                     Wine

That should be the same problem that I have seen. A fix is here:

https://bugs.kde.org/show_bug.cgi?id=214909#c103
(https://bugs.kde.org/attachment.cgi?id=60909)
Comment 2 Tom Hughes 2011-06-15 09:42:57 UTC
Well that may be the problem if the user is running with the gdbserver enabled, yes. If they're not then it won't be.
Comment 3 Philippe Waroquiers 2011-06-15 10:37:10 UTC
(In reply to comment #2)
> Well that may be the problem if the user is running with the gdbserver enabled,
> yes. If they're not then it won't be.
--vgdb arg has 3 values: no/yes/full, default value is yes.
So, it really looks the same bug (unless the wine tests have been started
with --vgdb=yes).

To verify if this is the same problem, the easiest is to try with --vgdb=yes.

A side question: when starting wine under Valgrind with gdbserver activated,
is gdb able to debug wine ?

Philippe
Comment 4 Philippe Waroquiers 2011-06-15 10:45:07 UTC
(In reply to comment #3)

> So, it really looks the same bug (unless the wine tests have been started
> with --vgdb=yes).
Humph, replace by --vgdb=no
 
> To verify if this is the same problem, the easiest is to try with --vgdb=yes.
Re-humph, replace by --vgdb=no
Comment 5 Austin English 2011-06-16 09:41:03 UTC
(In reply to comment #1)
> On 15/06/11 00:21, Austin English wrote:
> > https://bugs.kde.org/show_bug.cgi?id=275673
> > 
> >            Summary: valgrind: m_libcfile.c:73 (vgPlain_safe_fd): Assertion
> >                     'newfd >= VG_(fd_hard_limit)' failed when valgrinding
> >                     Wine
> 
> That should be the same problem that I have seen. A fix is here:
> 
> https://bugs.kde.org/show_bug.cgi?id=214909#c103
> (https://bugs.kde.org/attachment.cgi?id=60909)

Makes no difference here.

(In reply to comment #4)
> (In reply to comment #3)
> 
> > So, it really looks the same bug (unless the wine tests have been started
> > with --vgdb=yes).
> Humph, replace by --vgdb=no
> 
> > To verify if this is the same problem, the easiest is to try with --vgdb=yes.
> Re-humph, replace by --vgdb=no

Same problem.

For reference, valgrind options used were:
export VALGRIND_OPTS="--vgdb=no -q --trace-children=yes --track-origins=yes --gen-suppressions=all --suppressions=$WINESRC/tools/valgrind/valgrind-suppressions --suppressions=$WINESRC/tools/valgrind/gst.supp --leak-check=full --num-callers=20  --workaround-gcc296-bugs=yes --vex-iropt-precise-memory-exns=yes"
Comment 6 Philippe Waroquiers 2011-06-16 09:54:46 UTC
(In reply to comment #5)
> Same problem.
> For reference, valgrind options used were:
> export VALGRIND_OPTS="--vgdb=no -q --trace-children=yes --track-origins=yes
> --gen-suppressions=all
> --suppressions=$WINESRC/tools/valgrind/valgrind-suppressions
> --suppressions=$WINESRC/tools/valgrind/gst.supp --leak-check=full
> --num-callers=20  --workaround-gcc296-bugs=yes
> --vex-iropt-precise-memory-exns=yes"

With the patch and/or with --vgdb=no, it is unlikely that the
problem is caused by gdbserver.

Yesterday evening, I cloned the wine git and compiled/installed wine, and
I have run w32 'sort.exe /?' executable under Valgrind under wine without
problem.

Is there an easy way to reproduce the problem ? (i.e. how to launch
a wine test which produces the above problem).
I can then try to further investigate

Philippe
Comment 7 Austin English 2011-06-16 23:37:48 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Same problem.
> > For reference, valgrind options used were:
> > export VALGRIND_OPTS="--vgdb=no -q --trace-children=yes --track-origins=yes
> > --gen-suppressions=all
> > --suppressions=$WINESRC/tools/valgrind/valgrind-suppressions
> > --suppressions=$WINESRC/tools/valgrind/gst.supp --leak-check=full
> > --num-callers=20  --workaround-gcc296-bugs=yes
> > --vex-iropt-precise-memory-exns=yes"
> 
> With the patch and/or with --vgdb=no, it is unlikely that the
> problem is caused by gdbserver.
> 
> Yesterday evening, I cloned the wine git and compiled/installed wine, and
> I have run w32 'sort.exe /?' executable under Valgrind under wine without
> problem.
> 
> Is there an easy way to reproduce the problem ? (i.e. how to launch
> a wine test which produces the above problem).
> I can then try to further investigate
> 

I was running the full test suite, using this script:
http://code.google.com/p/winezeug/source/browse/trunk/valgrind/valgrind-daily.sh

though for a quick method, you can use something like:
#!/bin/sh
export WINESRC="$HOME/wine-git"
export CFLAGS="-g -O0 -fno-inline"
export VALGRIND_OPTS="-q --trace-children=yes --track-origins=yes --gen-suppressions=all --suppressions=$WINESRC/tools/valgrind/valgrind-suppressions --leak-check=full --num-callers=20  --workaround-gcc296-bugs=yes --vex-iropt-precise-memory-exns=yes"
export WINETEST_WRAPPER=valgrind
export WINE_HEAP_TAIL_REDZONE=32
cd $WINESRC
./configure && make
make -k test
----

I don't know a particular test off hand that has the bug, but if still needed, I can find you one this evening (US/Pacific Time).
Comment 8 Philippe Waroquiers 2011-06-16 23:42:24 UTC
(In reply to comment #7)

> I don't know a particular test off hand that has the bug, but if still needed,
> I can find you one this evening (US/Pacific Time).

If you can point at a specific test, that would be useful.
I will then dig more in details this week-end.
Comment 9 Philippe Waroquiers 2011-06-17 06:17:28 UTC
(In reply to comment #8)

> If you can point at a specific test, that would be useful.
> I will then dig more in details this week-end.

Note that when trying to run the tests, I had a problem because
the suppression file given in the valgrind opts is not present:
philippe@gcc10:~/wine-git$ find . -name '*suppr*'
philippe@gcc10:~/wine-git$ 

(I can download it from http://winezeug.googlecode.com/svn/trunk/valgrind/valgrind-suppressions
but maybe this missing file means I did an error when downloading
and building wine from git).

When you have a specific test reproducing the problem, I will further
investigate.
Comment 10 Austin English 2011-06-17 18:53:22 UTC
(In reply to comment #9)
> (In reply to comment #8)
> 
> > If you can point at a specific test, that would be useful.
> > I will then dig more in details this week-end.
> 
> Note that when trying to run the tests, I had a problem because
> the suppression file given in the valgrind opts is not present:
> philippe@gcc10:~/wine-git$ find . -name '*suppr*'
> philippe@gcc10:~/wine-git$ 
> 
> (I can download it from
> http://winezeug.googlecode.com/svn/trunk/valgrind/valgrind-suppressions
> but maybe this missing file means I did an error when downloading
> and building wine from git).

No, you're right, my apologies for not being more clear.

> When you have a specific test reproducing the problem, I will further
> investigate.

Got one, dlls/advapi32/tests/security.c. To make it slightly easier to reproduce, open two terminals:
Terminal A)
$ cd $HOME/wine-git
$ ./wine winemine

Terminal B)
$ cd $HOME/wine-git/dlls/advapi32/tests
$ . ~/vg-wrapper # The script with those exports I posted earlier
$ make security.ok

Hope that helps. Thanks for looking!
Comment 11 Philippe Waroquiers 2011-06-18 02:02:28 UTC
(In reply to comment #10)

> $ make security.ok
> 
> Hope that helps. Thanks for looking!

I was able to run this test (multiple times, including without
the winemine), but could not reproduce the problem.
I re-introduced the gdbserver bug. Then the gdbserver test reproducing
this bug failed again, but the wine regression test still did not fail.

The way I attacked the gdbserver bug was to put a debug trace in
VG_(safe_fd), outputting the stack trace of valgrind and the stack
trace of all threads.
This helped me to understand where the safe_fd exhaustion was coming from.

For these stack traces, the easiest is 
copy/paste/modify the function __attribute__ ((noreturn))
static void report_and_quit ( const Char* report,
                              UnwindStartRegs* startRegsIN )

in m_libcassert.c so as to obtain a non static function
report_and_continue.

Then if you call this from safe_fd (maybe also print the resulting fd),
you might understand where all this safe_fd are coming from.

You might also need to put some tracing in VG_(close) or similar to
understand more about which ones are left opened.

Sorry I can't help more if problem cannot be reproduced here ...

(if I have more time/courage, I will take a look at extending --track-fds to
also track the valgrind internal FDs, but in the meantime, the above
addition of traces is the only idea I have).

Philippe
Comment 12 Austin English 2011-06-18 05:35:25 UTC
(In reply to comment #11)
> (In reply to comment #10)
> 
> > $ make security.ok
> > 
> > Hope that helps. Thanks for looking!
> 
> I was able to run this test (multiple times, including without
> the winemine), but could not reproduce the problem.
> I re-introduced the gdbserver bug. Then the gdbserver test reproducing
> this bug failed again, but the wine regression test still did not fail.
> 
> The way I attacked the gdbserver bug was to put a debug trace in
> VG_(safe_fd), outputting the stack trace of valgrind and the stack
> trace of all threads.
> This helped me to understand where the safe_fd exhaustion was coming from.
> 
> For these stack traces, the easiest is 
> copy/paste/modify the function __attribute__ ((noreturn))
> static void report_and_quit ( const Char* report,
>                               UnwindStartRegs* startRegsIN )
> 
> in m_libcassert.c so as to obtain a non static function
> report_and_continue.
> 
> Then if you call this from safe_fd (maybe also print the resulting fd),
> you might understand where all this safe_fd are coming from.
> 
> You might also need to put some tracing in VG_(close) or similar to
> understand more about which ones are left opened.
> 
> Sorry I can't help more if problem cannot be reproduced here ...
> 
> (if I have more time/courage, I will take a look at extending --track-fds to
> also track the valgrind internal FDs, but in the meantime, the above
> addition of traces is the only idea I have).
> 
> Philippe

Looking at my logs, it seems to consistently cause wine to hang on ole32/moniker.c, you might have better luck with that test.
Comment 13 Philippe Waroquiers 2011-06-18 09:06:18 UTC
(In reply to comment #12)
> Looking at my logs, it seems to consistently cause wine to hang on
> ole32/moniker.c, you might have better luck with that test.

Executed moniker two ways: with a separate winemine, and a full run (no winemine).
In both cases, it was running under Valgrind, and reached the end without
producing the 'safe_fd' error. (in the second case, there was more Valgrind
processes).

Here is some details about the full run :
philippe@gcc10:~/wine-git/dlls/ole32$ rm moniker.ok; make moniker.ok > r_full.out 2>&1
philippe@gcc10:~/wine-git/dlls/ole32$ tail r_full.out 
   fun:RtlAllocateHeap
   fun:rpcrt4_conn_np_alloc
   fun:RPCRT4_CreateConnection
   fun:rpcrt4_protseq_ncalrpc_open_endpoint
   fun:RPCRT4_use_protseq
   fun:RpcServerUseProtseqEpExW
   fun:RpcServerUseProtseqEpW
   fun:RPCSS_Initialize
   fun:main
}
philippe@gcc10:~/wine-git/dlls/ole32$ grep hard r_full.out 
==4880==    by 0xD61F510: harddisk_driver_entry (device.c:962)
   fun:harddisk_driver_entry
philippe@gcc10:~/wine-git/dlls/ole32$ ls -lrt r*out
-rw-r--r-- 1 philippe philippe 134847 2011-06-18 08:54 r.out
-rw-r--r-- 1 philippe philippe 485231 2011-06-18 08:59 r_full.out
philippe@gcc10:~/wine-git/dlls/ole32$ tail r.out
   fun:RtlAllocateHeap
   fun:rpcrt4_conn_np_alloc
   fun:RPCRT4_CreateConnection
   fun:rpcrt4_protseq_ncalrpc_open_endpoint
   fun:RPCRT4_use_protseq
   fun:RpcServerUseProtseqEpExW
   fun:RpcServerUseProtseqEpW
   fun:RPCSS_Initialize
   fun:main
}
philippe@gcc10:~/wine-git/dlls/ole32$ grep fd_hard r*out
philippe@gcc10:~/wine-git/dlls/ole32$ 

So, still no better idea that adding trace in safe_fd.
Comment 14 Austin English 2011-06-18 09:10:00 UTC
I'm not terribly familiar with the valgrind codebase, I don't suppose you have a diff I could apply and test?
Comment 15 Julian Seward 2011-06-18 09:15:20 UTC
(In reply to comment #12)

> Looking at my logs, it seems to consistently cause wine to hang on
> ole32/moniker.c, you might have better luck with that test.

I don't know if this is related, but I also see builds of recent
Firefox hanging when controlled by GDB.  I can attach in the normal
way, with --vgdb-error=0, place a breakpoint at main, and it does
break there.  However, when I continue it hangs.

After main, firefox forks (but does not exec).  The child process
soon exits with a segfault, and the hang happens after that
(immediately afterwards, I assume).

Does that make any sense?  Could there be some kind of problem
with fork + child exiting with a signal?

Also: is there any possible bad interaction between the GDBserver
functionality and --smc-check=all ?  This flag causes V to check every
SB that is run, to detect cases where the code from which the SB was
made has changed.  In such cases the translation is discarded and
re-made.

I can send exact how-to-repro info if you want.

What it looks like when running --tool=none is:



LD_LIBRARY_PATH=/space2/sewardj/MOZ/MC-09-06-2011-JEM/ff-opt/toolkit/library:/space2/sewardj/MOZ/MC-09-06-2011-JEM/ff-opt/js/src:/space2/sewardj/MOZ/MC-09-06-2011-JEM/ff-opt/dist/lib:
EXE=./ff-opt/dist/bin/firefox-bin -P dev -no-remote

==15876== Nulgrind, the minimal Valgrind tool
==15876== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote.
==15876== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==15876== Command: ./ff-opt/dist/bin/firefox-bin -P dev -no-remote
==15876== 
==15876== (action at startup) vgdb me ... 
==15876== 
==15876== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==15876==   /path/to/gdb ./ff-opt/dist/bin/firefox-bin
==15876== and then give GDB the following command
==15876==   target remote |
/home/sewardj/VgTRUNK/merge/Inst/lib/valgrind/../../bin/vgdb --pid=15876
==15876== --pid is optional if only one valgrind process is running
==15876== 
(so I start GDB in another shell, as directed, and do "b main", then
"cont", then we are waiting at main(), then "cont", and then the below
happens and it hangs:
==15888== 
==15888== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==15888==   /path/to/gdb ./ff-opt/dist/bin/firefox-bin
==15888== and then give GDB the following command
==15888==   target remote |
/home/sewardj/VgTRUNK/merge/Inst/lib/valgrind/../../bin/vgdb --pid=15888
==15888== --pid is optional if only one valgrind process is running
==15888== 
==15888== 
==15888== Process terminating with default action of signal 11 (SIGSEGV)
==15888==  Access not within mapped region at address 0x0
==15888==    at 0x5431593: glxtest() (glxtest.cpp:172)
==15888==    by 0x5431779: fire_glxtest_process() (glxtest.cpp:234)
==15888==    by 0x542881E: XRE_main (nsAppRunner.cpp:2752)
==15888==    by 0x401BA4: main (nsBrowserApp.cpp:162)
==15888==  If you believe this happened as a result of a stack
==15888==  overflow in your program's main thread (unlikely but
==15888==  possible), you can try to increase the size of the
==15888==  main thread stack using the --main-stacksize= flag.
==15888==  The main thread stack size used in this run was 8388608.
==15888==
Comment 16 Philippe Waroquiers 2011-06-18 09:53:55 UTC
Created attachment 61107 [details]
patch to help investigate safe_fd exhaustion

The patch does two things:
Each time safe_fd is called, it prints the oldfd, the newfd, fg_hard_limit
and outputs the scheduler status.

The second thing is: just before exiting due to an internal error, it activates
the gdbserver. This will freeze the process, allowing either
to debug the guest process (probably not very interesting in this case)
or do any other command before the process dies.

What can be done in this case is then:
for f in /proc/8689/fd/*; do echo $f; readlink $f; done

(replace 8689 by the process which is close to die).
this will give information about fd which are not closed.
With the previous tracing, it should give a better idea
about where these safe_fd are coming from.
Comment 17 Philippe Waroquiers 2011-06-18 10:11:46 UTC
(In reply to comment #15)

> I don't know if this is related, but I also see builds of recent
> Firefox hanging when controlled by GDB.  I can attach in the normal
> way, with --vgdb-error=0, place a breakpoint at main, and it does
> break there.  However, when I continue it hangs.

I understand that the problem Austin has is a crash due to safe_fd exhaustion
causing an assert.
(which is what the patch uploaded in comment 14 should help to investigate).
Also, in his case, there is no usage of gdbserver (and problem also happens
with --vgdb=no).

> Does that make any sense?  Could there be some kind of problem
> with fork + child exiting with a signal?
What might happen is the following:
  if a gdb is attached to the parent process firefox gdbserver
  and the child dies
  then IIRC, the parent might receive a signal SIGCHLD
  If you are connected to a gdbserver of the parent, then gdbserver
  should report the signal to gdb, where you have to continue the parent.
  Or possibly the child is waiting for gdb to connect ?

> 
> Also: is there any possible bad interaction between the GDBserver
> functionality and --smc-check=all ?  This flag causes V to check every
> SB that is run, to detect cases where the code from which the SB was
> made has changed.  In such cases the translation is discarded and
> re-made.

For what concerns the interaction between --smc-check=all and gdbserver:
When inserting a break somewhere (or single stepping), the translation will
be discarded by gdbserver to be re-instrumented.
I do not think there is anything special with smc code.
Is it different if you use the new -smc-check value ("all non file") ?

I did not recently tried to debug firefox with Valgrind gdbserver,
I will do that this WE and give feedback.
Comment 18 Austin English 2011-06-18 10:15:35 UTC
Created attachment 61108 [details]
valgrind output with patch
Comment 19 Austin English 2011-06-18 10:16:10 UTC
Created attachment 61109 [details]
readlink output

ps showed two valgrind processes, I've attached the readlink output for both.
Comment 20 Philippe Waroquiers 2011-06-18 13:54:59 UTC
(In reply to comment #19)
> Created an attachment (id=61109) [details]
> readlink output
> 
> ps showed two valgrind processes, I've attached the readlink output for both.

Looking at the readlink output at the moment of the crash for the 2 processes,
This is the situation at crash time (for the safe_fd range):
28814 (busy forking 28823)
1014 /home/austin/wine-git/loader/wine-preloader
1015 /tmp/valgrind_proc_28814_cmdline_d11c5e42 (deleted)
1016 pipe
1017 pipe
1018 pipe
1019 /tmp/vgdb-pipe-from-vgdb-to-28814

28323
1014 /home/austin/wine-git/loader/wine-preloader
1015 /tmp/valgrind_proc_28814_cmdline_d11c5e42 (deleted)
1016 pipe
1019 /tmp/vgdb-pipe-from-vgdb-to-28814

So, in terms of available safe_fd in process 28323, there should be
no problem. If the unix kernel is correct, then at least 1017 and 1018
are free to be used by safe_fd, as they were properly usable by 28814 and
are not marked as busy in 28323.
So, a mystery.

Maybe the    newfd = VG_(fcntl)(oldfd, VKI_F_DUPFD, VG_(fd_hard_limit));
in VG_(safe_fd) fails for whatever strange reason.
One reason might be the max limit of opened fd (very unlikely, as 28323
has about 25 FDs opened only according to readlink output).
In any case, what is the output of ulimit -a ?

Otherwise, assuming your ulimit is high, can you apply the following
patch and re-run ? This should give the errno explaining fcntl failure.

Index: coregrind/m_libcfile.c
===================================================================
--- coregrind/m_libcfile.c	(revision 11822)
+++ coregrind/m_libcfile.c	(working copy)
@@ -383,6 +383,9 @@
 #  else
 #    error "Unknown OS"
 #  endif
+   if (sr_isError(res))
+      VG_(printf) ("fcntl failure fd %d cmd %d err %lldn",
+                   fd, cmd, (Long)sr_Err(res));
    return sr_isError(res) ? -1 : sr_Res(res);
 }
 
(above patch compiled, but no printf executed here).
Alternatively, you might launch valgrind using
strace -f valgrind .....
Comment 21 Austin English 2011-06-18 17:41:44 UTC
(In reply to comment #20)
> Maybe the    newfd = VG_(fcntl)(oldfd, VKI_F_DUPFD, VG_(fd_hard_limit));
> in VG_(safe_fd) fails for whatever strange reason.
> One reason might be the max limit of opened fd (very unlikely, as 28323
> has about 25 FDs opened only according to readlink output).
> In any case, what is the output of ulimit -a ?

austin@debian:~$ uname -a
Linux debian 2.6.38-2-686 #1 SMP Sun May 8 14:49:45 UTC 2011 i686 GNU/Linux
austin@debian:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 6981
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 6981
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
 
> Otherwise, assuming your ulimit is high, can you apply the following
> patch and re-run ? This should give the errno explaining fcntl failure.
> 
> Index: coregrind/m_libcfile.c
> ===================================================================
> --- coregrind/m_libcfile.c    (revision 11822)
> +++ coregrind/m_libcfile.c    (working copy)
> @@ -383,6 +383,9 @@
>  #  else
>  #    error "Unknown OS"
>  #  endif
> +   if (sr_isError(res))
> +      VG_(printf) ("fcntl failure fd %d cmd %d err %lldn",
> +                   fd, cmd, (Long)sr_Err(res));
>     return sr_isError(res) ? -1 : sr_Res(res);
>  }

sched status during safe_fd
safe_fd oldfd 3 newfd 1019 fd_hard_limit 1014
preloader: Warning: failed to reserve range 00110000-68000000
fcntl failure fd 20 cmd 0 err 22n==30665==    at 0x38029992: report_and_continue (m_libcassert.c:264)

...

sched status during safe_fd
safe_fd oldfd 20 newfd -1 fd_hard_limit 1014
fcntl failure fd -1 cmd 2 err 9n
valgrind: m_libcfile.c:76 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.

I'll attach full trace again.
Comment 22 Austin English 2011-06-18 17:42:17 UTC
Created attachment 61118 [details]
valgrind output with both patches
Comment 23 Philippe Waroquiers 2011-06-18 18:27:21 UTC
(In reply to comment #21)

> sched status during safe_fd
> safe_fd oldfd 3 newfd 1019 fd_hard_limit 1014
> preloader: Warning: failed to reserve range 00110000-68000000
> fcntl failure fd 20 cmd 0 err 22n==30665==    at 0x38029992:
> report_and_continue (m_libcassert.c:264)
22 = EINVAL.
According to man fcntl, it means that the 3arg is either
negative or bigger than max allowed ???
But we do not properly see the output (seems to have a mixup
between various processes output).

I guess we will need to go to
strace -f valgrind ....
to have the full set of syscall results before.


> 
> ...
> 
> sched status during safe_fd
> safe_fd oldfd 20 newfd -1 fd_hard_limit 1014
> fcntl failure fd -1 cmd 2 err 9n
> valgrind: m_libcfile.c:76 (vgPlain_safe_fd): Assertion 'newfd >=
> VG_(fd_hard_limit)' failed.

This is the consequence of the dup failure.
Comment 24 Austin English 2011-06-18 18:36:11 UTC
Created attachment 61120 [details]
strace -f valgrind

4 MB uncompressed.

I'll be away from this computer for about a week, but I can try reproducing it on another machine if needed.
Comment 25 Philippe Waroquiers 2011-06-19 10:38:38 UTC
(In reply to comment #24)
> Created an attachment (id=61120) [details]
> strace -f valgrind
> 
> 4 MB uncompressed.
> 
> I'll be away from this computer for about a week, but I can try reproducing it
> on another machine if needed.

From the strace output, it looks like fcntl fails with EINVAL, but
it should not according to the manual.

From the full strace, I cannot understand what is going on.
After some cleanup (e.g. removing calls to sigprocmask, write, etc),
here is a working pipe then fcntl:

strace.txt:10806:clone(Process 2560 attached
strace.txt:10807:child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4199728) = 2560
strace.txt:10809:[pid  2560] gettid()                    = 2560
strace.txt:11838:[pid  2560] close(1017)                 = 0
strace.txt:11839:[pid  2560] close(1018)                 = 0
strace.txt:11840:[pid  2560] pipe([5, 6])                = 0
strace.txt:11841:[pid  2560] fcntl(5, F_DUPFD, 1014)     = 1017
strace.txt:11842:[pid  2560] gettid()                    = 2560
strace.txt:11923:[pid  2560] close(5)                    = 0
strace.txt:11924:[pid  2560] fcntl(1017, F_SETFD, FD_CLOEXEC) = 0
strace.txt:11925:[pid  2560] fcntl(6, F_DUPFD, 1014)     = 1018
strace.txt:11926:[pid  2560] gettid()                    = 2560
strace.txt:12007:[pid  2560] close(6)                    = 0
strace.txt:12008:[pid  2560] fcntl(1018, F_SETFD, FD_CLOEXEC) = 0
strace.txt:12010:[pid  2560] gettid()                    = 2560
strace.txt:12011:[pid  2560] read(1017, "A", 1)          = 1
strace.txt:12012:[pid  2560] close(1019)                 = 0
strace.txt:12014:[pid  2560] mknod("/tmp/vgdb-pipe-from-vgdb-to-2560", S_IFIFO|0666) = 0
strace.txt:12015:[pid  2560] mknod("/tmp/vgdb-pipe-to-vgdb-from-2560", S_IFIFO|0666) = 0
strace.txt:12016:[pid  2560] open("/tmp/vgdb-pipe-shared-mem-vgdb-2560", O_RDWR|O_CREAT, 0666) = 5
....

And here is the failing pipe/fcntl:
strace.txt:60232:[pid  2556] clone(Process 2563 attached
strace.txt:60233:child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x42f7938) = 2563
strace.txt:60235:[pid  2563] gettid()                    = 2563
strace.txt:61322:[pid  2563] close(1017)                 = 0
strace.txt:61323:[pid  2563] close(1018)                 = 0
strace.txt:61324:[pid  2563] pipe([20, 21])              = 0
strace.txt:61325:[pid  2563] fcntl(20, F_DUPFD, 1014)    = -1 EINVAL (Invalid argument)
strace.txt:61327:[pid  2563] gettid()                    = 2563
strace.txt:61408:[pid  2563] fcntl(-1, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
strace.txt:61413:[pid  2563] gettid()                    = 2563
strace.txt:61509:[pid  2563] poll([{fd=1019, events=POLLIN}], 1, -1

man fcntl tells that for a DUP, EINVAL is returned if the 3rd arg is negative
or bigger than the max allowed. The 3rd arg is not negative, and all
the calls in the strace are setting the max no files to 1024. Moreover,
1017 and 1018 have just be closed above.

From the traces captured till now, I have no idea anymore what to get.

Maybe try on another computer ?
Another linux version ?
Maybe do an strace including the other wine processes (i.e.
without winemine on the side) just to see really
nobody does an setrlimit somewhere ?
If the problem originates from a valgrind change, do a bissect to find
the revision introducing the problem ?
Comment 26 Austin English 2011-06-20 01:20:04 UTC
Doesn't occur on my Gentoo machine (also 32-bit) with 2.6.37 with valgrind/svn and wine/git.

I'll be away from that machine for a week, I'll try to get a debian testing machine here and see if it has the same problem. Otherwise, I'll do the bisect next week.

Cheers.
Comment 27 Andrew Nguyen 2011-07-01 01:58:04 UTC
(In reply to comment #25)
> From the traces captured till now, I have no idea anymore what to get.
> 
> Maybe try on another computer ?
> Another linux version ?

I don't experience the assertion that Austin sees with my 32-bit Ubuntu 10.04 machine or a 64-bit Fedora 14 virtual machine. I do see the problem on my 64-bit Fedora 15 machine.

> If the problem originates from a valgrind change, do a bissect to find
> the revision introducing the problem ?

I wasn't able to perform the bisect on my Fedora 15 machine because the configure script in older revisions rejected glibc 2.14, but an educated guess pinpointed revision 11806 as the culprit, and reverting it resolves the assertion failure.
Comment 28 Philippe Waroquiers 2011-07-03 22:02:28 UTC
(In reply to comment #27)

> I wasn't able to perform the bisect on my Fedora 15 machine because the
> configure script in older revisions rejected glibc 2.14, but an educated guess
> pinpointed revision 11806 as the culprit, and reverting it resolves the
> assertion failure.


revision 11806 adds the support for a new system call (prlimit64).

Is this syscall called by wine (to get or set the nr of FDs ?)
Comment 29 Andrew Nguyen 2011-07-13 20:07:20 UTC
(In reply to comment #28)
> (In reply to comment #27)
> 
> > I wasn't able to perform the bisect on my Fedora 15 machine because the
> > configure script in older revisions rejected glibc 2.14, but an educated guess
> > pinpointed revision 11806 as the culprit, and reverting it resolves the
> > assertion failure.
> 
> 
> revision 11806 adds the support for a new system call (prlimit64).
> 
> Is this syscall called by wine (to get or set the nr of FDs ?)

I don't believe it does so directly, to the best of my knowledge. My suspicion is that the call comes from newer glibc.
Comment 30 Matthias Schwarzott 2012-02-12 00:31:11 UTC
Created attachment 68711 [details]
Improve support for prlimit64

This patch improves prlimit64 support.
It tries to make the code work like the existing setrlimit/getrlimit.
The support can still be improved.
Currently arg1 (pid) is completely ignored (and assumed the current process is meant).
Comment 31 Matthias Schwarzott 2012-02-19 21:00:49 UTC
Comment on attachment 68711 [details]
Improve support for prlimit64

If this is caused by bad prlimit64 support, there now is an extra bug for it: Bug 294047
Comment 32 Austin English 2012-09-17 19:46:59 UTC
Still in 3.7.0.
Comment 33 Philippe Waroquiers 2012-09-19 19:09:20 UTC
(In reply to comment #32)
> Still in 3.7.0.
The improvement to prlimit64 was committed only in 3.8.0.
Can you retry with 3.8.0 or 3.8.1 ?
Thanks
Comment 34 Austin English 2012-09-20 01:40:58 UTC
(In reply to comment #33)
> (In reply to comment #32)
> > Still in 3.7.0.
> The improvement to prlimit64 was committed only in 3.8.0.
> Can you retry with 3.8.0 or 3.8.1 ?
> Thanks

Wonderful! Working fine in 3.8.1, thanks!
Comment 35 Philippe Waroquiers 2012-09-24 20:52:12 UTC
duplicate of bug 294047

*** This bug has been marked as a duplicate of bug 294047 ***