246152 – callgrind internal error after pthread_cancel on 32 Bit Linux

Bug 246152 - callgrind internal error after pthread_cancel on 32 Bit Linux

Summary: callgrind internal error after pthread_cancel on 32 Bit Linux

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	callgrind (show other bugs)
Version:	3.6 SVN
Platform:	Ubuntu Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Josef Weidendorfer

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-07-29 14:15 UTC by walter.stocker
Modified:	2011-02-08 01:13 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed In:

Attachments
testcase (1.11 KB, text/plain) 2010-07-29 14:30 UTC, walter.stocker	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description walter.stocker 2010-07-29 14:15:37 UTC

Version:           3.6 SVN
OS:                Linux

The system where this failure occurs is kubuntu-10.04 (lucid) 32 Bit
Linux 2.6.32-21-generic #32-Ubuntu SMP Fri Apr 16 08:10:02 UTC 2010 i686 GNU/Linux
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
/lib/tls/i686/cmov/libc.so.6 -> libc-2.11.1.so
valgrind-3.6.0.SVN-Debian

valgrind --tool=callgrind finishes with an internal error message (see below), and with an empty callgrind.out file after the pthread_cancel call on the 32 Bit system.

Other valgrind tools (memcheck, cachegrind, helgrind) run through without failure.

On a 64 Bit system (also kubuntu-10.04) the callgrind works as expected (see below in the expected results) - i.e. the failure occurs only on the 32 Bit system.
The 64 Bit system used to test:
Linux 2.6.32-21-generic #32-Ubuntu SMP Fri Apr 16 08:09:38 UTC 2010 x86_64 GNU/Linux
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
/lib/libc.so.6 -> libc-2.11.1.so
valgrind-3.6.0.SVN-Debian


See the attached testcase (a C source file), which is the minimal testcase I came up with: main starts a thread, which waits on a condition variable and then main cancels the thread.

Reproducible: Always

Steps to Reproduce:
# build the testcase on a 32 Bit system
gcc -g -lpthread -o threads threads.c
# run callgrind
valgrind --tool=callgrind ./threads


Actual Results:  
==19820== Callgrind, a call-graph generating cache profiler
==19820== Copyright (C) 2002-2009, and GNU GPL'd, by Josef Weidendorfer et al.
==19820== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==19820== Command: ./threads
==19820==
==19820== For interactive control, run 'callgrind_control -h'.
create thread
-- thread waiting on condition
cancel thread
==19820==
==19820== Process terminating with default action of signal 11 (SIGSEGV)
==19820==  Access not within mapped region at address 0xEA310920
==19820==    at 0x41160A9: madvise (syscall-template.S:82)
==19820==    by 0x4119A0D: clone (clone.S:130)
==19820==  If you believe this happened as a result of a stack
==19820==  overflow in your program's main thread (unlikely but
==19820==  possible), you can try to increase the size of the
==19820==  main thread stack using the --main-stacksize= flag.
==19820==  The main thread stack size used in this run was 8388608.
==19820==
--19820-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--19820-- si_code=1;  Faulting address: 0x10;  sp: 0x62a9e04c

valgrind: m_signals.c:2332 (sync_signalhandler_from_kernel): Assertion 'tid != 0' failed.
==19820==    at 0x3801D5A5: report_and_quit (m_libcassert.c:191)

sched status:
  running_tid=1

Thread 2: status = VgTs_Zombie
==19820==    at 0x41160A9: madvise (syscall-template.S:82)
==19820==    by 0x4119A0D: clone (clone.S:130)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.


Expected Results:  
==13145== Callgrind, a call-graph generating cache profiler
==13145== Copyright (C) 2002-2009, and GNU GPL'd, by Josef Weidendorfer et al.
==13145== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==13145== Command: ./threads
==13145==
==13145== For interactive control, run 'callgrind_control -h'.
create thread
-- thread waiting on condition
cancel thread
thread joined
==13145==
==13145== Events    : Ir
==13145== Collected : 269139
==13145==
==13145== I   refs:      269,139

Comment 1 walter.stocker 2010-07-29 14:30:03 UTC

Created attachment 49635 [details]
testcase

Comment 2 Josef Weidendorfer 2010-07-29 15:03:42 UTC

Hmm. I can not reproduce this on a 32bit openSUSE 11.3
(glibc 2.11.2, gcc 4.5.0) with current trunk (r11239).

Can you try to reproduce with a self-compiled version of current
trunk?

Comment 3 walter.stocker 2010-07-29 16:45:23 UTC

I followed your suggestion and fetched the current trunk, built it and tried it out.
The result is the same internal error:

==13909== Callgrind, a call-graph generating cache profiler
==13909== Copyright (C) 2002-2010, and GNU GPL'd, by Josef Weidendorfer et al.
==13909== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==13909== Command: ./threads
==13909==
==13909== For interactive control, run 'callgrind_control -h'.
create thread
-- thread waiting on condition
cancel thread
==13909==
==13909== Process terminating with default action of signal 11 (SIGSEGV)
==13909==  Access not within mapped region at address 0xE9AFA568
==13909==    at 0x41160A9: madvise (syscall-template.S:82)
==13909==    by 0x4119A0D: clone (clone.S:130)
==13909==  If you believe this happened as a result of a stack
==13909==  overflow in your program's main thread (unlikely but
==13909==  possible), you can try to increase the size of the
==13909==  main thread stack using the --main-stacksize= flag.
==13909==  The main thread stack size used in this run was 8388608.
==13909==
--13909-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--13909-- si_code=1;  Faulting address: 0x10;  sp: 0x629eff80

valgrind: m_signals.c:2327 (sync_signalhandler_from_kernel): Assertion 'tid != 0' failed.
==13909==    at 0x3801E495: report_and_quit (m_libcassert.c:191)

sched status:
  running_tid=1


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Comment 4 walter.stocker 2010-07-29 17:36:33 UTC

Some more experiments:

The libc and libpthread used are from the directory /lib/tls/i686/cmov/ as ldd shows (from the package libc6-i686 - a glibc optimized for i686), but kubuntu also installs (different) libc and libpthread in /lib (from package libc6).

I set up a chroot environment with a copy of libc and libpthread from /lib and copied the earlier built valgrind trunk version together with the testcase into it.

In this chroot the callgrind run was successfull!

So this has something to do with the specific libc/libpthread versions in the libc6-i686 package of kubuntu-10.04 (installed in /lib/tls/i686/cmov/)

Comment 5 Julian Seward 2010-10-04 22:12:29 UTC

Do we want to investigate/fix this for 3.6.0 ?  It sounds pretty
obscure to me, considering that we can't reproduce it.

Comment 6 Josef Weidendorfer 2010-10-05 00:18:24 UTC

Hmm.. the only option would be to try to reproduce it by installing
(k)ubuntu-10.04. Does somebody perhaps have a already installed VMWare
image?

[Unfortunately I am quite loaded with other work at the moment, so...]

Comment 7 Peter Maydell 2010-10-05 00:31:37 UTC

FWIW, I can reproduce this on my 32 bit lucid install with current svn (r11395, VEX r2059).

Comment 8 Julian Seward 2010-10-05 22:35:44 UTC

I can reproduce this on 64-bit Ubuntu 10.04, when the program
is built with -m32.  Investigating.

Comment 9 Julian Seward 2010-10-05 23:55:55 UTC

This is a bit difficult because of the various signals, threads, stack
unwinding, and use of segment registers.  The combination of these
confuses GDB somewhat.

Anyway .. it appears to be segfaulting at callgrind/bbcc.c:832:

    CLG_ASSERT(bbcc->rec_array[bbcc->rec_index] == bbcc);

Putting this on the line before

    VG_(printf)("bbcc->rec_index = %u\n", (UInt)bbcc->rec_index);

and re-running, gives many lines like this

  bbcc->rec_index = 0

sometimes the value is 1 or 2, until just before the segfault:

  bbcc->rec_index = 68534088

which doesn't seem like a valid array index.

---

The history of the thread is:

* SIGRT0 is delivered to it (on the guest CPU)

* it (the signal handler) executes many instructions on the guest CPU,
  which unwind the stack.  The handler does not return; I guess
  it is longjmping or something like that (unwinding the stack
  in small steps, maybe?)

* I think this is happening after the unwind has finished

I'll put details of the exact block in the next comment.

Josef, does this make any sense to you?  Any other info I can
collect?  I have a vague memory that there is/was some other
Callgrind-unwind related segfault bug somewhere .. is that so?

Comment 10 Julian Seward 2010-10-05 23:59:56 UTC

This is where the guest is when the fault happens.  (At least, this is
where I _think_ it is :-)

bbcc->rec_index = 0
bbcc->rec_index = 0
bbcc->rec_index = 0
bbcc->rec_index = 0
bbcc->rec_index = 0
==== SB 2899 [tid 2] madvise(0x48b81e0) SBs exec'd 62349 ====

------------------------ Front end ------------------------

        0x48B81E0:  pushl %ebx

              ------ IMark(0x48B81E0, 1) ------
              t0 = GET:I32(12)
              t1 = Sub32(GET:I32(16),0x4:I32)
              PUT(16) = t1
              STle(t1) = t0

        0x48B81E1:  movl 16(%esp,,),%edx

              ------ IMark(0x48B81E1, 4) ------
              PUT(60) = 0x48B81E1:I32
              t2 = Add32(GET:I32(16),0x10:I32)
              PUT(8) = LDle:I32(t2)

        0x48B81E5:  movl 12(%esp,,),%ecx

              ------ IMark(0x48B81E5, 4) ------
              PUT(60) = 0x48B81E5:I32
              t3 = Add32(GET:I32(16),0xC:I32)
              PUT(4) = LDle:I32(t3)

        0x48B81E9:  movl 8(%esp,,),%ebx

              ------ IMark(0x48B81E9, 4) ------
              PUT(60) = 0x48B81E9:I32
              t4 = Add32(GET:I32(16),0x8:I32)
              PUT(12) = LDle:I32(t4)

        0x48B81ED:  movl $0xDB,%eax

              ------ IMark(0x48B81ED, 5) ------
              PUT(60) = 0x48B81ED:I32
              PUT(0) = 0xDB:I32

        0x48B81F2:  call*l %gs:(0x10)

              ------ IMark(0x48B81F2, 7) ------
              PUT(60) = 0x48B81F2:I32
              t6 = 16Uto32(GET:I16(288))
              t7 = GET:I32(292)
              t8 = GET:I32(296)
              t9 = x86g_use_seg_selector{0x380cb150}(t7,t8,t6,0x10:I32):I64
              if (CmpNE32(64HIto32(t9),0x0:I32)) goto {MapFail} 0x48B81F2:I32
              t10 = 64to32(t9)
              t5 = LDle:I32(t10)
              t11 = Sub32(GET:I32(16),0x4:I32)
              PUT(16) = t11
              STle(t11) = 0x48B81F9:I32
              goto {Call} t5

GuestBytes 48B81E0 25  53 8B 54 24 10 8B 4C 24 0C 8B 5C 24 08 B8 DB 00 00 00 65 FF 15 10 00 00 00  037C2B70


------------------------  After tree-building ------------------------

IRSB {
   t0:I32   t1:I32   t2:I32   t3:I32   t4:I32   t5:I32   t6:I32   t7:I32
   t8:I32   t9:I64   t10:I32   t11:I32   t12:I32   t13:I32   t14:I32   t15:I32
   t16:I32   t17:I32   t18:I32   t19:I32   t20:I32   t21:I32   t22:I32   t23:I32
   t24:I16   t25:I64   t26:I1   t27:I32   t28:I32   t29:I32   t30:I32   
   DIRTY 1:I1 ::: setup_bbcc[rp=1]{0x38001c90}(0x4203B58:I32)
   ------ IMark(0x48B81E0, 1) ------
   t12 = Sub32(GET:I32(16),0x4:I32)
   t0 = GET:I32(12)
   PUT(16) = t12
   STle(t12) = t0
   ------ IMark(0x48B81E1, 4) ------
   PUT(60) = 0x48B81E1:I32
   PUT(8) = LDle:I32(Add32(GET:I32(16),0x10:I32))
   ------ IMark(0x48B81E5, 4) ------
   PUT(60) = 0x48B81E5:I32
   PUT(4) = LDle:I32(Add32(GET:I32(16),0xC:I32))
   ------ IMark(0x48B81E9, 4) ------
   PUT(60) = 0x48B81E9:I32
   PUT(12) = LDle:I32(Add32(GET:I32(16),0x8:I32))
   ------ IMark(0x48B81ED, 5) ------
   PUT(60) = 0x48B81ED:I32
   PUT(0) = 0xDB:I32
   ------ IMark(0x48B81F2, 7) ------
   PUT(60) = 0x48B81F2:I32
   t25 = x86g_use_seg_selector{0x380cb150}(GET:I32(292),GET:I32(296),16Uto32(GET:I16(288)),0x10:I32):I64
   STle(0x38872854:I32) = 0x0:I32
   if (CmpNE32(64HIto32(t25),0x0:I32)) goto {MapFail} 0x48B81F2:I32
   t29 = Sub32(GET:I32(16),0x4:I32)
   t5 = LDle:I32(64to32(t25))
   PUT(16) = t29
   STle(t29) = 0x48B81F9:I32
   STle(0x38872854:I32) = 0x1:I32
   goto {Call} t5
}

bbcc->rec_index = 0
bbcc->rec_index = 68534088
--3017-- sync signal handler: signal=11, si_code=1, EIP=0x4400830, eip=0x380020f9, from kernel
--3017-- SIGSEGV: si_code=1 faultaddr=0x1476f328 tid=2 ESP=0x7ba53a4 seg=NULL

Note: "eip=0x380020f9" tells us where in callgrind-x86-linux the
segfault is.

Comment 11 Josef Weidendorfer 2010-10-06 01:39:53 UTC

> sometimes the value is 1 or 2, until just before the segfault:
>  bbcc->rec_index = 68534088
> which doesn't seem like a valid array index.

Yup. It looks like bbcc->rec_index somehow is not
initialized. However, I just rechecked all the paths allocating
a BBCC struct (get_bbcc/clone_bbcc), and all set this field.

It could be some array over/underrun. Hmm. rec_index is
in the middle of the struct.

You can add the following assertion before the assertion with
the segfault:
  CLG_ASSERT(idx == bbcc->rec_index);
I suppose that this will fail exactly before the segfault.

It would be good to see the values of
* idx
* top->separate_recursions
* top->name
* bbcc->rec_array[0]->cxt->fn[0]->name
directly before the failed assertion.

If the last two strings (function names) are different, I
think I know the problem...

Comment 12 Julian Seward 2010-10-06 02:04:40 UTC

(In reply to comment #11)
> You can add the following assertion before the assertion with
> the segfault:
>   CLG_ASSERT(idx == bbcc->rec_index);
> I suppose that this will fail exactly before the segfault.

Yes, it does.

> It would be good to see the values of
> * idx
> * top->separate_recursions
> * top->name
> * bbcc->rec_array[0]->cxt->fn[0]->name
> directly before the failed assertion.

I can get the first 3, but the last one fails because ->cxt does not
seem like a valid address, so cxt->fn[0] segfaults:

  bbcc->rec_index = 68533904
    idx      = 0
    top->s_r = 2
    top->name = _dl_sysinfo_int80
    bbcc->rec_array = 0x41FF468
    bbcc->rec_array[0]->cxt = 0x593E

Comment 13 Julian Seward 2010-10-06 02:17:31 UTC

(In reply to comment #12)
>     top->name = _dl_sysinfo_int80

Hmm, _dl_sysinfo_int80 is treated specially by the x86-linux stack
unwinder.  Does that maybe have significance here?  See comment at
m_stacktrace.c:698 for what is going on.

-----

One thing that's strange with the crash is that, this diff

  --- callgrind/bbcc.c    (revision 11395)
  +++ callgrind/bbcc.c    (working copy)
  @@ -829,6 +829,15 @@
       else
         bbcc = clone_bbcc(bbcc, CLG_(current_state).cxt, idx);
 
  +    VG_(printf)("bbcc->rec_index = %u\n", (UInt)bbcc->rec_index);
  +    if (!(idx == bbcc->rec_index)) {
  +       VG_(printf)("  idx      = %llu\n", (ULong)idx);
  +       VG_(printf)("  top->s_r = %llu\n", (ULong)top->separate_recursions);
  +       VG_(printf)("  top->name = %s\n", top->name);
  +       VG_(printf)("  bbcc->rec_array = %p\n", bbcc->rec_array);
  +       VG_(printf)("  bbcc->rec_array[0]->cxt = %p\n", bbcc->rec_array[0]->cxt);
  +    }
  +    CLG_ASSERT(idx == bbcc->rec_index);
       CLG_ASSERT(bbcc->rec_array[bbcc->rec_index] == bbcc);
     }

produces this output

  bbcc->rec_index = 0
  bbcc->rec_index = 68533904
    idx      = 0
    top->s_r = 2
    top->name = _dl_sysinfo_int80
    bbcc->rec_array = 0x41FF468
    bbcc->rec_array[0]->cxt = 0x593E
  _dl_sysinfo_int80
  BB# 62350

  Callgrind: bbcc.c:840 (vgCallgrind_setup_bbcc): Assertion 'idx == bbcc->rec_index' failed.
  ==7599==    at 0x3801E5DD: report_and_quit (m_libcassert.c:193)
  ==7599==    by 0x3801E7A7: vgPlain_assert_fail (m_libcassert.c:267)
  ==7599==    by 0x38002EFD: vgCallgrind_setup_bbcc (bbcc.c:840)
  ==7599==    by 0x501D563: ???

  sched status:
    running_tid=2

  Thread 1: status = VgTs_WaitSys
  ==7599==    at 0x47D8B5D: pthread_join (pthread_join.c:89)
  ==7599==    by 0x804875E: main (cancelcrash.c:37)
  
  Thread 2: status = VgTs_Runnable
  ==7599==    at 0x48B81F9: madvise (syscall-template.S:82)
  ==7599==    by 0x48BBB5D: clone (clone.S:130)

What is strange is, where did these 2 lines

  _dl_sysinfo_int80
  BB# 62350

come from?  I didn't ask for them to be printed.

Comment 14 Josef Weidendorfer 2010-10-06 02:39:55 UTC

Oh wow, remote printf debugging is complicated.

> What is strange is, where did these 2 lines
>
>  _dl_sysinfo_int80
>  BB# 62350
>
> come from?  I didn't ask for them to be printed.

That is output from CLG_ASSERT, to help debugging.

Anyway, the BBCC structure seems to be corrupted, and my
assumption about the problem was wrong. You should be
able to verify that
  bbcc == bbcc->rec_array[0]
and bbcc->rec_array[0]->cxt is just another field in the
corrupted data :(

From comment #9:
> I have a vague memory that there is/was some other
> Callgrind-unwind related segfault bug somewhere .. is that so?

Yeah, there is one with running wine in Callgrind :(
That one also seems to have something to do with longjumping...

Comment 15 Josef Weidendorfer 2010-10-06 03:09:10 UTC

I just installed Ubuntu 10.04 in VMPlayer, and I can reproduce the failure...

Comment 16 Julian Seward 2011-01-10 20:43:31 UTC

Josef, any progress on this?

Comment 17 Josef Weidendorfer 2011-02-03 01:01:25 UTC

Argh... I can reproduce it on the command line, but not inside of gdb :(
Any tips? It just hangs after

 ...
 -- thread waiting on condition
 cancel thread
 [New LWP 16225]

The last line is from gdb.

Anyway, if I run with --separate-recs=0 (the default is 2), the bug does
not appear. That is expected, as the original line which triggered the
bug is not executed.
However, using --separate-recs=1 produces another segfault when dumping the
profile file. So, obviously, there is data corruption happening somewhere,
and this triggers the segfault later.

I suppose using hardware breakpoints to observe access to the address of
the data that is getting corrupted would be good. But for that I need a
native Ubuntu installation.

Or inserting a regular check for the structures to be sane, as you do.
Now this bug worries me...

Comment 18 Julian Seward 2011-02-03 01:33:21 UTC

(In reply to comment #17)
> Argh... I can reproduce it on the command line, but not inside of gdb :(
> Any tips? It just hangs after

Urr.  I see the same (reproduces on cmdline, but hangs when I attach GDB
to the process) even in a native Ubuntu 10.04.1.  So it's not a VM related
problem.  (+ I thought h/w watchpoints work ok in more recent vmwares, no?)


> Or inserting a regular check for the structures to be sane,

That might be the easiest and least stressful approach, plus it seems
like you don't have many other options now :)

Comment 19 Julian Seward 2011-02-03 10:23:21 UTC

I tried the patch below, which adds two 32-bit magic numbers to struct _BBCC,
so I could catch cases where a BBCC* obviously does not point at a valid BBCC.

With this in place, it can be made to assert rather than segfault:

IDX = 0
_dl_sysinfo_int80
BB# 60646

Callgrind: bbcc.c:840 (vgCallgrind_setup_bbcc): Assertion 'bbcc->magic1 == 0x31415927' failed.
==8576==    at 0x3801EB5D: report_and_quit (m_libcassert.c:193)
==8576==    by 0x3801ED27: vgPlain_assert_fail (m_libcassert.c:267)
==8576==    by 0x38003246: vgCallgrind_setup_bbcc (bbcc.c:840)
==8576==    by 0x502364B: ???

where:

    CLG_ASSERT(bbcc->magic1 == 0x31415927);
    CLG_ASSERT(bbcc->magic2 == 0x17724539);
    idx = level -1;
    if (bbcc->rec_array[idx]) {
      VG_(printf)("IDX = %u\n", idx);
      bbcc = bbcc->rec_array[idx];
      CLG_ASSERT(bbcc->magic1 == 0x31415927); <----- LINE 840
      CLG_ASSERT(bbcc->magic2 == 0x17724539);
    } else {
      bbcc = clone_bbcc(bbcc, CLG_(current_state).cxt, idx);
      CLG_ASSERT(bbcc->magic1 == 0x31415927);
      CLG_ASSERT(bbcc->magic2 == 0x17724539);
    }
    VG_(printf)("QQQ %u\n", bbcc->rec_index);
    CLG_ASSERT(bbcc->magic1 == 0x31415927);
    CLG_ASSERT(bbcc->magic2 == 0x17724539);
    CLG_ASSERT(bbcc->rec_array[bbcc->rec_index] == bbcc);
  }

So it seems like the bbcc before this if (..) is OK, but the value that
comes from bbcc->rec_array[0] is invalid.  (I know IDX == 0 from the
debug printing just before the assert.)  I didn't chase it any further.

Patch:

Index: callgrind/global.h
===================================================================
--- callgrind/global.h	(revision 11516)
+++ callgrind/global.h	(working copy)
@@ -385,6 +385,7 @@
  * They are distinguishable by their tag field.
  */
 struct _BBCC {
+  UInt magic1;
     BB*      bb;           /* BB for this cost center */
 
     Context* cxt;          /* execution context of this BBCC */
@@ -406,6 +407,7 @@
     BBCC*    next;         /* entry chain in hash */
     ULong*   cost;         /* start of 64bit costs for this BBCC */
     ULong    ecounter_sum; /* execution counter for first instruction of BB */
+  UInt magic2;
     JmpData  jmp[0];
 };
 
Index: callgrind/bbcc.c
===================================================================
--- callgrind/bbcc.c	(revision 11516)
+++ callgrind/bbcc.c	(working copy)
@@ -276,6 +276,9 @@
    bbcc = (BBCC*)CLG_MALLOC("cl.bbcc.nb.1",
 			    sizeof(BBCC) +
 			    (bb->cjmp_count+1) * sizeof(JmpData));
+   VG_(memset)(bbcc, 0, sizeof(BBCC) + (bb->cjmp_count+1) * sizeof(JmpData));
+   bbcc->magic1 = 0x31415927;
+   bbcc->magic2 = 0x17724539;
    bbcc->bb  = bb;
    bbcc->tid = CLG_(current_tid);
 
@@ -578,6 +581,11 @@
   last_bbcc = CLG_(current_state).bbcc;
   last_bb = last_bbcc ? last_bbcc->bb : 0;
 
+  if (last_bbcc) {
+    CLG_ASSERT(last_bbcc->magic1 == 0x31415927);
+    CLG_ASSERT(last_bbcc->magic2 == 0x17724539);
+  }
+
   if (last_bb) {
       passed = CLG_(current_state).jmps_passed;
       CLG_ASSERT(passed <= last_bb->cjmp_count);
@@ -823,12 +831,22 @@
       *CLG_(get_fn_entry)(top->number) = 1;
     }
 
+    CLG_ASSERT(bbcc->magic1 == 0x31415927);
+    CLG_ASSERT(bbcc->magic2 == 0x17724539);
     idx = level -1;
-    if (bbcc->rec_array[idx])
+    if (bbcc->rec_array[idx]) {
+      VG_(printf)("IDX = %u\n", idx);
       bbcc = bbcc->rec_array[idx];
-    else
+      CLG_ASSERT(bbcc->magic1 == 0x31415927);
+      CLG_ASSERT(bbcc->magic2 == 0x17724539);
+    } else {
       bbcc = clone_bbcc(bbcc, CLG_(current_state).cxt, idx);
-
+      CLG_ASSERT(bbcc->magic1 == 0x31415927);
+      CLG_ASSERT(bbcc->magic2 == 0x17724539);
+    }
+    VG_(printf)("QQQ %u\n", bbcc->rec_index);
+    CLG_ASSERT(bbcc->magic1 == 0x31415927);
+    CLG_ASSERT(bbcc->magic2 == 0x17724539);
     CLG_ASSERT(bbcc->rec_array[bbcc->rec_index] == bbcc);
   }

Comment 20 Josef Weidendorfer 2011-02-03 14:49:17 UTC

Thanks.

> > Or inserting a regular check for the structures to be sane,
> That might be the easiest and least stressful approach, plus it seems
> like you don't have many other options now :)

I do exactly that now (sanity check on all BBCC structs existing), and can iteratively pinpoint at smaller ranges of code where the corruption happens.
The good thing is that the test case is very reproducable. The sanity check
looks like that:

void CLG_(check_bbcc)(BBCC* bbcc)
{
  CLG_ASSERT(bbcc->cxt != 0);
  CLG_ASSERT(bbcc->rec_index < 10);
  CLG_ASSERT(bbcc->rec_array[bbcc->rec_index] == bbcc);
}

and then put at any place

  CLG_(forall_bbccs)(CLG_(check_bbcc));

and for the test case this is enough. Of course, magic words are better.

It looks like a buffer underrun when synchronizing Callgrinds shadow stack
because of the longjmp. But any further search has to wait til today evening.

Comment 21 Josef Weidendorfer 2011-02-04 03:04:13 UTC

Ok, I found the place of the corruption.
It is easy to put an assert there, such that instead of the
corruption, the assertion fails.

The following workaround both avoids a failed assertion and
the corruption (that should be fine for 3.6.1), but I like
to dig a bit more to understand the problem, at least for 3.7.

--- a/callgrind/jumps.c
+++ b/callgrind/jumps.c
@@ -160,6 +160,8 @@ static jCC* new_jcc(BBCC* from, UInt jmp, BBCC* to)
     * This list is only used at dumping time */
 
    if (from) {
+       /* FIXME: Should be CLG_ASSERT(jmp <= from->bb->cjmp_count */
+       if (jmp > from->bb->cjmp_count) jmp = from->bb->cjmp_count;
        jcc->next_from = from->jmp[jmp].jcc_list;
        from->jmp[jmp].jcc_list = jcc;
    }

Comment 22 Josef Weidendorfer 2011-02-04 22:02:52 UTC

Just commited r11523 and r11524 to fix this bug.

Can somebody confirm the fix?

r11523 makes the bug appear as failed assertion, and a lot easier to
reproduce. However, it still does not show up in gdb, no idea why.

r11524 changes the behavior all call graph building for longjmps. Before,
a longjump resulted in pops of stack frames, but additionally in a call
into the target frame, which makes no sense. This indirectly avoids use
of local variables which became invalid on unwinding. Use of these invalid
local vars was the real bug...

Julian: for backporting into 3.6.1, r11524 is enough. But r11523 is still
nice to have in 3.6.1 to better catch similar bugs.

This does not include a big patch for regularly checking all Callgrind structs
to be sane. Will do if I have more time.

Comment 23 walter.stocker 2011-02-07 16:34:35 UTC

(In reply to comment #22)
> Just commited r11523 and r11524 to fix this bug.
> 
> Can somebody confirm the fix?
> 

Tested the valgrind trunk revision on kubuntu-10.04 32 Bit and kubuntu-10.10 64 Bit.

The fix works on both systems - thanks.


Then I tested the same valgrind version on an older system we also have to support (based on LFS, kernel 2.6.17.13, glibc-2.3.5).

On this older system the assertion in jumps.c (r11523) fails with every start - even with a minimal main(){printf("hello world");} application, not using threads at all:

==14357== Callgrind, a call-graph generating cache profiler
==14357== Copyright (C) 2002-2010, and GNU GPL'd, by Josef Weidendorfer et al.
==14357== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==14357== Command: ./hello
==14357==
==14357== For interactive control, run 'callgrind_control -h'.
(below main)
BB# 25011

Callgrind: jumps.c:164 (new_jcc): Assertion '(0 <= jmp) && (jmp <= from->bb->cjmp_count)' failed.
==14357==    at 0x3801E395: report_and_quit (m_libcassert.c:193)
==14357==    by 0x3801E4CD: vgPlain_assert_fail (m_libcassert.c:267)
==14357==    by 0x380102CA: vgCallgrind_get_jcc (jumps.c:164)
==14357==    by 0x380030DA: vgCallgrind_push_call_stack (callstack.c:217)
==14357==    by 0x380021FA: vgCallgrind_setup_bbcc (bbcc.c:844)
==14357==    by 0x62C231F3: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==14357==    at 0x4036D40: (below main) (in /lib/libc-2.3.5.so)


Note: see also the FAQ in the source distribution.
...


When I remove the assertion the testcase works again (as it did with valgrind-3.5.0 - the version we at the moment use on this older system).

Comment 24 Josef Weidendorfer 2011-02-08 01:13:24 UTC

> Tested the valgrind trunk revision on kubuntu-10.04 32 Bit and kubuntu-10.10
> 64 Bit.
> The fix works on both systems - thanks.

Very good, thanks!

> Then I tested the same valgrind version on an older system we also have to
> support (based on LFS, kernel 2.6.17.13, glibc-2.3.5).
> On this older system the assertion in jumps.c (r11523) fails with
> every start -

Being optimistic, I did not hope that to happen. But I added the assert
exactly to catch such problems. At least this does not appear for the
results of the nightly regression tests sent to the mailing list.

If you leave out the assertion, you should instead set jmp to 0 if the
condition of the assertion fails, to avoid corruption of data. It is by
luck that you see no other problem afterwards.

Anyway, can you open another bug report for that? I suppose the distribution
is still on the net, so I can install it in a VM and try to reproduce.
Still, this bug can stay as fixed.