Bug 417427

Summary: commit to fix vki_siginfo_t definition created numerous regression errors on PPC64
Product: [Developer tools] valgrind Reporter: Carl Love <cel>
Component: vexAssignee: Julian Seward <jseward>
Status: CLOSED FIXED    
Severity: normal CC: mark
Priority: NOR    
Version First Reported In: unspecified   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:
Attachments: PPC64 alginment fix for struct rt_sigframe

Description Carl Love 2020-02-11 15:35:34 UTC
The regression testing for ppc64 reports a significant number of failed tests with commit:

commit 3bac39a10abf292d332bb20ab58c6dd5c28f9108
Author: Eugene Syromyatnikov <evgsyr@gmail.com>
Date:   Fri Mar 8 04:07:00 2019 +0100

    include/vki: fix vki_siginfo_t definition on amd64, arm64, and ppc64

    As it turned out, the size of vki_siginfo_t is incorrect on these 64-bit
    architectures:

        (gdb) p sizeof(vki_siginfo_t)
        $1 = 136
        (gdb) ptype struct vki_siginfo
        type = struct vki_siginfo {
            int si_signo;
            int si_errno;
            int si_code;
            union {
                int _pad[29];
                struct {...} _kill;
                struct {...} _timer;
                struct {...} _rt;
                struct {...} _sigchld;
                struct {...} _sigfault;
                struct {...} _sigpoll;
            } _sifields;
        }

    It looks like that for this architecture, __VKI_ARCH_SI_PREAMBLE_SIZE
    hasn't been defined properly, which resulted in incorrect
    VKI_SI_PAD_SIZE calculation (29 instead of 28).

        <6a9e4>   DW_AT_name        : (indirect string, offset: 0xcf59): _sifields
        <6a9ef>   DW_AT_data_member_location: 16

    This issue has been discovered with strace's "make check-valgrind-memcheck",
    which produced false out-of-bounds writes on ptrace(PTRACE_GETSIGINFO) calls:

        SYSCALL[24264,1](101) sys_ptrace ( 16898, 24283, 0x0, 0x606bd40 )
        ==24264== Syscall param ptrace(getsiginfo) points to unaddressable byte(s)
        ==24264==    at 0x575C06E: ptrace (ptrace.c:45)
        ==24264==    by 0x443244: next_event (strace.c:2431)
        ==24264==    by 0x443D30: main (strace.c:2845)
        ==24264==  Address 0x606bdc0 is 0 bytes after a block of size 144 alloc'd

    (Note that the address passed is 0x606bd40 and the address reported is
    0x606bdc0).

    After the patch, no such errors observed.

    * include/vki/vki-amd64-linux.h [__x86_64__ && __ILP32__]
    (__vki_kernel_si_clock_t): New typedef.
    [__x86_64__ && __ILP32__] (__VKI_ARCH_SI_CLOCK_T,
    __VKI_ARCH_SI_ATTRIBUTES): New macros.
    [__x86_64__ && !__ILP32__] (__VKI_ARCH_SI_PREAMBLE_SIZE): New macro,
    define to 4 ints.
    * include/vki/vki-arm64-linux.h (__VKI_ARCH_SI_PREAMBLE_SIZE): Likewise.
    * include/vki/vki-ppc64-linux.h [__powerpc64__] (__VKI_ARCH_SI_PREAMBLE_SIZE):
    Likewise.
    * include/vki/vki-linux.h [!__VKI_ARCH_SI_CLOCK_T]
    (__VKI_ARCH_SI_CLOCK_T): New macro, define to vki_clock_t.
    [!__VKI_ARCH_SI_ATTRIBUTES] (__VKI_ARCH_SI_ATTRIBUTES): New macro,
    define to nil.
    (struct vki_siginfo): Use __VKI_ARCH_SI_CLOCK_T type for _utime and
    _stime fields.  Add __VKI_ARCH_SI_ATTRIBUTES.

    Resolves: https://bugs.kde.org/show_bug.cgi?id=405201
    Reported-by: Dmitry V. Levin <ldv@altlinux.org>
    Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>


Prior to the patch the number of failed tests was:
== 647 tests, 3 stderr failures, 0 stdout failures, 0 stderrB failures, 1 stdou\
tB failure, 2 post failures ==                                                  
gdbserver_tests/nlcontrolc               (stdoutB)                              
memcheck/tests/bug340392                 (stderr)                               
memcheck/tests/leak_cpp_interior         (stderr)                               
memcheck/tests/linux/rfcomm              (stderr)                               
massif/tests/new-cpp                     (post)                                 
massif/tests/overloaded-new              (post) 

Currently we have 
== 649 tests, 38 stderr failures, 13 stdout failures, 1 stderrB failure, 5 stdo\
utB failures, 2 post failures ==                                                
gdbserver_tests/mcinfcallRU              (stderr)                               
gdbserver_tests/mcsignopass              (stderr)                               
gdbserver_tests/mcsignopass              (stdoutB)                              
gdbserver_tests/mcsigpass                (stderr)                               
gdbserver_tests/mcsigpass                (stdoutB)                              
gdbserver_tests/nlcontrolc               (stdoutB)                              
gdbserver_tests/nlpasssigalrm            (stderr)                               
gdbserver_tests/nlpasssigalrm            (stdoutB)                              
gdbserver_tests/nlpasssigalrm            (stderrB)                              
gdbserver_tests/nlvgdbsigqueue           (stderr)                               
gdbserver_tests/nlvgdbsigqueue           (stdoutB)                              
memcheck/tests/badjump2                  (stderr)                               
memcheck/tests/bug340392                 (stderr)                               
memcheck/tests/descr_belowsp             (stderr)                               
memcheck/tests/leak_cpp_interior         (stderr)                               
memcheck/tests/linux/rfcomm              (stderr)                               
memcheck/tests/post-syscall              (stderr)                               
memcheck/tests/sigaltstack               (stderr)                               
memcheck/tests/signal2                   (stdout)                               
memcheck/tests/signal2                   (stderr)                               
memcheck/tests/vcpu_fnfns                (stdout)                               
memcheck/tests/vcpu_fnfns                (stderr)                               
helgrind/tests/tc18_semabuse             (stderr)                               
helgrind/tests/tc20_verifywrap           (stderr)                               
drd/tests/pth_cancel_locked              (stderr)                               
drd/tests/sigalrm                        (stderr)                               
drd/tests/sigaltstack                    (stderr)                               
drd/tests/tc18_semabuse                  (stderr)                               
massif/tests/new-cpp                     (post)                                 
massif/tests/overloaded-new              (post)                                 
none/tests/async-sigs                    (stderr)                               
none/tests/bug234814                     (stdout)                               
none/tests/bug234814                     (stderr)                               
none/tests/coolo_sigaction               (stdout)                               
none/tests/coolo_sigaction               (stderr)
none/tests/faultstatus                   (stderr)                               
none/tests/linux/pthread-stack           (stderr)                               
none/tests/pending                       (stdout)                               
none/tests/pending                       (stderr)                               
none/tests/ppc64/test_isa_2_06_part3     (stdout)                               
none/tests/ppc64/test_isa_2_06_part3     (stderr)                               
none/tests/ppc64/test_isa_2_07_part2     (stdout)                               
none/tests/ppc64/test_isa_2_07_part2     (stderr)                               
none/tests/ppc64/tw_td                   (stdout)                               
none/tests/ppc64/tw_td                   (stderr)                               
none/tests/ppc64/twi_tdi                 (stdout)                               
none/tests/ppc64/twi_tdi                 (stderr)                               
none/tests/pth_cancel1                   (stdout)                               
none/tests/pth_cancel1                   (stderr)                               
none/tests/pth_cancel2                   (stderr)                               
none/tests/scripts/shell                 (stdout)                               
none/tests/scripts/shell                 (stderr)                               
none/tests/scripts/shell_valid4          (stderr)                               
none/tests/sigstackgrowth                (stdout)                               
none/tests/sigstackgrowth                (stderr)                               
none/tests/syscall-restart1              (stderr)                               
none/tests/syscall-restart2              (stderr)                               
none/tests/thread-exits                  (stdout)                               
none/tests/thread-exits                  (stderr)
Comment 1 Carl Love 2020-02-11 15:39:31 UTC
Would like to get this issue fixed before the next Valgrind release.

I will look at the commit further to help determine the root cause of the breakage.
Comment 2 Mark Wielaard 2020-02-11 16:02:13 UTC
This is also https://bugs.kde.org/show_bug.cgi?id=416760

The original bug fixed was https://bugs.kde.org/show_bug.cgi?id=405201
Comment 3 Carl Love 2020-02-11 21:05:31 UTC
The following patch seems to clean up most of the issues:

Fix for regression error

---
 coregrind/m_sigframe/sigframe-ppc64-linux.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/coregrind/m_sigframe/sigframe-ppc64-linux.c b/coregrind/m_sigframe/
sigframe-ppc64-linux.c
index 0406f3c..b54c4e0 100644
--- a/coregrind/m_sigframe/sigframe-ppc64-linux.c
+++ b/coregrind/m_sigframe/sigframe-ppc64-linux.c
@@ -112,7 +112,7 @@ struct rt_sigframe {
    vki_siginfo_t         info;
    struct vg_sig_private priv;
    UChar                 abigap[288];   // unused
-};
+} __attribute__ ((aligned (16)));
 
 #define SET_SIGNAL_LR(zztst, zzval)                          \
    do { tst->arch.vex.guest_LR = (zzval);                    \
-- 
2.7.4

With this patch the number of regression errors is reduced to:
== 649 tests, 6 stderr failures, 3 stdout failures, 0 stderrB failures, 2 stdou\
tB failures, 2 post failures ==                                                 
gdbserver_tests/nlgone_abrt              (stdoutB)                              
gdbserver_tests/nlpasssigalrm            (stdoutB)                              
memcheck/tests/bug340392                 (stderr)                               
memcheck/tests/leak_cpp_interior         (stderr)                               
memcheck/tests/linux/rfcomm              (stderr)                               
memcheck/tests/vcpu_fnfns                (stdout)                               
memcheck/tests/vcpu_fnfns                (stderr)                               
massif/tests/new-cpp                     (post)                                 
massif/tests/overloaded-new              (post)                                 
none/tests/ppc64/test_isa_2_06_part3     (stdout)                               
none/tests/ppc64/test_isa_2_06_part3     (stderr)                               
none/tests/ppc64/test_isa_2_07_part2     (stdout)                               
none/tests/ppc64/test_isa_2_07_part2     (stderr)   

Continuing to look at what additional changes are needed.  But it looks like a start.
Comment 4 Carl Love 2020-02-11 22:14:52 UTC
Rolled back to Eugene's commit 3bac39a10abf292d332bb20ab58c6dd5c28f9108 and applied my patch.  That does fix all the regression error at that point.

== 647 tests, 3 stderr failures, 0 stdout failures, 0 stderrB failures, 2 stdou\
tB failures, 2 post failures ==                                                 
gdbserver_tests/nlgone_abrt              (stdoutB)                              
gdbserver_tests/nlpasssigalrm            (stdoutB)                              
memcheck/tests/bug340392                 (stderr)                               
memcheck/tests/leak_cpp_interior         (stderr)                               
memcheck/tests/linux/rfcomm              (stderr)                               
massif/tests/new-cpp                     (post)                                 
massif/tests/overloaded-new              (post)   

So, it looks like there is another commit that breaks a few things later.  Looking....
Comment 5 Carl Love 2020-02-13 20:52:04 UTC
Looks like the few additional errors that the fix given for this bug is for an unrelated commit after Eugene's changes.  

If we can get some testing of the fix on other platforms that would be helpful to decide if this is a good fix or not.  

I am a little concerned how universally the __attribute__ ((aligned (16))) works.

Thanks.
Comment 6 Carl Love 2020-02-21 23:13:13 UTC
Created attachment 126280 [details]
PPC64 alginment fix for struct rt_sigframe

PPC64 alginment fix for struct rt_sigframe

The attached patch fixes the alignment for the rt_sigframe structure on PPC64.  This fixes numerous regresion errors.  On power 7, 8 and 9, it fixes 31 stderr failures, 10 stdout failures.
Comment 7 Carl Love 2020-02-21 23:24:08 UTC
The patch was committed to Valgrind mainline

commit 6f8920fd8f9381cf75f5dd0c3c65ac5d86c0a537
Author: Carl Love <carll@us.ibm.com>
Date:   Fri Feb 21 17:22:26 2020 -0600

    PPC64, fix for alignment of the rt_sigframe data structure.
    
    The PPC64 implementation checks that the data structure is aligned.  The
    changes in commit listed below breaks the alignment.  This patch adds an
    explicit alignment directive to ensure the data structure is allocated
    with the required alignment.  This fixes 31 stderr failures, 10 stdout
    failures on the Power 7, Power 8 and Power 9 platforms.
    
    commit 3bac39a10abf292d332bb20ab58c6dd5c28f9108
    Author: Eugene Syromyatnikov <evgsyr@gmail.com>
    Date:   Fri Mar 8 04:07:00 2019 +0100
Comment 8 Carl Love 2020-02-24 16:24:00 UTC
Nightly regression tests errors were reduced as expected.  No unexpected issues/errors were seen.  Closing.