Bug 487439 - SIGILL in JDK11, JDK17
Summary: SIGILL in JDK11, JDK17
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (show other bugs)
Version: 3.15 SVN
Platform: RedHat Enterprise Linux Linux
: NOR normal
Target Milestone: ---
Assignee: Paul Floyd
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-23 18:35 UTC by Bill Torpey
Modified: 2024-06-24 05:33 UTC (History)
4 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
sample valgrind file (snipped for size) (597.75 KB, text/plain)
2024-05-23 21:46 UTC, Bill Torpey
Details
associated hs_err_pid file (221.58 KB, text/plain)
2024-05-23 21:48 UTC, Bill Torpey
Details
offending instructions (29.64 KB, text/x-csrc)
2024-06-10 15:26 UTC, Bill Torpey
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bill Torpey 2024-05-23 18:35:19 UTC
In the process of upgrading from JDK8 to JDK11 we are seeing consistent crashes when running under valgrind, similar to below. 
 
We've tried a number of builds, including "plain vanilla", Azul and Oracle builds of JDK11 and JDK17. valgrind crashes consistently when running on Xeon 5670 cpu's, but not when running on i7-4770R (both running CentOS 7.6).  

valgrind versions 3.15, 3.19 and 3.23 all exhibit the same behavior.
 
The application appears to run fine on both platforms when NOT running under valgrind.

Note that we're also asking on OpenJDK forums.

TIA for any help/suggestions!


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00000000170d1795, pid=2190035, tid=2190037
#
# JRE version: OpenJDK Runtime Environment Zulu17.50+19-CA (17.0.11+9) (build 17.0.11+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu17.50+19-CA (17.0.11+9-LTS, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::libmLog
#
# Core dump will be written. Default location: /var/log/cores/%e.core.2190035
#
# If you would like to submit a bug report, please visit:
#   http://www.azul.com/support/
#
 
---------------  S U M M A R Y ------------
 
Command Line: -Xshare:off -Dpython.cachedir.skip=true -enableassertions -XX:+PreserveFramePointer -verbose:gc -Xlog:gc:/home/btorpey/btlogs/abim/gc.log -XX:+PrintGCDetails -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8015 -Xcomp -verbose:class -Dappia.config.class=ConfFac -Dappia.whitelist=/tmp/btorpey/install/BusTalk/master/release/abim/config/whitelist.txt Route /tmp/btorpey/build/bustalk/master/abim/dev/appia MILLIT-10034
 
Host: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz, 24 cores, 188G, Red Hat Enterprise Linux release 8.7 (Ootpa)
Time: Tue May 21 09:03:33 2024 EDT elapsed time: 1111.010074 seconds (0d 0h 18m 31s)
 
---------------  T H R E A D  ---------------
 
Current thread (0x00000000058e1d00):  JavaThread "main" [_thread_in_Java, id=2190037, stack(0x0000000004053000,0x0000000004154000)]
 
Stack: [0x0000000004053000,0x0000000004154000],  sp=0x0000000004136748,  free space=909k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v  ~StubRoutines::libmLog
J 8066 c1 java.math.BigInteger.<clinit>()V java.base@17.0.11 (1688 bytes) @ 0x000000000fe6c996 [0x000000000fe6c360+0x0000000000000636]
v  ~StubRoutines::call_stub
...
Comment 1 Paul Floyd 2024-05-23 19:33:09 UTC
Do you have any Valgrind logs? The Java logs are meaningless to me, other than SIGILL.
Comment 2 Bill Torpey 2024-05-23 21:46:05 UTC
Created attachment 169752 [details]
sample valgrind file (snipped for size)
Comment 3 Bill Torpey 2024-05-23 21:46:47 UTC
had to truncate the file (actual size ~350MB) -- hope this helps
Comment 4 Bill Torpey 2024-05-23 21:48:34 UTC
Created attachment 169753 [details]
associated hs_err_pid file
Comment 5 Paul Floyd 2024-05-24 09:13:49 UTC
This looks like
66 48 0f 73 f1 0c rex.W psllq xmm1,0xc

Some more from the log

vex amd64->IR: unhandled instruction bytes: 0x66 0x48 0xF 0x73 0xF1 0xC 0x66 0xF 0x70 0xF5
vex amd64->IR:   REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0

"psllq" should be handled, but it looks like it's the REX prefix that is causing the problem.
Comment 6 Bill Torpey 2024-05-24 11:59:05 UTC
The odd thing is that the SIGILL is only triggered on the X5670 CPU -- the i7-4770R does not trigger the SIGILL (exact same code, OS, etc.)
Comment 7 Paul Floyd 2024-05-24 12:35:36 UTC
That's not unusual. There may be some kind of runtime instruction set selection based on hwcaps/cpuid.
Comment 8 Mark Wielaard 2024-05-24 13:17:53 UTC
(In reply to Paul Floyd from comment #5)
> This looks like
> 66 48 0f 73 f1 0c rex.W psllq xmm1,0xc
> 
> Some more from the log
> 
> vex amd64->IR: unhandled instruction bytes: 0x66 0x48 0xF 0x73 0xF1 0xC 0x66
> 0xF 0x70 0xF5
> vex amd64->IR:   REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
> vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
> 
> "psllq" should be handled, but it looks like it's the REX prefix that is
> causing the problem.

I think we can just ignore the REX prefix. The intel  64 and IA-32 Architectures Software Developer’s Manual section 10.2.1 "Intel SSE in 64-Bit Mode and Compatibility Mode" says:

"In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes. Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5. Some Intel SSE instructions may be used to operate on general-purpose registers. Use the REX.W prefix to access 64-bit general-purpose registers. Note that if a REX prefix is used when it has no meaning, the prefix is ignored."
Comment 9 Paul Floyd 2024-05-25 16:40:20 UTC
Bill Torpey, could you build Valgrind to test a fix?

This change seems to fix your problem:

diff --git a/VEX/priv/guest_amd64_toIR.c b/VEX/priv/guest_amd64_toIR.c
index f0b1c5516..28c37f092 100644
--- a/VEX/priv/guest_amd64_toIR.c
+++ b/VEX/priv/guest_amd64_toIR.c
@@ -14138,7 +14138,7 @@ Long dis_ESC_0F__SSE2 ( Bool* decode_OK,
          goto decode_success;
       }
       /* 66 0F 73 /6 ib = PSLLQ by immediate */
-      if (have66noF2noF3(pfx) && sz == 2 
+      if (have66noF2noF3(pfx) && (sz == 2 || /* ignore redundant REX.W */ sz == 8)
           && epartIsReg(getUChar(delta))
           && gregLO3ofRM(getUChar(delta)) == 6) {
          delta = dis_SSE_shiftE_imm( pfx, delta, "psllq", Iop_ShlN64x2 );

Mark, do you know how to force using rex.W in assembler?

This testcase also seems to work

/* with redundant rex.W */
static void psllq_4(void)
{
   reg128_t arg1 = { .uq = { 0x0123456789abcdefULL, 0x0123456789abcdefULL } };
   reg128_t result0;
   char state[108];

   if (sigsetjmp(catchpoint, 1) == 0)
   {
      asm(
         "ffree %%st(7)\n"
         "ffree %%st(6)\n"
         "ffree %%st(5)\n"
         "ffree %%st(4)\n"
         "movlps %2, %%xmm1\n"
         "movhps %3, %%xmm1\n"
         //".rex.W psllq $12, %%xmm1\n"
         ".byte 0x66,0x48,0x0f,0x73,0xf1,0x0c\n"
         "movlps %%xmm1, %0\n"
         "movhps %%xmm1, %1\n"
         "cld\n"
         : "=m" (result0.uq[0]), "=m" (result0.uq[1])
         : "m" (arg1.uq[0]), "m" (arg1.uq[1]), "m" (state[0])
         : "xmm1"
      );

      if (result0.uq[0] == 0x3456789abcdef000ULL && result0.uq[1] == 0x3456789abcdef000ULL )
      {
         printf("psllq_4 ... ok\n");
      }
      else
      {
         printf("psllq_4 ... not ok\n");
         printf("  result0.uq[0] = %llu (expected %llu)\n", result0.uq[0], 0x3456789abcdef000ULL);
         printf("  result0.uq[1] = %llu (expected %llu)\n", result0.uq[1], 0x3456789abcdef000ULL);
      }
   }
   else
   {
      printf("psllq_4 ... failed\n");
   }

   return;
}

but it doesn't fit in with the gen_insn_test.pl script.
Comment 10 Mark Wielaard 2024-05-26 11:07:37 UTC
(In reply to Paul Floyd from comment #9)
> diff --git a/VEX/priv/guest_amd64_toIR.c b/VEX/priv/guest_amd64_toIR.c
> index f0b1c5516..28c37f092 100644
> --- a/VEX/priv/guest_amd64_toIR.c
> +++ b/VEX/priv/guest_amd64_toIR.c
> @@ -14138,7 +14138,7 @@ Long dis_ESC_0F__SSE2 ( Bool* decode_OK,
>           goto decode_success;
>        }
>        /* 66 0F 73 /6 ib = PSLLQ by immediate */
> -      if (have66noF2noF3(pfx) && sz == 2 
> +      if (have66noF2noF3(pfx) && (sz == 2 || /* ignore redundant REX.W */
> sz == 8)
>            && epartIsReg(getUChar(delta))
>            && gregLO3ofRM(getUChar(delta)) == 6) {
>           delta = dis_SSE_shiftE_imm( pfx, delta, "psllq", Iop_ShlN64x2 );
> 
> Mark, do you know how to force using rex.W in assembler?

It looks like binutils gas allows you to use rex.W as prefix, so:
rex.W psllq $12, %xmm1

The documentation is a little confusing though (I am not sure whether W should be seen as X, Y and Z extension bit or not):
https://sourceware.org/binutils/docs/as/i386_002dPrefixes.html

The ‘rex’ family of prefixes is used by x86-64 to encode extensions to i386 instruction set. The ‘rex’ prefix has four bits — an operand size overwrite (64) used to change operand size from 32-bit to 64-bit and X, Y and Z extensions bits used to extend the register set.

You may write the ‘rex’ prefixes directly. The ‘rex64xyz’ instruction emits ‘rex’ prefix with all the bits set. By omitting the 64, x, y or z you may write other prefixes as well. Normally, there is no need to write the prefixes explicitly, since gas will automatically generate them based on the instruction operands.
Comment 11 Paul Floyd 2024-05-26 13:38:08 UTC
That doesn't work with clang unfortunately and I have to resort to the byte array.
Comment 12 Bill Torpey 2024-05-28 13:27:52 UTC
Hi all -- back from the long holiday here in US.  Will try suggestions and post back with results.  Thanks for all the help!
Comment 13 Bill Torpey 2024-05-30 15:30:49 UTC
Been testing for the past couple days, and it appears that the patch from https://bugs.kde.org/show_bug.cgi?id=487439#c10 is working fine for us -- thanks!
Comment 14 Paul Floyd 2024-06-01 06:08:41 UTC
(In reply to Bill Torpey from comment #13)
> Been testing for the past couple days, and it appears that the patch from
> https://bugs.kde.org/show_bug.cgi?id=487439#c10 is working fine for us --
> thanks!

Good. It might take me a while to finalize the change - I'd like to see if this applies to other opcodes and write a testcase.
Comment 15 Paul Floyd 2024-06-10 14:45:34 UTC
Do you have any way of telling which compiler was used to produce the binary that is generating sigill?
Comment 16 Bill Torpey 2024-06-10 15:25:16 UTC
The offending instructions are not compiled - they are defined as inline asm in OpenJDK, specifically in src/hotspot/cpu/x86/stubGenerator_x86_64_log.cpp (attached).  

The only thing I can tell from the executable file (/usr/lib/jvm/java-11-openjdk-11.0.4.11-0.el7_6.x86_64/bin/java)  is that it has a dependency on GLIBC_2.2.5.

FWIW, we get similar results with OpenJDK 17, as well as with Azul build of OpenJDK 11.
Comment 17 Bill Torpey 2024-06-10 15:26:38 UTC
Created attachment 170348 [details]
offending instructions
Comment 18 Paul Floyd 2024-06-16 19:28:07 UTC
commit c19d19d34a6dadaf4a9d590f516f813e9cbacdd0
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sun Jun 16 09:25:51 2024 +0200

    Bug 487439 - SIGILL in JDK11, JDK17
Comment 19 Mark Wielaard 2024-06-22 21:12:20 UTC
See commit in comment #18