489221 – Unrecognized instruction: _mm256_cvtps_ph (vcvtps2ph)

Bug 489221 - Unrecognized instruction: _mm256_cvtps_ph (vcvtps2ph)

Summary: Unrecognized instruction: _mm256_cvtps_ph (vcvtps2ph)

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	vex (other bugs)
Version First Reported In:	3.19.0
Platform:	unspecified Unspecified

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-06-26 09:56 UTC by Steve Hill
Modified:	2024-07-02 08:36 UTC (History)
CC List:	4 users (show)

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
Minimal reproducer for the issue (518 bytes, text/plain) 2024-06-26 09:56 UTC, Steve Hill	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Steve Hill 2024-06-26 09:56:32 UTC

Created attachment 171003 [details]
Minimal reproducer for the issue

SUMMARY

I am working on code that uses half-precision floating point numbers as a storage format. However, when I run Valgrind on the application, it fails with Unrecognized instruction on the instructions that do the float to half conversion.

STEPS TO REPRODUCE
1. Save the attached reproducer `repro.c`
2. Compile with `gcc repro.c -march=haswell`
3. Run `valgrind ./a.out`

OBSERVED RESULT

(With Valgrind 3.16 but have confirmed the same issue exists with 3.19):

$ valgrind ./a.out 
==1257== Memcheck, a memory error detector
==1257== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1257== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1257== Command: ./a.out
==1257== 
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE3 0x7D 0x1D 0xC0 0x0 0xC5 0xF9 0x7F 0x44
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=1 VEX.nVVVV=0x0 ESC=0F3A
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==1257== valgrind: Unrecognised instruction at address 0x109182.
==1257==    at 0x109182: FloatToHalf (in /data/vagrant/valgrind/a.out)
==1257==    by 0x10922A: main (in /data/vagrant/valgrind/a.out)
==1257== Your program just tried to execute an instruction that Valgrind
==1257== did not recognise.  There are two possible reasons for this.
==1257== 1. Your program has a bug and erroneously jumped to a non-code
==1257==    location.  If you are running Memcheck and you just saw a
==1257==    warning about a bad jump, it's probably your program's fault.
==1257== 2. The instruction is legitimate but Valgrind doesn't handle it,
==1257==    i.e. it's Valgrind's fault.  If you think this is the case or
==1257==    you are not sure, please let us know and we'll try to fix it.
==1257== Either way, Valgrind will now raise a SIGILL signal which will
==1257== probably kill your program.
==1257== 
==1257== Process terminating with default action of signal 4 (SIGILL)
==1257==  Illegal opcode at address 0x109182
==1257==    at 0x109182: FloatToHalf (in /data/vagrant/valgrind/a.out)
==1257==    by 0x10922A: main (in /data/vagrant/valgrind/a.out)
==1257== 
==1257== HEAP SUMMARY:
==1257==     in use at exit: 0 bytes in 0 blocks
==1257==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==1257== 
==1257== All heap blocks were freed -- no leaks are possible
==1257== 
==1257== For lists of detected and suppressed errors, rerun with: -s
==1257== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

EXPECTED RESULT

The instruction should be correctly handled

SOFTWARE/OS VERSIONS
Windows: 
macOS: 
Linux/KDE Plasma: 
(available in About System)
KDE Plasma Version: 
KDE Frameworks Version: 
Qt Version: 

ADDITIONAL INFORMATION

Comment 1 Mark Wielaard 2024-06-26 12:17:28 UTC

The m256 variant is part of AVX512F

Comment 2 Tom Hughes 2024-06-26 12:34:54 UTC

It's a dupe of 383010 then.

*** This bug has been marked as a duplicate of bug 383010 ***

Comment 3 Steve Hill 2024-06-26 14:08:22 UTC

According to Intel, this is not part of AVX512F. The repro targets Haswell, which is over 10 years old and does not support any AVX512 at all but still supports this as it is part of F16C:

Synopsis

__m128i _mm256_cvtps_ph (__m256 a, int imm8)
#include <immintrin.h>
Instruction: vcvtps2ph xmm, ymm, imm8
CPUID Flags: F16C

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed half-precision (16-bit) floating-point elements, and store the results in dst.

Rounding is done according to the imm8[2:0] parameter, which can be one of:
    _MM_FROUND_TO_NEAREST_INT // round to nearest
    _MM_FROUND_TO_NEG_INF     // round down
    _MM_FROUND_TO_POS_INF     // round up
    _MM_FROUND_TO_ZERO        // truncate
    _MM_FROUND_CUR_DIRECTION  // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Hence, I am not convinced that it is a duplicate of bug 383010 as that seems to be AVX512-specific? Of course, if it gets done, I don't really mind that much!

Comment 4 Mark Wielaard 2024-06-26 15:02:28 UTC

Sorry, looks like you are right. I was confused with the m512 variant, which is AVX512F. The m256 variant is part of F16C.

Comment 5 Mark Wielaard 2024-06-30 19:24:18 UTC

This was implemented in:

commit 472b067e39a11a47ae3fa7cd7d3142558f78969d
Author: Julian Seward <jseward@acm.org>
Date:   Sun Mar 17 21:41:42 2019 +0100

    amd64: Implement RDRAND, VCVTPH2PS and VCVTPS2PH.
    
    Bug 398870 - Please add support for instruction vcvtps2ph
    Bug 353370 - RDRAND amd64->IR: unhandled instruction bytes: 0x48 0xF 0xC7 0xF0
    
    This commit implements:
    
    * amd64 RDRAND instruction, on hosts that have it.
    
    * amd64 VCVTPH2PS and VCVTPS2PH, on hosts that have it.
    
      The presence/absence of these on the host is now reflected in the CPUID
      results returned to the guest.  So code that tests for these features in
      CPUID and acts accordingly should "just work".
    
    * New test cases, none/tests/amd64/rdrand and none/tests/amd64/f16c.  These
      are built if the host's assembler can handle them, in the usual way.

And on my machine the reproducer works just fine under valgrind.

If it still doesn't for you with latest valgrind could you run with valgrind -v and check the hwcaps detect f16c:

--411301-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-ssse3-avx-avx2-bmi-f16c-rdrand-rdseed-fma
(and check your machine does actually have f16c set in /proc/cpuinfo)

Comment 6 Steve Hill 2024-07-02 07:39:43 UTC

Argh! You are right. I am running inside a VirtualBox VM and, although the instructions *seem* to work just fine, the feature flag is not set. It looks to be due to this longstanding feature request to add F16C (and FMA/BMI) support to VBox: https://www.virtualbox.org/ticket/15471

Thanks for your help,

Steve.

Comment 7 Paul Floyd 2024-07-02 08:00:10 UTC

Is it OK to close this item then?

Comment 8 Steve Hill 2024-07-02 08:24:49 UTC

Yes, from my side this looks like a VirtualBox issue, not a Valgrind issue, so can be closed here. Thanks!