We regularly regression-test our code with valgrind, including some JNI code, and those tests always pass at this point. Except yesterday, when we got a bunch of "Illegal read/write" errors. I've satisfied myself as best I can that these are false positives by examining the source, but at the same time I don't recall ever seeing a true false positive from valgrind before, so I'm suspicious of that explanation. I know this is a long shot, but I've attached the report file and I'd be very grateful if someone has the time to take a look and see if anything odd jumps out. As you can see, we're running memcheck on the JVM, which generates a LOT of output (we use the scripts from https://github.com/nyfix/memleaktest to pare that down to something manageable). The bits that are relevant to this discussion are at lines #9655-10497. Any suggestions much appreciated -- thanks in advance!
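For context, the runs in question look roughly like this (the command line below is simplified and illustrative -- not our exact invocation; paths, classpath and most options are omitted):

    valgrind --tool=memcheck --time-stamp=yes --trace-children=yes \
             --num-callers=20 --error-limit=no \
             java -Xcheck:jni -cp ... <our JNI test runner>

i.e. memcheck wraps the java launcher, the JVM loads our JNI library, and the scripts above post-process the resulting log.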
Created attachment 176017 [details] report file
That's a lot of errors. Unless you have solid proof, you should assume that the errors are genuine. They may be harmless and/or impossible for you to fix, but that is not the same thing.

Can you get debuginfo? Errors like

==10:51:28.388424 24239== Conditional jump or move depends on uninitialised value(s)
==10:51:28.388531 24239==    at 0x5520D89: instanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*) (in /build/share/java/jdk-1.6.10_02/jre/lib/amd64/server/libjvm.so)

will be clearer if you can see the file:line info for libjvm.so. If these errors really are benign then you should use a suppression file.

==10:51:28.895572 24239== Invalid write of size 4
==10:51:28.895611 24239==    at 0xD158900: ???
==10:51:28.895617 24239==    by 0xD1473EA: ???
==10:51:28.895635 24239==    by 0x554464C: JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*) (in /build/share/java/jdk-1.6.10_02/jre/lib/amd64/server/libjvm.so)
[snip]
==10:51:28.895740 24239==  Address 0x5f46550 is on thread 2's stack
==10:51:28.895744 24239==  4096 bytes below stack pointer

These are even harder. I assume that the ??? addresses are in the Java generated code. I have no experience in that kind of debugging. Maybe using valgrind+gdb would help (I don't know what extras you can use in gdb). Also, these errors are all exact multiples (from 1 to 6) of 4k pages.

Thirdly, there are some invalid reads.

==11:30:09.646556 24239==  Address 0x10 is not stack'd, malloc'd or (recently) free'd

^^^ doesn't that cause a segfault?

Finally, errors like

==11:43:12.263417 24239== Invalid read of size 8
==11:43:12.263493 24239==    at 0x4AA74E3: vfprintf (in /usr/lib64/libc-2.17.so)
==11:43:12.263514 24239==    by 0x4AD6F38: vsnprintf (in /usr/lib64/libc-2.17.so)
==11:43:12.263519 24239==    by 0x4AB23C1: snprintf (in /usr/lib64/libc-2.17.so)
==11:43:12.263545 24239==    by 0x19CFD9A2: TransactX::LogFacility::formatLogLine(char*, int, TransactX::MITime&, Transact::LogSeverity, char const*, char const*, char const*, int, unsigned long, char const*, int, int, char const*) (LogFacility.cpp:151)

look like they are in your code. You should be able to examine the source to work out what is wrong.
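Coming back to the suppression file suggestion: a deliberately broad entry for the libjvm.so uninitialised-value errors could look something like the following (illustrative only -- the easiest way to get accurate entries is to rerun with --gen-suppressions=all and paste what memcheck prints):

    {
       jvm-link_class-uninitialised-cond
       Memcheck:Cond
       obj:*/libjvm.so
    }

Pass the file with --suppressions=/path/to/jvm.supp. With a single obj: frame like this, the entry matches any "Conditional jump or move depends on uninitialised value(s)" error whose innermost frame is inside libjvm.so.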
Thanks for the quick reply! Please see my embedded comments below:

> That's a lot of errors. Unless you have solid proof you should assume that
> ...
> errors really are benign then you should use a suppression file.

We do something similar -- using the scripts (https://github.com/nyfix/memleaktest) we filter out any reports that don't have at least some of our own code in the stack trace.

> ==10:51:28.895572 24239== Invalid write of size 4

Since these don't include our code in the stack, we just ignore them. The JVM does some VERY weird stuff, and as you point out the actual Java-generated code is invisible to valgrind, gdb, etc. That's OK, because for this we're only interested in testing the JNI code.

> Thirdly there are some invalid reads.
> ...
> doesn't that cause a segfault?

Funny you should mention it. It turns out that the JVM catches SEGVs rather than checking for NULL (https://jcdav.is/2015/10/06/SIGSEGV-as-control-flow/), so that's "normal" for the JVM. One more reason why we ignore anything that doesn't include our code on the stack.

> Finally errors like
> ==11:43:12.263417 24239== Invalid read of size 8
> ==11:43:12.263493 24239==    at 0x4AA74E3: vfprintf (in /usr/lib64/libc-2.17.so)
> ...
> That looks like it is in your code. You should be able to examine the source
> to work out what is wrong.

Yup, and I've reviewed the code, and as far as I can tell it's clean. Not to mention we have literally hundreds of tests using both valgrind and AddressSanitizer that say it's clean. So why, all of a sudden, does valgrind flag it in this one test?

One question I have is about the following:

==11:43:12.062730 24239==  Address 0x4df05f58 is on thread 61's stack
==11:43:12.062735 24239==  in frame #0, created by snprintf (???:)

What does this mean? We already know that the address is on thread 61's stack (the current thread, I believe) and that it's in frame #0 (the current top-most frame). So what's the problem? I can only guess that the JVM is doing something funky that is confusing valgrind, but I would very much like to know whether you think that is plausible.

Thanks again for taking the time to review!
FWIW, I've uploaded the filtered output from our scripts -- the Java bit is at the end. That way we don't have to look at, or bother with, what the JVM itself does (which we can't do anything about in any case).
Created attachment 176041 [details] filtered output
There's still a lot to read there.

Some general comments about when memcheck can at least seem to get things wrong.

I'll start with invalid reads and writes. In order for memcheck to get this wrong, it must either fail to correctly redirect allocation and deallocation functions, or fail to have the right information on stack limits. In this case it looks like the stack.

Uninitialized memory is more difficult, since there are so many ways of doing things with memory. It's a narrow path to tread between ignoring harmless optimizations, missing genuine errors and generating excessive noise.

Back to the invalid stack issues. Debugging this kind of thing isn't straightforward. On the memcheck side you can check the logs and memory maps, and debug combinations of Valgrind and the guest exe. On the guest side you need to know exactly what the intent of the code is. That's probably the hardest bit: if you aren't a JVM developer, then matching what the JVM is doing with what memcheck is seeing and reporting is going to be difficult.

I don't know if it's possible to mix Java debuggers with memcheck running under gdb, or memcheck with vgdb, or both. If you can reproduce the issues outside of your test framework I would suggest starting with vgdb. That's usually a great way of poking around in the guest at the moment of an error. You'll probably need to use a suppression file to get past all those initial uninitialized memory errors.
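For reference, the basic vgdb recipe is roughly this (illustrative command lines -- substitute your real launcher):

    # terminal 1: run the guest under memcheck with the embedded gdbserver,
    # freezing and waiting for gdb when the first error is reported
    valgrind --tool=memcheck --vgdb=yes --vgdb-error=1 <your command>

    # terminal 2: attach gdb and connect to valgrind's gdbserver
    gdb <your executable>
    (gdb) target remote | vgdb

At that point you can inspect registers, the stack pointer and the reported address at the exact moment memcheck flags the access, which is about the best way I know of to compare what the guest is doing with what memcheck thinks the stack limits are.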
Thanks again Paul -- comments follow.

> I'll start with invalid reads and writes. In order for memcheck to get this
> wrong it must fail to correctly redirect allocation and deallocation
> functions or it must fail to have the right information on stack limits. In
> this case it looks like the stack.

At this point I think we're going to have to write this off as a false positive caused by the JVM doing something that confused valgrind. As mentioned earlier, we don't much care about what the JVM itself is doing, simply because there's nothing we can do about it -- not to mention that we simply don't have the time or resources to go down that particular rabbit-hole.

> If you can reproduce the issues outside of your test framework I would ...

We can't even reproduce the issue *with* our test framework -- it was a one-time event, out of literally thousands of test runs that were totally fine. Which, of course, is why we're even asking this question in the first place.

Thanks again for all your help -- I think we're at a dead end here. The main thing is that it sounds like it *is* possible for valgrind to lose track of the stack, and that is the only explanation I can come up with for the results we're seeing. We'll keep an eye on subsequent tests, but frankly I would be surprised if this happens again.
Please open a new item if you ever manage to find a reproducible issue.