Bug 396290 - [PATCH] Possible tool - failgrind
Summary: [PATCH] Possible tool - failgrind
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Other Linux
: NOR wishlist
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-07 22:24 UTC by roger.light
Modified: 2021-10-25 23:11 UTC (History)
4 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Patch against master (37.23 KB, patch)
2018-07-07 22:24 UTC, roger.light
Details
Patch using sha3 for hashing instead of adler32 checksums. (50.58 KB, patch)
2018-07-12 22:38 UTC, roger.light
Details
Updated patch with adler32 checksum. (37.20 KB, patch)
2018-07-29 21:25 UTC, roger.light
Details
Updated patch with fixes and text callstacks. (39.69 KB, patch)
2018-07-31 11:44 UTC, roger.light
Details
Updated patch with UInt fix, plus others (40.07 KB, patch)
2018-07-31 23:43 UTC, Roger Light
Details
Updated patch with better control. (62.18 KB, patch)
2018-08-22 22:45 UTC, roger.light
Details
Updated patch - tests and other improvements (113.11 KB, patch)
2018-08-26 21:31 UTC, roger.light
Details
Patch with function enter/leave detection code from callgrind. (233.62 KB, patch)
2018-09-06 21:37 UTC, roger.light
Details
Patch that allows tools to set a syscall error. (5.49 KB, patch)
2018-10-17 21:16 UTC, roger.light
Details
Patch that implements VG_(apply_ExeContext)(). (3.04 KB, patch)
2018-10-17 21:17 UTC, roger.light
Details
Tool patch October 2018 (318.65 KB, patch)
2018-10-17 21:18 UTC, roger.light
Details
Updated apply_ExeContext after comments. (2.40 KB, patch)
2018-10-19 06:11 UTC, roger.light
Details
Patch 1/4, heap memory allocation failures only (135.12 KB, patch)
2019-03-12 16:28 UTC, Roger Light
Details
Patch 2/4, add VG_(force_syscall_error) (5.49 KB, patch)
2019-03-12 16:29 UTC, Roger Light
Details
Patch 3/4, support for syscall failures (104.47 KB, patch)
2019-03-12 16:31 UTC, Roger Light
Details
Patch 4/4, function tracing code from callgrind (118.23 KB, patch)
2019-03-12 16:35 UTC, Roger Light
Details
Patch against 3.18.1 (326.72 KB, patch)
2021-10-25 23:11 UTC, Roger Light
Details

Note You need to log in before you can comment on or make changes to this bug.
Description roger.light 2018-07-07 22:24:29 UTC
Created attachment 113820 [details]
Patch against master

Hello,

I've done some work on making a valgrind tool to help test heap allocation failures. The idea is that when an allocation is about to happen, the tool checks whether it has seen this call stack before. If it hasn't seen the call stack, then the memory allocation fails, otherwise it succeeds. Call stacks are stored in memory and on disk in the form of a checksum, and each time the tool runs it can load the checksums from disk so carrying out multiple runs of the same program will test different allocation failures.

In theory this means you can test the failure of every single heap allocation in your program, in practice that may be difficult to achieve, especially for complex programs.

I've attached the patch for your consideration. The biggest limitation to my mind is the use of the adler32 checksum function. Ideally I would use an actual hashing function and have a version of this patch that uses a third party sha3 implementation, but didn't want to complicate things at this point. I have't managed to get print-docs working, but I believe that is more down to my environment than anything else.

I look forward to your comments...
Comment 1 roger.light 2018-07-12 22:38:45 UTC
Created attachment 113910 [details]
Patch using sha3 for hashing instead of adler32 checksums.
Comment 2 Philippe Waroquiers 2018-07-29 20:34:36 UTC
I took a (very) quick look at the patch.

IIUC, the tool stores a sha3 of the stack traces for which it
already produced an alloc failure, so as to not make the same
stacktrace fail.
Why not store the stack traces themselves in a file?
I guess we do not mind (too much) about disk space ?
(we now that if we run the same application under e.g. memcheck, the
whole set of stacktraces will fit in valgrind memory, so the size should
be reasonable).

I guess you might then use pub_tool_execontext.h functions to store the stacktraces,and check if a stacktrace is already stored.

Also, probably better to have a few regression tests for a tool to be
merged.
By no way this last comment is a promise that the tool will be merged :).
Comment 3 roger.light 2018-07-29 21:25:36 UTC
Created attachment 114201 [details]
Updated patch with adler32 checksum.

Fixes the case where the checksum was being written to the output file incorrectly  if the first character was a 0.
Comment 4 roger.light 2018-07-29 22:11:35 UTC
Thanks for taking a look. I'm well aware that the set of tools in valgrind is a select bunch, so I'm not expecting a merge.

Your understanding of the tool is correct. The idea of using sha3 for storing the call stack is twofold, firstly that is what I used when I was investigating the idea outside of valgrind, and secondly I'm not that familiar with the valgrind api and wanted to present the idea for discussion earlier rather than later.

Saving the whole call stack to a file would be much more desirable certainly, it gives you much better visibility over what has gone before / caused a crash than with the current implementation. The part that I'm missing is around loading the call stacks in at the start - could it work in a similar fashion to a suppression file, i.e. the file contents are loaded as the call stacks that we should suppress the allocation failure?

I did start to look at regression tests, but didn't want to invest too much time into them before getting a little feedback on the general idea.
Comment 5 Ivo Raisr 2018-07-30 11:44:32 UTC
Thank you for sharing this tool with the Valgrind community.
I can see value of this tool in its systemic way in which it fails the application and taking a central place in an automated solution for checking memory allocation problems.

The idea of storing call stacks instead of hashes is of course welcome, as it is much more user friendly and configurable.

Please could you also fix some small problems in your code, such as using "naked" C types (char), missing space between 'if' and '{'. Also do not include system files (fcntl.h) - use Valgrind's own ones (VKI_O_RDONLY etc.).

As regards '--depth' command line argument, please could you explain why
existing --num-callers is not suitable to use?
Comment 6 roger.light 2018-07-30 15:36:24 UTC
> Please could you also fix some small problems in your code, such as using
> "naked" C types (char), missing space between 'if' and '{'. Also do not
> include system files (fcntl.h) - use Valgrind's own ones (VKI_O_RDONLY etc.).

Certainly, that's no problem.
 
> As regards '--depth' command line argument, please could you explain why
> existing --num-callers is not suitable to use?

The documentation for --num-callers talks about the number of stack being "shown", I wasn't sure about whether what I was doing was overloading that meaning in a way that was confusing.
Comment 7 Ivo Raisr 2018-07-30 15:47:28 UTC
Alright, I think it would be preferable to have --num-callers used instead
of --depth. Users are already familiar with --num-callers.
Comment 8 roger.light 2018-07-31 11:44:26 UTC
Created attachment 114229 [details]
Updated patch with fixes and text callstacks.

This is an updated patch:

* Replaced --depth with standard --num-callers
* Call stacks are now saved as text listings
* External headers removed
* Bare C types replaced with valgrind versions
* Formatting fixes

I haven't yet added any tests.
Comment 9 Philippe Waroquiers 2018-07-31 19:57:22 UTC
I am wondering also how much difficult it would be to somewhat more
generalise the tool.

For example, we might want to make similar tests to check failing syscalls.

A part of the tool can then be re-used (e.g. storing/loading the
stack traces etc ...).

We then would need some more command line options to control what
to make fail.
Comment 10 Ivo Raisr 2018-07-31 21:14:45 UTC
(In reply to Philippe Waroquiers from comment #9)
> I am wondering also how much difficult it would be to somewhat more
> generalise the tool.
> 
> For example, we might want to make similar tests to check failing syscalls.

Injecting failures in general is a very good topic. However given the imminent Valgrind release, we should decide if this exp-tool can make it for 3.14 in its current shape (and with some regtests available) or not. This decision eventually drives the scope...
Comment 11 Philippe Waroquiers 2018-07-31 21:21:22 UTC
Yes, I agree that having a way to exercise the error recovery/handling
patch of an application is a very good thing to do.

IMO, it is very late now to add this, even as an exp tool.
A.o. we better discuss the desired functionality somewhat more
(e.g. to make a more general/flexible way to make various things fail
in a single 'framework/tool').

For big/huge applications, we might need a lot more control about
e.g. when to start failing (maybe something like what is available
in callgrind:   let the tool start doing something once a call to a function
has been detected, or once a certain stacktrace has been detected
etc ...

In summary: the tool as it is now is too minimal and too 'young'.
Comment 12 Philippe Waroquiers 2018-07-31 21:22:07 UTC
((In reply to Philippe Waroquiers from comment #11)
> Yes, I agree that having a way to exercise the error recovery/handling
> patch of an application is a very good thing to do.
Remove 'patch of' in the above :)
Comment 13 Ivo Raisr 2018-07-31 21:50:21 UTC
(In reply to roger.light from comment #8)
> Created attachment 114229 [details]
> Updated patch with fixes and text callstacks.

Looks quite good. There are just few nits:
- Please remove trailing whitespace you introduced (a few occurrences).
For example 'git am' gives the list.
- Please have a space between 'while' and the condition.
- Please do better job with fixed size buffers (for example in write_callstack_line).
  Some times ago, Florian underwent a painful exercise auditing all fixed size buffers.
- You can use C99 constructs.
- Description for command line option --callstack-file deserves a better wording.
- Description for command line option --show-fail-traces should be probably renamed to --show-failed-traces (or simply --show-failed). The sentence should start with capital letter anyway.
- Fix indentation of function af_instrument.

I will post my findings about using tool in a separate reply.
Comment 14 Ivo Raisr 2018-07-31 22:30:17 UTC
The current exp-allocfail crashes badly on my Ubuntu 18.04 box.
When running './vg-in-place --tool=exp-allocfail /bin/date', it crashes at af_main.c:119.
That's because i is equal to an equivalent of '-1' (4294967295).

You need to fix the code at af_main.c:96
-   UInt i;
+   Int i;


With this fix in place, I was able to play with ordinary Linux/UNIX standalone programs.
Will try to do 'make dist' and check if everything works ok tomorrow.
Comment 15 Roger Light 2018-07-31 23:19:51 UTC
Thanks for the comments and review.

I think adding greater capability for controlling where and when failures occur, and adding syscall support could turn this into a really useful tool. I don't think that should take away from there already is though.

How about renaming to "failcheck" for example, and rewriting a load of the text to make it clear the tool is about failure checking in general and at the moment considers heap allocation failures, then expanding the scope once you are happy with everything as it stands.

Ivo, I'm annoyed I missed the whitespace, I do pay attention to that sort of thing. For af_instrument, you just mean the 4 spaces indentation instead of 3, nothing to do with the function arguments?
Comment 16 Roger Light 2018-07-31 23:43:05 UTC
Created attachment 114245 [details]
Updated patch with UInt fix, plus others

This addresses the points in comments 13 and 14.

The reason for the crash is now obvious, I hadn't spotted it because it only manifests on stripped binaries, or if --show-below-main is set.
Comment 17 Philippe Waroquiers 2018-08-01 20:18:16 UTC
(In reply to Roger Light from comment #15)
> Thanks for the comments and review.
> 
> I think adding greater capability for controlling where and when failures
> occur, and adding syscall support could turn this into a really useful tool.
> I don't think that should take away from there already is though.
> 
> How about renaming to "failcheck" for example, and rewriting a load of the
> text to make it clear the tool is about failure checking in general and at
> the moment considers heap allocation failures, then expanding the scope once
> you are happy with everything as it stands.

Without some more control on when to start creating failures, the tool
will limited to very small applications : big apps quickly have 
several thousands different alloc stack traces.
So, IMO, too early to push in upstream : introducing a new tool has
a cost in terms of integrating the patches, and then implies a (forever)
cost (maintenance, additional build time, etc ...), and this cost
will very probably not be compensated by very usable functionalities.
Comment 18 Roger Light 2018-08-02 08:54:38 UTC
Yes, I see your point. I had this in my mind as more of a tool for working with test suites where you had more control over what was happening anyway, but in reality people will use it as they will.
Comment 19 roger.light 2018-08-22 22:45:45 UTC
Created attachment 114553 [details]
Updated patch with better control.

This is a work-in-progress update in case there are comments while I'm working on the documentation and tests.

* Renamed to failgrind
* Added failgrind.h with macros for turning on, turning off, and toggling allocation failures, and clearing the in-memory call stack list.
* Added monitor commands to do the same, plus write the current in-memory call stack list to a file, and print or zero stats.
* Split --callstack-file into --callstack-input and --callstack-output to allow these to be independent
* Added --alloc-threshold-high, --allow-threshold-low, and --alloc-threshold-invert to allow allocations larger/small than a size to always be allowed.
* Added --alloc-fail-atstart to allow allocation failures to be disabled on start of failgrind (then to be enabled later on with gdb or --toggle-fn)
* Added --toggle-fn that toggles the allocation failure enable/disable state when a names function is both called and later returned from, to allow failgrind testing to be isolated to certain parts of the program.

I think that covers most of the relevant control features from callgrind, if you have any comments or suggestions down that road please let me know.

I'm least happy that I'm doing everything right in the instrumentation function, so if you did want to take a quick look that would be where I'd appreciate some feedback.
Comment 20 Ivo Raisr 2018-08-22 23:29:07 UTC
Thank you for your work. This is going to be a useful Valgrind tool.

I like the documentation for the latest patch. Reasoning explained well on examples. You could mention that the default behaviour (fail unseen allocations) can be heavily tuned by options which are described later.

As regards instrumentation: unfortunately I don't think the current scheme of detecting function entry/return w.r.t. --toggle-fn is going to work. Instruction instrumentation is done by blocks and their processing is mostly orthogonal to the actual execution. I think Callgrind tool solves similar problem - detecting when a function is entered and returned - try to have a look there. I think it still suffers from some limitations on arm architecture, though.

For coding style - please use explicit check for NULL or != NULL.
For example if (instr_callstack) => if (instr_callstack != NULL)
Comment 21 roger.light 2018-08-23 09:09:42 UTC
I'll make the if explicit changes.

I still need to have a proper go over the documentation to tie it together as a whole and give more examples of all the options and how they interact - I'll cover the point you mention then. I haven't changed the bulk of what is written there since adding these new controls.

Instrumentation certainly takes a bit to get your head around, I don't think I've done that quite yet. I tested the entry/return on a simple example and it worked, but I've just tried something more realistic and you are correct that it fails - it does not detect the return point. I have spent time looking at the callgrind code, but it is involved to say the least. I'll dig into it some more.
Comment 22 roger.light 2018-08-26 21:31:21 UTC
Created attachment 114630 [details]
Updated patch - tests and other improvements

This new patch includes a set of tests covering the majority of the command line options and client requests. The documentation is improved as well. I've added more client requests for more effective use with a test suite - if there were 0 allocations rejected you know the test covered all options, for example.

This has the toggle function removed. I still believe it is a very important feature for the tool, but I haven't got it working yet without ~4000 lines of code from callgrind. The patch is otherwise "complete".
Comment 23 roger.light 2018-09-06 21:37:42 UTC
Created attachment 114817 [details]
Patch with function enter/leave detection code from callgrind.

This patch is the same as 114630, but with code borrowed from callgrind for tracking the callstack for detecting function entering/leaving. I've minimised the code from callgrind as much as I think is sensible, but it may be possible to remove more with some thought. It is still 3800 lines of text from callgrind compared to 1100 for the rest of failgrind.

There's effectively a lot of duplicate code here, so I'll be interested to hear your comments.
Comment 24 roger.light 2018-10-17 21:16:23 UTC
Created attachment 115711 [details]
Patch that allows tools to set a syscall error.
Comment 25 roger.light 2018-10-17 21:17:19 UTC
Created attachment 115712 [details]
Patch that implements VG_(apply_ExeContext)().
Comment 26 roger.light 2018-10-17 21:18:01 UTC
Created attachment 115713 [details]
Tool patch October 2018
Comment 27 roger.light 2018-10-17 21:33:31 UTC
I've just added some updated patches. The two small ones add some coregrind features that are needed for this tool - implementing VG_(apply_ExeContext)() and adding a way for a tool to force a syscall to fail in the pre_syscall callback.

The big one is the patch for the tool. This now has a lot of control over how the allocation failures occur. It can be turned on/off at the start of the run, set to toggle on entry and exit from named functions, limited to a certain number of failures per run, automatically accepted based on size of allocation requested, automatically accepted based on matching function name, automatically accepted based on a dice roll, or to some degree controlled with gdb  or client requests.

It also implements syscall failures, which is turned off by default. If you turn it on, then all syscalls (bar those that are marked as always succeeding) will fail with errno set to EINVAL. You can change to have all syscalls fail with a different errno, and/or set individual syscalls to fail with specific errors. You can set all syscalls with a matching name to be automatically accepted. You can set all syscalls that aren't given a specific error to be automatically accepted. You can accept based on the number of failures in a run, or based on a dice roll, or to some degree controlled with gdb or client requests.

There are tests for most features, but they haven't been tried on different Linux releases or distributions (Kubuntu 18.04 only) nor on Mac or Solaris.

The documentation is pretty complete with examples for most situations.

The bits that are missing - I've taken a lot of code from callgrind to support the function enter/leave feature, but I may have taken too much out. The bit that needs looking at is function name wildcard support which I initially removed but now see it was included in callgrind to allow C++ names to be dealt with correctly. That needs putting back in, but I could do with a break from this for a few days at least and was hoping for some feedback.
Comment 28 Philippe Waroquiers 2018-10-18 18:04:21 UTC
(In reply to roger.light from comment #25)
> Created attachment 115712 [details]
> Patch that implements VG_(apply_ExeContext)().

2 comments:
* As ExeContext maintains internally its epoch, I do not understand
  why there is a parameter 'ep' in VG_(apply_ExeContext).
  There should be no reason to call VG_(apply_ExeContext) with another ep
  than the ExeContext ep, and so VG_(apply_ExeContext) can retrieve it
  itself.
* The code of VG_(apply_ExeContext) looks very similar to 
   VG_(apply_stacktrace). It looks like VG_(apply_ExeContext) can just
   make a call to VG_(apply_stacktrace).

Thanks
Philippe
Comment 29 roger.light 2018-10-19 06:11:10 UTC
Created attachment 115744 [details]
Updated apply_ExeContext after comments.

Thanks for your comments. The epoch parameter was there because I hadn't seen that an ExeContext contains its own epoch. You're also right that it could just call apply_StackTrace. I've updated the patch to do that.
Comment 30 Tom Hughes 2018-10-19 07:30:34 UTC
Nick beat you to it and applied something very similar a few hours ago in https://sourceware.org/git/gitweb.cgi?p=valgrind.git;a=commitdiff;h=8b689c66d9b7f12852048db0e842560761bbf904.
Comment 31 Roger Light 2019-03-12 16:28:16 UTC
Created attachment 118745 [details]
Patch 1/4, heap memory allocation failures only

This patch adds the failgrind tool, with support for failing heap allocations with a variety of control. There is no instrumentation and hence no command line capability of controlling when the failures are enabled. The gdb interface and client requests are available and can be used to control when failures are enabled, but I doubt either of those are used as widely as just command line options.
Comment 32 Roger Light 2019-03-12 16:29:40 UTC
Created attachment 118746 [details]
Patch 2/4, add VG_(force_syscall_error)

This adds VG_(force_syscall_error) to coregrind, which allows tools to force an error value when a syscall fails. This is necessary for the next patch.
Comment 33 Roger Light 2019-03-12 16:31:28 UTC
Created attachment 118747 [details]
Patch 3/4, support for syscall failures

This adds the Failgrind code that handles syscall failures, in terms of the command line options, keeping track of already seen syscall call stacks, the gdb monitor and client requests and similar, but does not include any instrumentation code, so it doesn't work on its own.
Comment 34 Roger Light 2019-03-12 16:35:20 UTC
Created attachment 118748 [details]
Patch 4/4, function tracing code from callgrind

This adds a cut down version of the function tracing code from callgrind to failgrind, which allows the syscall failure code to work, as well as the --alloc-toggle-fn option, which is likely to be a very common way this tool is used. The tests work under this patch as well.

This patch is for demonstrating the tool operation, but I don't think it is a sensible patch for inclusion due to the huge amount of duplication with callgrind.
Comment 35 Roger Light 2021-10-25 23:11:53 UTC
Created attachment 142878 [details]
Patch against 3.18.1

This is the old patches combined and refreshed for 3.18.1.

It also includes the failgrind-helper utility, which makes it easy to run failgrind against an executable repeatedly until it completes with all memory allocations/syscalls in the program run succeeding, or the executable segfaults - at which point you can see which failures caused the segfault. It is imperfect, but a big improvement over manual testing.