Bug 306035 - s390x: Fix IR generation for LAAG and friends
Summary: s390x: Fix IR generation for LAAG and friends
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (show other bugs)
Version: unspecified
Platform: unspecified Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-30 15:02 UTC by Florian Krohm
Modified: 2012-12-02 21:35 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
initial patch for LAA insn. using CAS IR (4.73 KB, patch)
2012-09-06 08:50 UTC, divya
Details
laa patch (4.49 KB, patch)
2012-11-22 16:32 UTC, divya
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Florian Krohm 2012-08-30 15:02:02 UTC
There is IR generation for these opcodes but it ignores this little tidbit:

The fetch of the second operand for purposes of
loading and the store into the second-operand loca-
tion appear to be a block-concurrent interlocked-
update reference as observed by other CPUs. A spe-
cific-operand-serialization operation is performed.

This needs to be modelled correctly.
Comment 1 divya 2012-09-06 06:48:48 UTC
These instructions need to be fixed with CAS IR.

LOAD AND ADD (LAA, LAAG)
LOAD AND ADD LOGICAL (LAAL, LAALG)
LOAD AND AND (LAN, LANG)
LOAD AND EXCLUSIVE OR (LAX, LAXG)
LOAD AND OR (LAO, LAOG)
Comment 2 divya 2012-09-06 08:50:54 UTC
Created attachment 73698 [details]
initial patch for LAA insn. using CAS IR
Comment 3 Florian Krohm 2012-09-12 21:10:42 UTC
I need to understand this better...
This is what is happening:

orig = memory[r2 + d2]
memory[r2 + d2]  = memory[r2 + d2] + r3
r1 = orig

Now the tricky bit (from POP):
The fetch of the second operand for purposes of
loading and the store into the second-operand loca-
tion appear to be a block-concurrent interlocked-
update reference as observed by other CPUs. 
-----

The way I understand this is:  The fetch from memory and the subsequent store are a single atomic
operation. This implies that the addition is part of the atomic operation, because it is the sum that is being stored. Correct ?
If so, I don't think this can be done correctly with IRCAS.  But before we go further let's first clarify this bit.
Comment 4 Julian Seward 2012-09-12 21:19:33 UTC
> orig = memory[r2 + d2]
> memory[r2 + d2]  = memory[r2 + d2] + r3
> r1 = orig

> If so, I don't think this can be done correctly with IRCAS.

Generating IR like this might work:

restart:
  ea = r2 + d2
  old = *ea
  new = old + r3
  success = CAS(ea,  old -> new)
  if (!success) goto restart
  r1 = orig

The trick in this case is that you can't modify guest state or memory before
the (IR)CAS, since if the CAS fails then guest state isn't the same as when
the insn was started.  But in this case I think you are OK.
Comment 5 Florian Krohm 2012-09-13 15:35:42 UTC
Yes, this would work. However, tool instrumenters would possibly see more memory loads than actually happen. I always thought that was forbidden (which makes sense).  Some longish time ago you sent me mail about the guarantees IR makes but I haven't been able to locate it. I seem to remember that you stated that IR is precise when it comes to memory accesses -- that would include frequency of access.
Comment 6 Julian Seward 2012-10-08 09:29:09 UTC
(In reply to comment #5)

Well, hmm, I suspect in the x86/amd64/arm front ends there might be
cases where we emit more loads than strictly necessary.  IIRC there
are some nasty hacks to do with BSF/BSR instructions, for one thing.
Whilst not ideal, I don't think it is a big deal if these insns get
load-counts which overstate the number of loads, providing they are
not common.  Of course for normal, boring, 99% of loads we want to
have the correct numbers, but these load-atomic-ops category does not
seem to me to be common.

So I think it's fine.
Comment 7 Florian Krohm 2012-10-09 01:34:33 UTC
Yeah, it seems reasonable given that those insns will be rare, I suspect.
The alternative of dreaming up and implementing yet another IROP for this is certainly much less attractive.
We'll put a comment in the code about the possible multiple reads  so this won't be forgotten.
Comment 8 divya 2012-11-22 16:32:54 UTC
Created attachment 75411 [details]
laa patch

Hi Florian,

Please check If my IR code matches julian's given steps.
Comment 9 Florian Krohm 2012-12-02 21:22:53 UTC
(In reply to comment #8)
> Created attachment 75411 [details]
> laa patch
> 
> Hi Florian,
> 
> Please check If my IR code matches julian's given steps.

I've applied the patch (VEX r2576 and valgrind r13146) with the following mods:

I added tests such that the values of the 1st and 2nd operands are
different from the values they will be assigned after the operation.
For instance

test(0, 0, 0, 0);

There is no way to tell that op1 was assigned. But if I change it to

test(15, 0, 0, 0);

then I can tell. A similar argument holds for op2.

I included a test for cc=3.

In s390_irgen_LAA:
The comment for the condition code was not quite correct:

 op2 + op3 < 0(overflow) cc = 1

There is no overflow in this case.

+   iterate_if(binop(Iop_CmpNE32, mkexpr(op2), mkexpr(old_mem)));

I changed this to a yield_if. If the CAS failed, then this thread
will not make progress, in the sense that next time the CAS will fail
again. So let's tell the scheduler, to run another thread for a while.