Version:           2.1.2 (using KDE KDE 3.2.3)
Installed from:    Compiled From Sources
Compiler:          gcc 3.2.3
OS:                Linux

The LOCK assembly prefix, used to guarantee atomicity at the instruction level, seems to have no effect when the code runs under Valgrind. The following code is used by Oracle to compare and swap for fast concurrent access to structures in shared memory.

-------
        .text
        .align 16
.globl swapit
        .type swapit,@function
swapit:
        movl 4(%esp),%ecx          /* ecx = data address */
        movl 8(%esp),%eax          /* eax = old_value */
        movl 12(%esp),%edx         /* edx = new_value */
        lock; cmpxchgl %edx,(%ecx) /* Atomic compare and swap */
        setz %al
        movzbl %al,%eax            /* put ZF into %eax */
        ret                        /* return */
        .size swapit,.-swapit
-------

With --trace-codegen=11111 enabled, I noticed the following.

In the first stage (x86 -> UCODE) I see the LOCK instruction:

(2) 0x8048428: movl 12(%esp,,),%edx
(2)
(2) 10: GETL %ESP, t14
(2) 11: LEA1L 12(t14), t12
(2) 12: LDL (t12), t16
(2) 13: PUTL t16, %EDX
(2) 14: INCEIPo $4
(2)
(2) 0x804842C: cmpxchgl %edx,(%ecx)
(2)
==>(2) 15: LOCKo
(2) 16: GETL %ECX, t26
(2) 17: LDL (t26), t22
(2) 18: GETL %EDX, t20
(2) 19: GETL %EAX, t18
(2) 20: MOVL t18, t24
(2) 21: SUBL t22, t24 (-wOSZACP)
(2) 22: CMOVLz t20, t22 (-rOSZACP)
(2) 23: CMOVLnz t22, t18 (-rOSZACP)
----------------------

At the last stage, i.e. after instrumentation, I do not see the LOCK:

(2) 12: LDL (t12), t16
(2) 13: PUTL t16, %EDX
(2) 14: INCEIPo $4
(2) 15: CCALLo 0xB72A4BB1(t4)
(2) 16: LDL (t4), t22
(2) 17: MOVL t10, t24
(2) 18: SUBL t22, t24 (-wOSZACP)
(2) 19: CMOVLz t16, t22 (-rOSZACP)
(2) 20: CMOVLnz t22, t10 (-rOSZACP)
-------------------

A simple testcase is attached that uses the above assembly code and can be used to check whether the LOCK mechanism is in effect.

The trials were done on:

$ uname -a
Linux stacj32 2.4.21-15.ELsmp #1 SMP Thu Apr 22 00:27:41 EDT 2004 i686 i686 i386 GNU/Linux

$ valgrind --version
valgrind-2.1.2 (the recently released development version)

4 files are attached:
swap.s, client.c, server.c, Makefile

swap.s is assembly code containing the swapit function, which performs an atomic swap. It is passed a shared memory location, old_value and new_value. swapit reads the shared memory and compares it with old_value. If they are equal, it overwrites the location with new_value and returns true. If they differ, it returns false.

server.c creates a shared memory segment of 5 bytes. The last byte controls starting and stopping the clients: when it is set to 1 the clients are ready to go, and when it is set to 0 the clients exit. The server tries to swap a 0 in the first 4 bytes (an int) with a 1.

client.c attaches to the shared memory segment and tries to swap a 1 with a 0. At least 2 instances of the client should be started. The tests were done with 3 clients started quickly one after the other. When the server has successfully swapped 10000 times, it sets the 5th byte to 0 to signal the clients to exit, and then exits itself.

You have to run the server first. The server accepts an integer argument specifying the number of successful swaps it waits for before exiting. The default is 10000.

After execution the clients and the server each print the number of successful swaps they performed. The server's count is always 10000, or the value given on the command line. The sum of the clients' counts should match the server's count. The counts differ if, for example, two clients read 1 at the same time and both update it to 0. A differing count means LOCK is not taking effect.

When the test was run on an SMP kernel (2.4.21-15.ELsmp) without valgrind, the counts matched. With valgrind, the sum of the clients' swaps exceeded the server's count: when the server's count was 10000, the clients' total was 10988.

When the test was run on a non-SMP kernel (2.4.21-15.EL) without valgrind, the clients' total was equal to the server's.
With valgrind, I sometimes found that the clients' total fell below the server's total, which I could not explain. With the server's count set to 1000, the clients' counts summed to 977. In any case, a mismatch can only happen when atomicity is absent. The trials were done on a 2-CPU box.
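For readers who do not read AT&T assembly, here is a minimal C sketch of what swapit is meant to do and of the count the test expects. It is not one of the attached files: the names `swapit_sketch` and `run_handoffs` are made up for illustration, and it assumes GCC or Clang, using the `__atomic_compare_exchange_n` builtin in place of the hand-written `lock; cmpxchgl`. The second function walks through the server/client handoff in a single thread, so it shows only the expected bookkeeping, not real concurrency.

```c
#include <stdbool.h>

/* Hypothetical C stand-in for swapit() in swap.s (not the attached code).
 * Atomically: if *addr == old_value, store new_value and return 1;
 * otherwise leave *addr unchanged and return 0. */
int swapit_sketch(int *addr, int old_value, int new_value)
{
    /* GCC/Clang builtin (assumed toolchain). On failure it writes the value
     * it observed into old_value, which here is only our local copy. */
    return __atomic_compare_exchange_n(addr, &old_value, new_value,
                                       false, __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST);
}

/* Single-threaded walk through the test protocol: the "server" swaps
 * 0 -> 1 and the "client" swaps 1 -> 0. Returns the client's count of
 * successful swaps once the server has reached server_target. */
int run_handoffs(int server_target)
{
    int shared = 0;
    int server_count = 0;
    int client_count = 0;

    while (server_count < server_target) {
        if (swapit_sketch(&shared, 0, 1))
            server_count++;
        if (swapit_sketch(&shared, 1, 0))
            client_count++;
    }
    return client_count;
}
```

When every swap is atomic, each successful server swap can be consumed by at most one client swap, so the clients' total cannot exceed the server's count.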
Created attachment 6792: assembly code using LOCK
Created attachment 6793: driver file
Created attachment 6794: clients
Created attachment 6795: build
I'm not sure there's much we can do about this. There is no guarantee that a single instruction in the executable being emulated will translate to a single instruction in the generated code, so it may not be possible to preserve a LOCK prefix in all cases.
The LOCK prefix guarantees instruction-level atomicity in a multi-CPU environment. It is essential for mutual exclusion and is a very important feature. The point is not just that the LOCK prefix is missing from the generated code, but that atomicity is no longer guaranteed as a result.
I know that. What I'm saying is that it is not, in general, possible for valgrind to provide the atomicity you require, because there is no guarantee that the instruction in the input stream that you want to be atomic will be a single instruction in the output stream. That's just how valgrind works. It might be possible to preserve the LOCK prefix when there is a one-to-one mapping between instructions, and valgrind should certainly warn loudly when it is ignoring a LOCK prefix.
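To make the failure mode concrete, here is a small C sketch of a compare-and-swap that has been split into a separate load step and store step, roughly the way a one-to-many translation could split it. This is a hypothetical model written for illustration, not Valgrind code, and all names in it are invented. The two clients are interleaved by hand so that the outcome is deterministic.

```c
/* Hypothetical model, not Valgrind code: a compare-and-swap broken into
 * two separately scheduled steps, as if the LOCK prefix had been dropped. */
struct split_cas {
    int *addr;      /* shared location */
    int old_value;  /* value the caller expects to find */
    int new_value;  /* value to store on a match */
    int matched;    /* result of the compare, decided in step 1 */
};

/* Step 1: load and compare. Another CPU may run between this and step 2. */
void split_cas_load(struct split_cas *op)
{
    op->matched = (*op->addr == op->old_value);
}

/* Step 2: store if step 1 matched. Returns 1 if the caller believes
 * its swap succeeded. */
int split_cas_store(struct split_cas *op)
{
    if (op->matched)
        *op->addr = op->new_value;
    return op->matched;
}

/* One server swap (0 -> 1) has just happened; two clients now race to
 * swap 1 -> 0. Returns how many clients report success. */
int racing_clients(void)
{
    int shared = 1;
    struct split_cas a = { &shared, 1, 0, 0 };
    struct split_cas b = { &shared, 1, 0, 0 };

    split_cas_load(&a);  /* client A sees 1 */
    split_cas_load(&b);  /* client B also sees 1, before A has stored */
    return split_cas_store(&a) + split_cas_store(&b);
}
```

In this interleaving both clients count a success for a single server swap, which is one way the clients' total can end up higher than the server's count.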
I get your point, thanks. The alternative, then, would be to use mutex algorithms written in a high-level language. I guess this bug is a good item for the wishlist.
One of the local experts mentioned that a LOCK-prefixed cmpxchg can be implemented as a cmpxchg followed by an lfence, which does not require the LOCK prefix. But of course that can only work on the P4.
Julian, will this be fixed by bug 197793? If so, this can be marked as a dup of it.
*** This bug has been marked as a duplicate of bug 197793 ***