I'll add making that change to my inline primops task. (It does give nice reference asm for that task. )

On Feb 1, 2014 1:00 PM, "Simon Marlow" <marlowsd@gmail.com> wrote:
So there's no problem in casMutVar#?

There's probably no good reason not to use the gcc intrinsics.  Either they didn't exist when I wrote the inline asm, or they had been added to gcc since I last read the docs.

Cheers,
Simon

On 01/02/2014 19:35, Ryan Newton wrote:
Ok, my bad, sorry all.

This is NOT a problem that will crop up in 7.8.   Rather, it's just a
problem with the duplicated bits of GHC RTS functionality that were
stuck into the atomic-primops library.  It was a C preprocessor problem
that was causing the inline asm we were discussing in this thread to not
actually be called.

Still, I'd like to be reminded of the rational for all this conditional
inline asm rather than using the C compiler intrinsics!  Anyone?

Best,
   -Ryan



On Sat, Feb 1, 2014 at 2:17 PM, Carter Schonwald
<carter.schonwald@gmail.com <mailto:carter.schonwald@gmail.com>> wrote:

    I got the test suite running on my (2 core) machine mac book air,
    with 7.8
    i've run it several times, not seeing any failures


    On Sat, Feb 1, 2014 at 1:03 PM, Carter Schonwald
    <carter.schonwald@gmail.com <mailto:carter.schonwald@gmail.com>> wrote:

        hrmmmm I have a crazy idea

        "Compare RAX with r/m64. If equal, ZF is set and r64 is loaded
        into r/m64. Else, clear ZF
        and load r/m64 into RAX." is what the docs say for the cmpxchng
        instruction

        so RAX is the old values,  (EAX in the  32bit case). And it
        looks like we dont' set that explicitly when calling the asm ..
        CMPXCHG r/m64, r64

          hrmmm




        On Sat, Feb 1, 2014 at 12:52 PM, Ryan Newton <rrnewton@gmail.com
        <mailto:rrnewton@gmail.com>> wrote:

            Ok, here's another experiment, on this commit:

            https://github.com/rrnewton/haskell-lockfree/commit/399bb19fa02eaf2f2eab5d02c4b608535362f9bc

            Here, if I use GCC's __sync_val_compare_and_swap instead of
            GHC's version of cas(), the problem also goes away.  I think
            these two implementations should behave identically, and
            that they don't perhaps indicates that there is something
            off about the inline asm, as Carter was suggesting:

            *#if i386_HOST_ARCH || x86_64_HOST_ARCH*
            *    __asm__ __volatile__ (*
            * "lock\ncmpxchg %3,%1"*
            *          :"=a"(o), "=m" (*(volatile unsigned int *)p) *
            *          :"0" (o), "r" (n));*
            *    return o;*

            The x86 CAS instruction must put the "return value" in the
            accumulator register, and indeed this constrains "o" to be
            allocated to the accumulator register, while the new value
            "n" can be in any register.

            So if there's a problem, I don't know what it is.  Except
            I'm not sure what the ramifications of "o" being a function
            parameter AND having an "=a" constraint on it are...

                -Ryan



            On Sat, Feb 1, 2014 at 11:27 AM, Carter Schonwald
            <carter.schonwald@gmail.com
            <mailto:carter.schonwald@gmail.com>> wrote:

                Hey Ryan, i've made the leap to using 7.8 on my machine,
                so i'll first have to get some pull requests in on
                atomic-primops before I can test it locally :), expect
                those patches later today!

                looks like gcc's inline ASM logic is pretty correct,
                after testing it a bit locally, pardon my speculative
                jumping the gun.


                On Sat, Feb 1, 2014 at 9:10 AM, Ryan Newton
                <rrnewton@gmail.com <mailto:rrnewton@gmail.com>> wrote:

                    Hi Carter & others,

                    Carter, yes, this is CAS on pointers and in my next
                    mail I'll try to come up with some hypotheses as to
                    why we may have (remaining) problems there.

                    But first, I have been assured that on x86 there is
                    no failure mode in which doing a comparison on the
                    value read by CAS should not correctly diagnose
                    success or failure (same as directly reading the
                    Zero Flag) [1].

                    And yet, there's this discrepancy, where the
                    modified casMutVar that I linked to does not have
                    the failure.  As for reproducing the failure, either
                    of the two following tests will currently show problems:

                      * Two threads try to casIORef False->True, both
                        succeed
                      * 120 threads try to read, increment, CAS until
                        they succeed.  The total is often not 120
                        because multiple threads think the successfully
                        incremented, say, 33->34.

                    Here's a specific recipe for the latter test on GHC
                    7.6.3 Mac or Linux:

                    *git clone
                    git@github.com:rrnewton/haskell-lockfree-queue.git
                    *
                    *cd haskell-lockfree-queue/AtomicPrimops/*
                    *git checkout 1a1e7e55f6706f9e5754
                    *
                    *cabal sandbox init
                    *
                    *cabal install -f-withTH -fforeign ./ ./testing
                    --enable-tests
                    *
                    *./testing/dist/dist-sandbox-*/build/test-atomic-primops/test-atomic-primops
                    -t n_threads
                    *

                    You may have to run the last line several times to
                    see the failure.

                    Best,
                       -Ryan

                    [1] I guess the __sync_bool_compare_and_swap
                    intrinsic which reads ZF is there just to avoid the
                    extra comparison.

                    [2] P.S. I'd like to try this on GHC head, but the
                    RHEL 6 machine I usually use to build it is
                    currently not validating (below error, commit
                    65d05d7334).  After I debug this gmp problem I'll
                    confirm that the bug under discussion applies on the
                    7.8 branch.

                         ./sync-all checkout ghc-7.8
                         sh validate
                    ...
                    /usr/bin/ld: libraries/integer-gmp/gmp/objs/aors.o:
                    relocation R_X86_64_32 against `__gmpz_sub' can not
                    be used when making a shared object; recompile with
                    -fPIC
                    libraries/integer-gmp/gmp/objs/aors.o: could not
                    read symbols: Bad value




                    On Sat, Feb 1, 2014 at 2:55 AM, Carter Schonwald
                    <carter.schonwald@gmail.com
                    <mailto:carter.schonwald@gmail.com>> wrote:

                        Ryan, is your benchmark using CAS on pointers,
                        or immediate words? trying to get atomic primops
                        to build on my 7.8 build on my mac


                        On Sat, Feb 1, 2014 at 2:44 AM, Carter Schonwald
                        <carter.schonwald@gmail.com
                        <mailto:carter.schonwald@gmail.com>> wrote:

                            https://ghc.haskell.org/trac/ghc/ticket/8724#ticket
                            is the ticket

                            when i'm more awake i'll experiment some more


                            On Sat, Feb 1, 2014 at 2:33 AM, Carter
                            Schonwald <carter.schonwald@gmail.com
                            <mailto:carter.schonwald@gmail.com>> wrote:

                                i have a ticket for tracking this,
                                though i'm thinking my initial attempt
                                at a patch generates the same object
                                code as it did before.

                                @ryan, what CPU variant are you testing
                                this on? is this on a NUMA machine or
                                something?


                                On Sat, Feb 1, 2014 at 1:58 AM, Carter
                                Schonwald <carter.schonwald@gmail.com
                                <mailto:carter.schonwald@gmail.com>> wrote:

                                    woops, i mean cmpxchgq


                                    On Sat, Feb 1, 2014 at 1:36 AM,
                                    Carter Schonwald
                                    <carter.schonwald@gmail.com
                                    <mailto:carter.schonwald@gmail.com>>
                                    wrote:

                                        ok, i can confirm that on my
                                        64bit mac, both clang and gcc
                                        use cmpxchgl rather than cmpxchg
                                        i'll whip up a strawman patch on
                                        head that can be cherrypicked /
                                        tested out by ryan et al


                                        On Sat, Feb 1, 2014 at 1:12 AM,
                                        Carter Schonwald
                                        <carter.schonwald@gmail.com
                                        <mailto:carter.schonwald@gmail.com>>
                                        wrote:

                                            Hey Ryan,
                                            looking at this closely
                                            Why isn't CAS using
                                            CMPXCHG8B on 64bit
                                            architectures?  Could that
                                            be the culprit?

                                            Could the issue be that
                                            we've not had a good stress
                                            test that would create
                                            values that are equal on the
                                            32bit range, but differ on
                                            the 64bit range, and you're
                                            hitting that?

                                            Could you try seeing if
                                            doing that change fixes
                                            things up?
                                            (I may be completely wrong,
                                            but just throwing this out
                                            as a naive "obvious" guess)


                                            On Sat, Feb 1, 2014 at 12:58
                                            AM, Ryan Newton
                                            <rrnewton@gmail.com
                                            <mailto:rrnewton@gmail.com>>
                                            wrote:

                                                Then again... I'm having
                                                trouble seeing how the
                                                spec on page 3-149 of
                                                the Intel manual would
                                                allow the behavior I'm
                                                seeing:

                                                http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

                                                Nevertheless, this is
                                                exactly the behavior
                                                we're seeing with the
                                                current Haskell primops.
                                                  Two threads
                                                simultaneously
                                                performing the same
                                                CAS(p,a,b) can both
                                                think that they succeeded.





                                                On Sat, Feb 1, 2014 at
                                                12:33 AM, Ryan Newton
                                                <rrnewton@gmail.com
                                                <mailto:rrnewton@gmail.com>>
                                                wrote:

                                                    I commented on the
                                                    commit here:

                                                    https://github.com/ghc/ghc/commit/521b792553bacbdb0eec138b150ab0626ea6f36b

                                                    The problem is that
                                                    our "cas" routine in
                                                    SMP.h is similar to
                                                    the C compiler
                                                    intrinsic
                                                    __sync_val_compare_and_swap,
                                                    in that it returns
                                                    the old value.  But
                                                    it seems we cannot
                                                    use a comparison
                                                    against that old
                                                    value to determine
                                                    whether or not the
                                                    CAS succeeded.  (I
                                                    believe the CAS may
                                                    fail due to
                                                    contention, but the
                                                    old value may happen
                                                    to look like our old
                                                    value.)

                                                    Unfortunately, this
                                                    didn't occur to me
                                                    until it started
                                                    causing bugs [1]
                                                    [2].  Fixing
                                                    casMutVar# fixes
                                                    these bugs.
                                                      However, the way
                                                    I'm currently fixing
                                                    CAS in the
                                                    "atomic-primops"
                                                    package is by using
                                                    __sync_bool_compare_and_swap:

                                                    https://github.com/rrnewton/haskell-lockfree/commit/f9716ddd94d5eff7420256de22cbf38c02322d7a#diff-be3304b3ecdd8e1f9ed316cd844d711aR200

                                                    What is the best fix
                                                    for GHC itself?
                                                    Would it be ok for
                                                    GHC to include a C
                                                    compiler intrinsic
                                                    like
                                                    __sync_val_compare_and_swap?
                                                      Otherwise we need
                                                    another big ifdbef'd
                                                    function like "cas"
                                                    in SMP.h that has
                                                    the
                                                    architecture-specific inline
                                                    asm across all
                                                    architectures.  I
                                                    can write the x86
                                                    one, but I'm not
                                                    eager to try the others.

                                                    Best,
                                                        -Ryan

                                                    [1]
                                                    https://github.com/iu-parfunc/lvars/issues/70
                                                    [2]
                                                    https://github.com/rrnewton/haskell-lockfree/issues/15



                                                _______________________________________________
                                                ghc-devs mailing list
                                                ghc-devs@haskell.org
                                                <mailto:ghc-devs@haskell.org>
                                                http://www.haskell.org/mailman/listinfo/ghc-devs