As far as I can tell, there's nothing wrong with your code. My hypothesis is that GHC optimizes the call to sumEuler 5000 by evaluating it only once, in a single thread. Here's why I think so:

The program I used for debugging this is:
import Control.Concurrent.Async (async, wait)
import System.IO.Unsafe (unsafePerformIO)

sumEuler :: Int -> Int
sumEuler =  sum . map euler . mkList
           where
             mkList n = seq (unsafePerformIO (putStrLn "Calculating")) [1..n-1]
             euler n = length (filter (relprime n) [1..n-1])
               where
                 relprime x y = gcd x y == 1

p :: Int -> IO Int
p b = return $! sumEuler b

p5000 :: IO Int
p5000 = return $ sumEuler 5000

main :: IO ()
main = do
  a <- async $ p5000
  b <- async $ p5000
  av <- wait a
  bv <- wait b
  print (av,bv)
I made two main modifications. The first adds a debugging message, "Calculating", which is printed when the list [1..(n-1)] is evaluated. The second turns p into a function. Notice that p uses strict application ($!); try replacing it with plain $ to see the difference. I will use this function in the examples that follow.
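To make the difference between $ and $! concrete, here is a minimal sketch. The names lazyReturn and strictReturn are hypothetical and not part of the program above; the error call only serves to reveal when the argument is actually forced.

```haskell
import Control.Exception (ErrorCall, try)

-- Hypothetical helper: with plain ($), the action completes
-- without ever evaluating its argument.
lazyReturn :: Int -> IO Int
lazyReturn x = return $ x

-- Hypothetical helper: with ($!), the argument is forced to weak
-- head normal form before the action completes.
strictReturn :: Int -> IO Int
strictReturn x = return $! x

main :: IO ()
main = do
  -- The result is never inspected, so the lazy version should not fail.
  r1 <- try (lazyReturn (error "forced")) :: IO (Either ErrorCall Int)
  putStrLn (either (const "lazy: error raised")
                   (const "lazy: returned an unevaluated thunk") r1)
  -- The strict version forces the argument, so the error surfaces here.
  r2 <- try (strictReturn (error "forced")) :: IO (Either ErrorCall Int)
  putStrLn (either (const "strict: error raised")
                   (const "strict: returned a value") r2)
```

The first line should report that the lazy version returned without touching its argument, and the second that the strict version raised the error. In the async setting, the same distinction decides whether the work happens inside the worker thread or later, wherever the result is first demanded.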

Running this program with "time ./Test 2 +RTS -ls -N2" gives me:
Calculating
(7598457,7598457)

real    0m3.752s
user    0m3.833s
sys    0m0.211s

Just to be sure, I get almost the same time when running only one computation:
main :: IO ()
main = do
  a <- async $ p5000
  av <- wait a
  print av
So it seems that the value returned by p5000 is computed only once. Because p5000 is a top-level constant that always returns the same value, GHC is probably sharing the single sumEuler 5000 thunk between both uses, which effectively caches it. If this hypothesis is right, then calling sumEuler with two different arguments should run in two different threads. And indeed it does:
main :: IO ()
main = do
  a <- async $ p 5000
  b <- async $ p 4999
  av <- wait a
  bv <- wait b
  print (av,bv)
Gives:
Calculating
Calculating
(7598457,7593459)

real    0m3.758s
user    0m7.414s
sys    0m0.064s

So it runs in two threads, just as expected: the user time is roughly twice the real time. The strict application ($!) is important here. Without it, the async thread seems to return an unevaluated thunk, and the evaluation happens in print (av, bv), which runs in a single thread. The fact that p5000 is a top-level binding is also important. When I do:
main :: IO ()
main = do
  a <- async $ p 5000
  b <- async $ p 5000
  av <- wait a
  bv <- wait b
  print (av,bv)
I get no such optimization (GHC 7.6.3).
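The role of the top-level binding can be sketched in isolation. This is only an illustration under assumptions: sharedSum and sumTo are hypothetical names, and Debug.Trace stands in for the unsafePerformIO message used above.

```haskell
import Debug.Trace (trace)

-- Hypothetical top-level constant: its thunk is evaluated at most once
-- and the result is shared by every use.
sharedSum :: Int
sharedSum = trace "evaluating sharedSum" (sum [1 .. 1000])

-- Hypothetical function: each call may be evaluated separately.
-- Whether two identical calls end up shared depends on the GHC
-- version and optimization flags.
sumTo :: Int -> Int
sumTo n = trace "evaluating sumTo" (sum [1 .. n])

main :: IO ()
main = do
  print sharedSum      -- the trace message appears here
  print sharedSum      -- already evaluated, so no second message
  print (sumTo 1000)
  print (sumTo 1000)   -- may or may not trace again, depending on optimization
```

The trace message for sharedSum should appear only once (on stderr), which mirrors the single "Calculating" line seen with p5000. How often the sumTo message appears is not guaranteed and is worth checking with and without -O.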

Best,
Greg