I'm having trouble thinking of scenarios where per-CPU does the trick. Do you mean one per pthread rather than one per CPU?
In the Cilk case, you've got to deal with work stealing of course. So you want rand() to generate a result that is determined by the current stack-frame's position in the binary-tree of spawns. In the work I was referring to:
http://groups.csail.mit.edu/sct/wiki/index.php?title=Other_Projects#Deterministic_Parallel_Random-Number_Generation
... they try a bunch of different methods and I can't remember if any of them split the RNG "eagerly" as they go down the spawn tree or if they just record the tree-index on the way down and then read it out when they generated randoms. (I think the latter.)
Cheers,
-Ryan