First, be aware of https://ghc.haskell.org/trac/ghc/ticket/8453, which causes programs compiled with -threaded and -prof to occasionally die with an assertion failure (there are a few other, possibly related, tickets about RTS problems with -threaded and non-vanilla ways).
Next, define what you mean by "faster": higher throughput? Lower latency? Something else?
One approach is to build with profiling and try to optimize the functions exposed by your API. You can do this on a single core. The optimizations you'd get from this are generally useful, but they won't do anything to reduce contention between threads.
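As a rough sketch of that profiling workflow (the function names here are illustrative, not from any particular API): build with -prof, mark or auto-generate cost centres, and run with +RTS -p to get a time/allocation report for the exposed functions.

```haskell
-- A minimal profiling sketch. Build and run with something like:
--   ghc -O2 -prof -fprof-auto Main.hs
--   ./Main +RTS -p        -- writes a Main.prof cost-centre report
-- With -fprof-auto, GHC adds cost centres to all top-level bindings;
-- an explicit SCC pragma lets you name a specific hot spot, as below.
import Data.List (foldl')

-- A stand-in for a function your API exposes.
sumSquares :: [Int] -> Int
sumSquares xs = {-# SCC "sumSquares" #-} foldl' (\acc x -> acc + x * x) 0 xs

main :: IO ()
main = print (sumSquares [1 .. 100000])
```

The .prof output then attributes time and allocation to "sumSquares", which tells you where single-threaded optimization effort is best spent.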
To look into contention issues, I think the best approach is to build with the eventlog enabled and use ThreadScope. This shows pretty clearly where threads are blocked, for how long, and so on. I've also had success timing actions within my test executable and adding that information to the eventlog with Debug.Trace.traceEventIO. You can then see that information within ThreadScope, or grep it out of the eventlog for extra processing (min/max/mean, that sort of thing).
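A small sketch of the traceEventIO technique (the wrapper name is my own, not a library function): emit a marker before and after the action you care about, and the pair shows up as custom events in the eventlog.

```haskell
-- Build and run with something like:
--   ghc -O2 -threaded -eventlog Main.hs
--   ./Main +RTS -N2 -l     -- writes Main.eventlog for ThreadScope
import Debug.Trace (traceEventIO)

-- Bracket an IO action with START/STOP markers; the label makes the
-- events easy to grep out of the eventlog afterwards.
timedAction :: String -> IO a -> IO a
timedAction label act = do
  traceEventIO ("START " ++ label)
  r <- act
  traceEventIO ("STOP " ++ label)
  return r

main :: IO ()
main = timedAction "demo-work" (print (sum [1 .. 1000000 :: Int]))
```

Pairing each START with its STOP gives you per-action durations, from which you can compute min/max/mean outside ThreadScope if you want.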
Running with -N1 can be faster because there is essentially no contention: only a single Haskell thread executes at any given time. If -N1 is markedly faster than -N2 (as in, the same amount of work takes longer to complete with two capabilities), I would try debugging with ThreadScope first.
I'd appreciate any further suggestions also.
John L.