
Hi,

I'm trying to add support for the POPCNT instruction, which exists on some modern CPUs (e.g. Nehalem). The idea is to add a popCnt# primop that generates a POPCNT instruction when compiling with -msse4.2. If the user didn't specify -msse4.2, the primop should fall back to some other implementation of population count. A good fallback, in terms of both speed and memory usage, is this lookup-table based function:

    static char popcount_table_8[256] = {
        /*0*/ 0, /*1*/ 1, /*2*/ 1, /*3*/ 2,
        /*4*/ 1, /*5*/ 2, /*6*/ 2, /*7*/ 3,
        /*8*/ 1, /*9*/ 2, /*10*/ 2, /*11*/ 3,
        ...
    };

    /* Table-driven popcount, with 8-bit tables */
    /* 6 ops plus 4 casts and 4 lookups, 0 long immediates, 4 stages */
    inline uint32_t
    popcount(uint32_t x)
    {
        return popcount_table_8[(uint8_t)x] +
               popcount_table_8[(uint8_t)(x >> 8)] +
               popcount_table_8[(uint8_t)(x >> 16)] +
               popcount_table_8[(uint8_t)(x >> 24)];
    }

(GCC and LLVM use the same fallback method.)

It's important that the fallback is as good as it gets, so that users of the primop don't have to implement their own fallback (which is very complicated, as the user would have to detect whether -msse4.2 is in use or not!). This precludes non-table based solutions, as they're slower.

I've implemented the primop but have run into a difficulty: to use the above fallback, I need the code to be statically linked into every binary, and I'm not quite sure how to achieve that. GCC manages by having the above function definition in libgcc, its run-time support library, which is linked into every binary. I think LLVM uses a small statically linked compiler run-time library for the same purpose.

How would one go about having a small C library linked into every Haskell binary? If we go ahead and implement more of these modern instructions, we're likely to need more fallbacks, so this isn't needed just for POPCNT.

Cheers,
Johan
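
P.S. For reference, here is a minimal, self-contained sketch of the table-driven fallback described above. It makes two assumptions that are not part of the original code: the table is built at compile time with helper macros (B2, B4, B6) instead of being written out by hand, and the function name fallback_popcount32 is hypothetical, not something GHC, GCC or LLVM define.

```c
#include <stdint.h>

/* Helper macros (an assumption of this sketch): each level expands to the
 * bit counts of four consecutive blocks, where the two extra high bits
 * contribute 0, 1, 1 and 2 set bits respectively. */
#define B2(n) n, n + 1, n + 1, n + 2
#define B4(n) B2(n), B2(n + 1), B2(n + 1), B2(n + 2)
#define B6(n) B4(n), B4(n + 1), B4(n + 1), B4(n + 2)

/* popcount_table_8[i] holds the number of set bits in the byte i. */
static const unsigned char popcount_table_8[256] = {
    B6(0), B6(1), B6(1), B6(2)
};

/* Table-driven popcount with 8-bit tables: one lookup per byte of x.
 * The name is hypothetical. */
static inline uint32_t
fallback_popcount32(uint32_t x)
{
    return (uint32_t)popcount_table_8[(uint8_t)x] +
           (uint32_t)popcount_table_8[(uint8_t)(x >> 8)] +
           (uint32_t)popcount_table_8[(uint8_t)(x >> 16)] +
           (uint32_t)popcount_table_8[(uint8_t)(x >> 24)];
}
```

The table costs 256 bytes of read-only data, which is the memory side of the trade-off mentioned above. A 64-bit variant could presumably call the same function on the low and high halves and add the results.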