
Hi,

I'm trying to optimise a short Haskell program (inspired by Rust's performance). Simple profiling shows that over 90% of the time and allocations are spent in a `ReadP`-based parser. Since the parser is fairly small, I assumed GHC would inline/optimise most of it. I tried adding some `SPECIALISE` pragmas, but found out they don't work: they require `INLINE`/`INLINEABLE` pragmas or compiling with `-O`. The latter requirement confuses me, as I thought the dependencies of a Stack project were compiled with optimisation enabled.

I copied the entirety of the `ReadP` module into my project, along with what I used from `Text.Read.Lex`, so I could optimise this code locally. I managed to get some speedups by specialising the integer parsers. Cloning the `ReadP` code also helped with more detailed profiling reports; I now know that the majority of time and allocations are spent in `>>=`.

That's it for the long-winded intro; here are my questions:

- What is the problem behind GHC complaining about missing `INLINE(ABLE)` pragmas and/or `-O` in Stack dependencies?
- Are these things simply never inlined unless library authors specifically mark them as such?
- What profiling tools do people usually use in situations like this?
- I cannot specialise `>>=` even when I mark it `INLINEABLE` in the `Monad ReadP` instance; apparently the pragma has to be placed at the declaration site (at least that's what the warning says), so I run into the same `INLINE(ABLE)`/`-O` issue as above. That seems pretty silly: I feel I should be able to inline/specialise this particular instance without worrying about the global declaration of `>>=`. Is there a workaround for this?
- I also used `BangPatterns` in my optimisations, but with hit-and-miss results. They (surprisingly) helped in small, local definitions marked `INLINE` (I assumed strictness analysis would pick up on those), but in other places, such as `ReadP`'s `>>=`, strictness annotations slightly worsened performance.
- How do people look for good patterns to annotate as strict?
- How do I find out to what extent laziness is slowing things down, or look for places with unintended thunks?

As these are general optimisation questions, I'm not looking for suggestions of alternative parsing libraries. I'd like to write high-level code and have it compose efficiently; this is a toy program anyway, and I'm wondering how far I can get with such a simple parsing library. I suspect the majority of allocations coming from the parser are short-lived garbage that I should be able to avoid with optimisations and maybe a little refactoring.

One thing specific to my implementation is that the parser first produces a `[(Int, Int)]` and then converts it to an `IntMap (Set Int)` with a `foldl'`. I would like to ensure the list is never built in the first place; is there a way to do that?

As for the code itself, it's only 90 lines, but I'm not sure what the consensus is on attachments on this mailing list. I can provide the whole Stack project, along with a test case and a reference solution in Rust, upon request. (I'd post it on my GitHub, but there's a slight chance a participant in the competition this is a reference solution to might find it there.)

Best regards,
Andrew
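P.S. To make the pragma question concrete, here's a minimal sketch of the placement rule as I understand it (all names are mine, not from any real library). `INLINABLE` at the definition site records the function's unfolding in the interface file; without that (or `-O` plus `-fexpose-all-unfoldings` in the defining package), a `SPECIALISE` in an importing module has nothing to work with. Both pragmas live in one module here just so the file compiles:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- INLINABLE at the declaration site keeps the unfolding available to
-- other modules; this is what the GHC warning is asking for.
{-# INLINABLE sumNum #-}
sumNum :: Num a => [a] -> a
sumNum = go 0
  where go !acc (x:xs) = go (acc + x) xs
        go !acc []     = acc

-- A specialisation GHC can only honour because the unfolding exists.
-- In my real code this would sit in the *importing* module.
{-# SPECIALISE sumNum :: [Int] -> Int #-}

main :: IO ()
main = print (sumNum [1 .. 1000 :: Int])
```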
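P.P.S. For the `>>=` question, here is the shape of the workaround I'm experimenting with, on a toy CPS-free parser in the style of `ReadP` (again, all names made up): define bind as a top-level function carrying the pragma, and have the instance method delegate to it, since GHC is happy with pragmas on top-level definitions:

```haskell
{-# LANGUAGE LambdaCase #-}
module Main where

-- A toy list-of-successes parser, just to show where the pragma lives.
newtype P a = P { runP :: String -> [(a, String)] }

-- The workaround: a top-level bind with the pragma attached, which the
-- instance simply delegates to.
{-# INLINE bindP #-}
bindP :: P a -> (a -> P b) -> P b
bindP (P p) f = P $ \s -> concatMap (\(a, s') -> runP (f a) s') (p s)

instance Functor P where
  fmap f (P p) = P $ \s -> [ (f a, s') | (a, s') <- p s ]

instance Applicative P where
  pure a = P $ \s -> [(a, s)]
  P pf <*> P pa = P $ \s ->
    [ (f a, s'') | (f, s') <- pf s, (a, s'') <- pa s' ]

instance Monad P where
  (>>=) = bindP

-- Consume a single character, failing on empty input.
item :: P Char
item = P $ \case { (c:cs) -> [(c, cs)]; [] -> [] }

main :: IO ()
main = print (runP (item >>= \a -> item >>= \b -> pure [a, b]) "hi!")
```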
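P.P.P.S. To illustrate the kind of thunk build-up I'm hunting for with the laziness question: a lazy accumulator builds a chain of `(+)` thunks that only collapses at the end, while a banged one stays evaluated. (My understanding is that a large THUNK band in a heap profile taken with `+RTS -hy` is the usual symptom.)

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Lazy accumulator: allocates a thunk per list element.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where go acc (x:xs) = go (acc + x) xs
        go acc []     = acc

-- Strict accumulator: the bang forces each partial sum immediately.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where go !acc (x:xs) = go (acc + x) xs
        go !acc []     = acc

main :: IO ()
main = print (sumLazy [1 .. 1000], sumStrict [1 .. 1000])
```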
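Finally, on the `[(Int, Int)]` question, this is roughly the refactor I have in mind, with a hand-rolled `step` standing in for my real pair parser (the names and input format here are invented for the sketch): thread the strict `IntMap` through the parsing loop so the intermediate list is never allocated at all.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Char (isDigit)
import qualified Data.IntMap.Strict as IM
import qualified Data.Set as S

-- Stand-in for one application of the real pair parser: reads "k,v"
-- followed by optional newlines, or fails with Nothing.
step :: String -> Maybe ((Int, Int), String)
step s = case span isDigit s of
  ([], _) -> Nothing
  (a, ',':rest) -> case span isDigit rest of
    ([], _)    -> Nothing
    (b, rest') -> Just ((read a, read b), dropWhile (== '\n') rest')
  _ -> Nothing

-- Fold directly into the map as pairs are parsed; the bang keeps the
-- accumulator evaluated, and no [(Int, Int)] is ever built.
parseToMap :: String -> IM.IntMap (S.Set Int)
parseToMap = go IM.empty
  where
    go !m s = case step s of
      Nothing            -> m
      Just ((k, v), s')  -> go (IM.insertWith S.union k (S.singleton v) m) s'

main :: IO ()
main = print (parseToMap "1,2\n1,3\n2,5\n")
```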
Andrew Kvapil