New subject: Analyzing Haskell call graph (was: Thread on Discourse - HIE file processing)

31 Jul 2023

Hi Tristan,

I wouldn't do this with Core (cf. the inlining issue and the difficulty of associating what you find with source syntax).

I think you should use the output of the renamer instead, either with a GHC plugin using `renamedResultAction` or just by dumping the renamed AST (fully qualified) with `-ddump-rn-ast -ddump-to-file` and grepping for the names you want.
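If you would rather do the grepping step from Haskell, here is a minimal sketch. It assumes the dump has already been read into a `String`. The function name `mentions`, the sample text and the target name are made up for illustration and do not reflect the real dump layout.

```haskell
import Data.List (isInfixOf)

-- | Return the (line number, line) pairs of a dump that mention the
-- fully qualified target name. This is a plain substring search.
mentions :: String -> String -> [(Int, String)]
mentions target dump =
  [ (n, l) | (n, l) <- zip [1 ..] (lines dump), target `isInfixOf` l ]

main :: IO ()
main = do
  -- Hypothetical sample text standing in for a dumped renamed AST.
  let sample =
        unlines
          [ "some node referring to Some.Module.target"
          , "an unrelated node"
          ]
  mapM_ print (mentions "Some.Module.target" sample)
```

A substring search also matches longer names that share the same prefix, so the hits still need a manual look.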

Cheers,
Sylvain 

On 9 August 2023 at 21:07, Tristan Cacqueray wrote:
>
>On Mon, Jul 31, 2023 at 16:26 Tristan Cacqueray wrote:
>> On Mon, Jul 31, 2023 at 11:05 David Christiansen via ghc-devs wrote:
>>> Dear GHC devs,
>>>
>>> I think that having automated security advisory warnings from build tools
>>> is important for Haskell adoption in certain industries. This can be done
>>> based on build plans, but a package is really the wrong granularity - a
>>> large, widely-used package might export a little-used definition that is
>>> the subject of an advisory, and it would be good to warn only the users of
>>> said definition (cf base and readFloat).
>>>
>>> Tristan is exploring using HIE files to do this check, but I don't know if
>>> you read Discourse, where he posted the question:
>>>
>>> https://discourse.haskell.org/t/rfc-using-hie-files-to-list-external-declarations-for-cabal-audit/7147
>>>
>>
>> Thank you David for bringing this up here. One thing to note is that we
>> would need hie files for ghc libraries, as proposed in:
>>   https://gitlab.haskell.org/ghc/ghc/-/merge_requests/1337
>>
>> Cheers,
>> -Tristan
>
>Dear GHC devs,
>
>To recap, the goal of this project is to check if a given declaration is
>used by a package. For example, I would like to check whether a
>definition such as "package:Module.name" is reachable from another module.
>
>In this post I list the considered options, and raise some questions
>about using the simplified core from .hi files. 
>
>I would appreciate it if you could have a look and help me figure out the
>remaining blockers. Note that I'm not very familiar with the GHC
>internals and how to properly read Core expressions, so any feedback
>would be appreciated.
>
>
># Context and Problem Statement
>
>We would like to check if a package is affected by a known
>vulnerability. Instead of looking at the build dependencies names and
>versions, we would like to search for individual functions. This is
>particularly important to avoid false alarms when a given vulnerability
>only appears in a rarely used declaration of a popular package.
>
>Therefore, we need a way to search the whole call graph to assert with
>confidence that a given declaration is not used (i.e. not reachable).
>
>
># Considered Options
>
>To obtain the call graph data, the following options are considered:
>
>* .hie files produced when using the `-fwrite-ide-info` flag.
>* .modpack files produced by the [wpc-plugin][grin].
>* custom GHC plugin.
>* .hi files containing the simplified core when using the
>  `-fwrite-if-simplified-core` flag. 
>
>
># Pros and Cons of the Options
>
>### Hie files
>
>This option is similar to what [weeder][weeder] already implements.
>However this file format is designed for IDEs, and it may not be suitable
>for our problem. For example, RULES, deriving, RebindableSyntax and
>Template Haskell are not well captured.
>
>[weeder]: https://github.com/ocharles/weeder/
>
>### Modpack
>
>This option appears to work, but it seems overkill. I don't think we
>need to reach for STG representation.
>
>[grin]: https://github.com/grin-compiler/ghc-whole-program-compiler-project
>
>### Custom GHC plugin
>
>This option enables extra metadata to be collected, but if using the
>simplified core is enough, then it is just an extra step compared to
>using .hi files.
>
>### Hi files
>
>Using .hi files is the only option that doesn't require extra
>compilation artifacts: the necessary files are already part of the
>packages.
>
>To collect hie files or files generated by a GHC plugin, ghc/cabal/stack
>all need some extra work:
>
>- ghc libraries don't ship hie files
>  ([issue!16901](https://gitlab.haskell.org/ghc/ghc/-/issues/16901)).
>- cabal needs recent changes for hie files
>  ([PR#9019](https://github.com/haskell/cabal/pull/9019)) and plugin
>  artifacts ([PR#8662](https://github.com/haskell/cabal/pull/8662)).
>- stack doesn't seem to install hie files for global libraries.
>
>Moreover, creating artifacts with a plugin for ghc libraries may
>require manual steps because these libraries are not built by the
>end user.
>
>Therefore, using .hi files is the most straightforward solution.
>
>
># Questions
>
>In this section I present the current implementation of
>[cabal-audit](https://github.com/TristanCacqueray/cabal-audit/).
>
>
>## Collecting dependencies from core
>
>In the
>[cabal-audit-core:CabalAudit.Core](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-core/src/CabalAudit/Core.hs)
>module I implemented the logic to extract the call graph from Core
>expressions into a list of declarations composed of
>`UnitId:ModuleName.OccName` and their dependencies.
>
>Here is an example output for the
>[cabal-audit-test:CabalAudit.Test.User](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-test/src/CabalAudit/Test/User.hs)
>module:
>
>```ShellSession
>$ cabal run -O0 --write-ghc-environment=always cabal-audit-hi -- CabalAudit.Test.User
>cabal-audit-test:CabalAudit.Test.Inline.fonctionInlined: base:GHC.Num.$fNumInt, base:GHC.Num.-, ghc-prim:GHC.Types.I#
>cabal-audit-test:CabalAudit.Test.Instance.$fTestClassTea: cabal-audit-test:CabalAudit.Test.Instance.$ctasty1
>cabal-audit-test:CabalAudit.Test.Instance.$fTestClassCofee: cabal-audit-test:CabalAudit.Test.Instance.$ctasty
>cabal-audit-test:CabalAudit.Test.Instance.$ctasty: ghc-prim:GHC.Classes.&&, ghc-prim:GHC.Types.True
>cabal-audit-test:CabalAudit.Test.Instance.$ctasty1: base:GHC.Base.., cabal-audit-test:CabalAudit.Test.Instance.alwaysTrue, ghc-prim:GHC.Classes.not
>cabal-audit-test:CabalAudit.Test.Instance.alwaysTrue: base:GHC.Base.const, ghc-prim:GHC.Types.True
>cabal-audit-test:CabalAudit.Test.User.monDoubleDecr: base:GHC.Num.$fNumInt, base:GHC.Num.-, cabal-audit-test:CabalAudit.Test.Inline.fonctionInlined, ghc-prim:GHC.Types.I#
>cabal-audit-test:CabalAudit.Test.User.useAlwaysTrue: cabal-audit-test:CabalAudit.Test.Instance.Tea, cabal-audit-test:CabalAudit.Test.Instance.$fTestClassTea
>cabal-audit-test:CabalAudit.Test.User.useCofeeInstance: cabal-audit-test:CabalAudit.Test.Instance.Cofee, cabal-audit-test:CabalAudit.Test.Instance.$fTestClassCofee
>```
>
>This appears correct, in particular:
>
>- Type class instances are uniquely identified (that was not working
>  well when using a custom plugin).
>- Inlinable declarations are not inlined in the simplified core when
>  built with `-O0`.
>
>However this is collecting extra definitions that are not part of the
>source file. I understand that '$fTestClassTea' means the 'TestClass'
>instance of 'Tea'. But it seems like the actual implementation is behind
>the extra '$ctasty' declaration. Moreover, when analyzing the other test
>modules, I see many declarations named 'lvlXX', which I guess are local
>names that have been floated out.
>
>This is not ideal because the resulting graph contains extra edges that
>are not relevant for the end user. I tried to tidy this using
>'isExportedId' and 'idDetails' from 'GHC.Types.Var' but I worry that
>this is not a good strategy. So my question is: how to recover the
>original declaration context of core expressions, so that the resulting
>dependency graph only contains edges that are part of the source
>declarations? I assume this can be done by dissolving the declarations
>starting with '$' or 'lvl', but it would be good to know how to do that
>reliably.
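To make the "dissolving" idea concrete, here is a minimal sketch over a plain dependency map. It assumes the graph has already been extracted. `Decl`, `isGenerated`, `sourceDeps` and `dissolve` are hypothetical names, not part of cabal-audit or GHC, and the name test is only a heuristic guess at which names are compiler-generated, not something GHC guarantees.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.List (stripPrefix)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | A declaration: a "unit:Module" part and an occurrence name.
data Decl = Decl { declModule :: String, declOcc :: String }
  deriving (Eq, Ord, Show)

type Graph = Map.Map Decl [Decl]

-- | Heuristic (an assumption): '$' followed by a letter ($f.., $c.., $w..)
-- or "lvl" followed only by digits. Operators such as ($!) are kept.
isGenerated :: Decl -> Bool
isGenerated (Decl _ occ) = case occ of
  '$' : c : _ -> isAlpha c
  _ -> maybe False (all isDigit) (stripPrefix "lvl" occ)

-- | Dependencies of a declaration, with every generated name replaced
-- (transitively) by its own dependencies.
sourceDeps :: Graph -> Decl -> Set.Set Decl
sourceDeps g root = go Set.empty Set.empty (Map.findWithDefault [] root g)
  where
    go _ acc [] = acc
    go seen acc (d : ds)
      | d `Set.member` seen = go seen acc ds
      | isGenerated d = go seen' acc (Map.findWithDefault [] d g ++ ds)
      | otherwise = go seen' (Set.insert d acc) ds
      where
        seen' = Set.insert d seen

-- | Drop generated nodes and reconnect their users to what they use.
dissolve :: Graph -> Graph
dissolve g =
  Map.fromList
    [ (d, Set.toList (sourceDeps g d)) | d <- Map.keys g, not (isGenerated d) ]

main :: IO ()
main = do
  let m = "cabal-audit-test:CabalAudit.Test.Instance"
      u = "cabal-audit-test:CabalAudit.Test.User"
      -- A few edges copied from the output above.
      g =
        Map.fromList
          [ (Decl u "useAlwaysTrue", [Decl m "Tea", Decl m "$fTestClassTea"])
          , (Decl m "$fTestClassTea", [Decl m "$ctasty1"])
          , (Decl m "$ctasty1", [Decl "base:GHC.Base" ".", Decl m "alwaysTrue"])
          ]
  mapM_ print (Map.toList (dissolve g))
```

With this, `useAlwaysTrue` ends up pointing at `Tea`, `(.)` and `alwaysTrue` directly. The weak point is `isGenerated`: a user-defined name that happens to look like `lvl1` would be dissolved too.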
>
>
>## Handling inlined declaration
>
>When compiling with `-O1`, declarations seem to be inlined in the
>simplified core. In that case, is it possible to recover the original
>inlined OccName?
>
>If not, I guess we have to use a GHC plugin.
>I investigated this strategy in
>[cabal-audit-plugin:CabalAudit.Plugin](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-plugin/src/CabalAudit/Plugin.hs).
>
>However I am not sure this is done correctly and I could use some
>guidance on how to proceed.
>
>
>## Loading hidden module
>
>If I understand correctly, accessing the ModIface mi_extra_decls to get
>the simplified core requires an HscEnv. 
>In the
>[cabal-audit-hi:GhcExtras](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-hi/src/GhcExtras.hs)
>module, I put together the following helpers using GHC as a library:
>
>```haskell
>-- | Set up a Ghc session using the packages found in the local
>-- environment file.
>runGhcWithEnv :: Ghc a -> IO a
>
>-- | Look up a module and extract the simplified core.
>getCoreBind :: ModuleName -> Maybe FastString -> Ghc (Maybe (Module, [CoreBind]))
>```
>
>However this doesn't work for hidden modules: trying to load them with
>'GHC.lookupModule' fails with this error:
>
>```ShellSession
>    Could not load module `GHC.Event.Thread'
>    it is a hidden module in the package `base-4.18.0.0'
>```
>
>I tried to reset the hsc_env.hsc_dflags.hiddenModules but without luck.
>Is there a trick to access the ModIface of hidden modules?
>
>
>## Including simplified core in .hi files by default
>
>In the cabal-audit flake, I am using a nix override to set the
>`-fwrite-if-simplified-core` ghc-options by default and to patch the ghc
>build phase to use the `+hi_core` hadrian transformer.
>
>To avoid rebuilding the dependencies, it would be great to have the
>simplified core in the hi file by default.
>Is there an issue or a downside when enabling the flag by default?
>Could the libraries shipped with GHC contain the simplified core in the
>future?
>
>
>## Declaration identifications
>
>In the
>[cabal-audit-command:CabalAudit.Command](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-command/src/CabalAudit/Command.hs)
>module, I implemented a proof of concept reverse lookup to find
>reachable declarations. For example using this command:
>
>```ShellSession
>$ cabal-audit-hi --target GHC.Exception.throw CabalAudit.Test.Simple
>base:GHC.Exception.throw
>|
>`- base:GHC.IO.Handle.Internals.ioe_finalizedHandle
>   |
>   `- base:GHC.IO.Handle.FD.$wstdHandleFinalizer
>      |
>      `- base:GHC.IO.Handle.FD.stdout
>         |
>         +- base:System.IO.putStrLn1
>         |  |
>         |  `- base:System.IO.putStrLn
>         |     |
>         |     `- cabal-audit-test:CabalAudit.Test.Simple.afficheNombre
>         |
>         `- base:System.IO.putStr1
>            |
>            `- base:System.IO.putStr
>               |
>               `- cabal-audit-test:CabalAudit.Test.Simple.maFonction
>```
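The reverse lookup itself can be sketched as a traversal of the flipped dependency graph. This is a minimal illustration, not the cabal-audit implementation: `reverseGraph` and `usersOf` are hypothetical names, and the sample graph is a shortened, partly made-up version of the chain shown above.

```haskell
import Data.List (isPrefixOf)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Name = String

-- | Maps each declaration to the declarations it uses.
type Graph = Map.Map Name [Name]

-- | Flip every edge: map each declaration to its direct users.
reverseGraph :: Graph -> Graph
reverseGraph g =
  Map.fromListWith (++)
    [ (dep, [user]) | (user, deps) <- Map.toList g, dep <- deps ]

-- | All declarations from which the target is reachable.
usersOf :: Graph -> Name -> Set.Set Name
usersOf g target = go Set.empty (Map.findWithDefault [] target rev)
  where
    rev = reverseGraph g
    go seen [] = seen
    go seen (n : ns)
      | n `Set.member` seen = go seen ns
      | otherwise = go (Set.insert n seen) (Map.findWithDefault [] n rev ++ ns)

main :: IO ()
main = do
  let g =
        Map.fromList
          [ ("base:System.IO.putStrLn", ["base:GHC.IO.Handle.FD.stdout"])
          , ("base:GHC.IO.Handle.FD.stdout", ["base:GHC.Exception.throw"])
          , ( "cabal-audit-test:CabalAudit.Test.Simple.afficheNombre"
            , ["base:System.IO.putStrLn"] )
          , -- Hypothetical declaration that never reaches the target.
            ("cabal-audit-test:CabalAudit.Test.Simple.unrelated", ["base:GHC.Num.+"])
          ]
      affected =
        filter
          ("cabal-audit-test:" `isPrefixOf`)
          (Set.toList (usersOf g "base:GHC.Exception.throw"))
  -- Prints only the afficheNombre declaration.
  mapM_ putStrLn affected
```

The visited set keeps the traversal finite on recursive declarations. Printing the tree of paths, as in the output above, would additionally require remembering the parent of each visited node.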
>
>In the event a vulnerability happens in a type class instance, how to
>identify the affected instance?
>Instead of using 'package:Module.$fClassNameDataName', is there an
>established format we could use (for example "Typeclass X instance of T")?
>
>What about data types or type families, would it make sense to include
>them in the graph? If so, how to identify them in the advisory database?
>
>
>Please let me know if I missed something.
>Thanks for your time!
>-Tristan
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>ghc-devs mailing list
>ghc-devs@haskell.org
>http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
