TL;DR: Upcoming HLS releases will be substantially faster and more responsive for large codebases using Template Haskell, thanks to work by Well-Typed on behalf of Mercury.
Haskell Language Server (HLS) is a key part of the IDE experience for Haskell. Well-Typed are actively exploring ways to maintain and improve HLS, in particular making it robust and performant on large projects.
Recently, Mercury asked us to deal with performance problems in HLS that were preventing their developers from using it effectively. In particular, for large projects using Template Haskell, HLS would take a long time to start up, and would spend far too much time recompiling after each edit to a file. This made it difficult to use interactively with such projects.
This post describes the progress we have made in improving performance in GHC and HLS for projects using Template Haskell. The work has been primarily completed by Zubin Duggal with help from Matthew Pickering. It involves three key changes:
Speeding up recompilation by reducing how often GHC and HLS need to recompile modules after their dependencies are changed.
Reducing start-up times by serializing Core code to interface files.
Taking advantage of the serialized Core to generate bytecode only when really needed.
We will discuss each of these in turn, followed by benchmarks illustrating the impact of the changes.
Speeding up recompilation given Template Haskell
For interactive development with HLS, a project is fully checked once on startup. Subsequently, when the developer edits a file, that module and any modules that depend on it may then need to be re-checked. In the absence of Template Haskell, HLS can type-check a project and then stop without actually doing code generation, thereby avoiding lots of computation. However, when Template Haskell is used, functions used in TH splices need to be compiled to bytecode or object code in order to run the splices and continue type-checking.
GHC implements a “recompilation check” that decides whether a given module needs to be recompiled when one of its dependencies has changed. For example, if the only function that has changed is never used in the module, there is no need to recompile. This can save substantial amounts of work.
However, in GHC 9.2 and previous versions, where a module enables Template Haskell, that module will be recompiled whenever any of its dependencies are modified at all. The key requirement is that if the definition of any identifier used in a splice is modified, then we need to recompile. By eagerly recompiling if anything changes at all, GHC will always catch the case when something used in a splice is changed, but it’s not a very precise solution, and leads to many unnecessary recompiles.
The crux of the improved recompilation performance is to recompile only if a module which defines a symbol used in a splice is changed, rather than any module at all. We have implemented this scheme in both GHC and HLS:
The HLS patch will be available in the next HLS release, and will apply when using HLS with any supported GHC version.
The GHC patch will be part of the GHC 9.4 release, and will apply when compiling outside of HLS with GHC or GHCi.
Fine-grained recompilation avoidance in HLS
HLS uses its own build system,
hls-graph
, based on Shake but
adapted to suit the needs of HLS (e.g. storing everything graph in memory rather
than on disk, and minimizing rebuilds). The build system defines rules that
specify when changes to one build product may affect another. However, these
rules are necessarily somewhat coarser-grained than the fine-grained
recompilation avoidance check implemented by GHC.
Previously, the build system could reuse old build products only if the rule wasn’t triggered at all, or if old build products were already available on disk. This meant that even if GHC’s fine-grained recompilation avoidance check knew that recompilation was not required, HLS would still have to recompile in order to have the build products around for subsequently running splices.
Thus another crucial improvement to speed up recompilation is to allow the build rules to have access to the value computed on a previous run. Then when a rule is rebuilt, it can first invoke GHC’s fine-grained recompilation check, and if that indicates no recompilation is necessary, immediately return the previous value.
Speeding up start-up with serialized Core
We have discussed recompilation performance during editing, but what about when HLS starts up?
If the project been compiled previously, then in the absence of
Template Haskell, HLS will load interface (.hi
) files from disk in order to
recreate the internal state without much additional work.
However, for modules which use Template Haskell, the compiled bytecode or object code is needed to run splices, in addition to the contents of the interface files. Prior to our work, HLS could not make use of existing compiled code, so it had to recompile everything from scratch, leading to long start-up times every time the project was opened.
How can we avoid this? It turns out to be difficult to use either bytecode1 or object code2 in this context, so we chose a third option: after compiling modules, serialize the Core to disk as part of a “fat interface file”, and use this to generate bytecode for the modules we need on startup. Core is GHC’s intermediate language, so generating bytecode from Core is not a particularly expensive operation. Moreover, the machinery needed to serialize Core already exists in GHC, because GHC serializes some Core for unfoldings into interface files. Thus it was easy to add fat interface file support into HLS while maintaining support for all the GHC versions it currently compiles with.
This significantly reduces start-up times on codebases that use a lot of Template Haskell (provided the code has been compiled once before). This change has draft implementations in progress for HLS, and in GHC itself.
Generating bytecode on demand
Usually, before typechecking a module with TH and running its splices, HLS has to populate the environment it passes to GHC with the compiled bytecode or object code for all the modules that may be used in a splice (i.e. all the imports of the module).
However, by installing a GHC hook, we can intercept splices just before they are compiled and run, and inspect them to discover precisely which modules are actually used in a particular splice. Once we have the set of modules that a splice actually depends on, we can populate the GHC environment by compiling the bytecode of exactly that set of modules and no more. The set of modules used in TH splices is usually much smaller than the total set of imports, so this avoids having to compile many modules to bytecode.
This optimisation is only made possible with the ability to serialize Core, as we can generate bytecode on demand using the Core file we have written out. Without the serialized Core, we would have to keep the intermediate typechecked or Core ASTs of all modules in memory, which is not feasible due to the heap bloat this would cause. HLS has invariants that try really hard to ensure that intermediate ASTs are only kept in memory for files that the user has opened in their editor, as doing so for every single file in the project is infeasible for projects that span over a few dozen modules.
We have a work-in-progress implementation of this change in HLS, but a few issues remain to be ironed out before it is ready to be reviewed and merged. Doing this in GHC would be more difficult; a possible alternative would be to allow the user to control which identifiers can be used in splices via Explicit Splice Imports.
Benchmarks
We conducted some benchmarks on a large commercial Haskell repository, which consists of around 2000 modules and approximately 200,000 lines of code. These benchmarks used the existing HLS benchmarking infrastructure to start HLS, edit files and perform HLS requests (e.g. to look up a definition or get information about a source code location).
The HLS versions being compared are:
- HLS prior to these changes (
baseline
), - the new recompilation avoidance scheme (
avoid-recompile
), - the strategy to serialize Core along with the new recompilation avoidance scheme (
serialize-core
), and - the optimisation to
serialize-core
to allow us to generate bytecode only on demand (on-demand-bytecode
).
All benchmarks are conducted with a warm cache (i.e. the project has been
compiled at least once in the past, so its .hi
files are present on disk).
Set 1: Comparison against baseline
This benchmark consists of edits to 3 files, followed by a “get definition” request in each of the files, repeated 20 times. The 3 files chosen consisted of one file relatively deep in the module hierarchy, one in the upper two-thirds, and another one near the very top. These were chosen as to provide a realistic accounting of the kinds of edits users might perform in an actual IDE session, taking into account a diverse set of files from across the entire module hierarchy.
version | baseline | avoid-recompile | serialize-core | on-demand-bytecode |
---|---|---|---|---|
time waiting for responses | 448.5s | 423.8s | 80.1s | 36.2s |
time after responding | 10852.4s | 503.3s | 315.0s | 98.3s |
total time | 11300.9s | 927.3s | 395.2s | 134.4s |
initial load time | 447.6 | 429.2s | 84.8s | 46.9s |
average time per response | 0.019s | 0.010s | 0.011s | 0.008s |
GHC rebuilds | 21680 | 2238 | 339 | 339 |
max residency | 7093MiB | 4903MiB | 3937MiB | 3078MiB |
bytes allocated | 3817GiB | 676GiB | 533GiB | 254GiB |
The time measurements are wall-clock times:
- “time waiting for responses” is the sum of the times between requests being issued and the user seeing responses to their requests (including the initial load in order to respond to the first request).
- “time after responding” is the sum of the time spent after responses have been returned but HLS is still busy (e.g. recompiling after an edit).
- “total time” is the time taken to run the entire benchmark.
- “initial load time” is the time taken for HLS to load the project and come back to idle.
- “average time per response” is the average amount of time HLS took to respond to a request (not counting the initial load).
- “GHC rebuilds” counts the total number of times HLS called into the GHC API to rebuild a module.
The total time and number of GHC rebuilds for baseline
is vastly higher
than the others, because without the recompilation changes we are essentially
compiling the entire project on every edit to the file deep in the module
hierarchy, while avoid-recompile
or serialize-core
do a comparatively minimal
amount of recompilation work on every edit. The actual factor of improvement is
not very meaningful (it could be made arbitrarily high by increasing the number
of iterations). However, it does show a very significant improvement to the user
experience of the IDE: getting up to date information, warnings and errors much
faster compared to the status quo.
Between avoid-recompile
and serialize-core
the total time decreases further,
because on a fresh start, the files can be loaded directly from disk rather
than recompiling. Looking at the “GHC rebuilds” column, avoid-recompile
needs to rebuild every file once at startup, and then do a minimal amount of
recompilation on edits. serialize-core
and on-demand-bytecode
have to do some much smaller
amount of recompilation on startup due to editing a file deep in the hierarchy,
and do the same amount of recompilation due to edits as avoid-recompile
.
With on-demand-bytecode
, we see another dramatic
improvement to initial load times as we we can avoid compiling many files to
bytecode. We also see a dramatic improvement in the total time as we avoid all
this work even on recompiles.
Looking at a graph of heap usage against time shows the dramatic impact of the changes:
Set 2: Impact of serializing Core and generating bytecode on demand
This set of varied benchmarks compares just the avoid-recompile
,
serialize-core
and on-demand-bytecode
changes, since they are all
substantially better than baseline
. These benchmarks consisted of 50
repetitions each.
This time, only the two files near the top were edited as part of the benchmark, because the third file is relatively deep in the module hierarchy, and editing it invalidates a large amount of build products we have on disk, somewhat damping the effect of faster startups due to serialized Core.
Full results are given in the Appendix below. The heap usage vs time graph for some of these benchmarks is also included, for example here is the graph for the first benchmark in this set:
The initial upward sloping segment shows us the initial load, and the flat,
spiky segment is the section where we are editing, performing requests and
recompiling. The vertical lines show the times at which the initial load is
complete. It is clear that the initial load times are vastly improved with the
serialize-core
changes, going from consistently over 300s to under 100s, and
further improved by the on-demand-bytecode
changes, further reducing to around
60s.
Of course, in practice, the impact of these improvements differs quite heavily depending on usage. If the first thing done by the user is to open and edit something deep in the module hierarchy upon which a lot of things depend, the improvements can be quite minimal. If on the other hand something at the top of the module hierarchy is opened first, startup times would be greatly improved because we wouldn’t need to recompile anything below it.
Conclusion
We are grateful to Mercury for funding this work, as it will benefit the whole Haskell community. We have already made significant progress improving the performance of HLS, and are continuing to identify further opportunities for performance improvements.
Well-Typed are actively looking for funding to continue maintaining and enhancing HLS and GHC. If your company relies on GHC or HLS, and you could support this work, or would like help improving the developer experience for your Haskell engineers, please get in touch with us via info@well-typed.com!
Appendix: benchmark results
Hover after edit
version | avoid-recompile | serialize-core | on-demand-bytecode |
---|---|---|---|
time waiting for responses | 324.6s | 94.3s | 48.4s |
time after responding | 295.0s | 336.2s | 102.5s |
total time | 619.5s | 430.5s | 151.0s |
initial load time | 324.2s | 98.5s | 52.2s |
average time per response | 0.005s | 0.005s | 0.006s |
GHC rebuilds | 1861 | 339 | 339 |
max residency | 4697MiB | 3777MiB | 2787MiB |
bytes allocated | 595GiB | 533GiB | 178GiB |
getDefinition after edit
version | avoid-recompile | serialize-core | on-demand-bytecode |
---|---|---|---|
time waiting for responses | 456.0s | 92.9s | 93.1s |
time after responding | 387.8s | 315.3s | 60.4s |
total time | 843.8s | 408.1s | 153.5s |
initial load time | 450.2s | 89.5s | 63.2s |
average time per response | 0.05s | 0.04s | 0.04s |
GHC rebuilds | 1861 | 339 | 339 |
max residency | 4728MiB | 3762MiB | 2887MiB |
bytes allocated | 589GiB | 385GiB | 178GiB |
Completions after edit
version | avoid-recompile | serialize-core | on-demand-bytecode |
---|---|---|---|
time waiting for responses | 785.2s | 440.5s | 145.3s |
time after responding | 247.7s | 233.3s | 24.2s |
total time | 1032.9s | 673.8s | 169.5s |
initial load time | 443.9s | 92.0s | 62.0s |
average time per response | 1.25s | 1.24s | 0.94s |
GHC rebuilds | 1861 | 339 | 339 |
max residency | 4982MiB | 4012MiB | 3008MiB |
bytes allocated | 816GiB | 598GiB | 198iB |
Code actions after edit
version | avoid-recompile | serialize-core | on-demand-bytecode |
---|---|---|---|
time waiting for responses | 479.8s | 426.1s | 139.3s |
time after responding | 317.8s | 72.3s | 0.017s |
total time | 794.6s | 494.4s | 139.3s |
initial load time | 314.9s | 72.9s | 60.7s |
average time per response | 4.05s | 4.34s | 0.80s |
GHC rebuilds | 1861 | 339 | 339 |
max residency | 4990MiB | 3983MiB | 2981MiB |
bytes allocated | 685GiB | 468GiB | 241GiB |
Footnotes
Directly serializing bytecode is difficult to implement because the bytecode format hasn’t been very stable across the GHC versions HLS supports, and isn’t designed for easy serialization as it contains many references to things that exist only in memory.↩︎
Unlike bytecode, object code already comes with a serializable format, but has several other drawbacks:
- Dynamically loaded code is difficult to unload and practically impossible on some platforms. This can make a big difference with interactive use, as memory usage can grow linearly with edits (older object code is never truly unloaded, newer code just shadows the old code). In addition to this, bugs with code unloading on some GHC versions also posed issues.
- Generating and emitting native code is usually much slower than bytecode, and even though the emitted code may be faster, the cost of the former usually dominates the latter when it comes to the small, usually quite simple types of programs that are executed due to Template Haskell splices
- Linking object code can also be quite expensive.