Performance improvements for HLS

Zubin Duggal, Matthew Pickering, Adam Gundry – Friday, 01 April 2022

all build-systems ghc hls mercury performance

TL;DR: Upcoming HLS releases will be substantially faster and more responsive for large codebases using Template Haskell, thanks to work by Well-Typed on behalf of Mercury.

Haskell Language Server (HLS) is a key part of the IDE experience for Haskell. Well-Typed are actively exploring ways to maintain and improve HLS, in particular making it robust and performant on large projects.

Recently, Mercury asked us to deal with performance problems in HLS that were preventing their developers from using it effectively. In particular, for large projects using Template Haskell, HLS would take a long time to start up, and would spend far too much time recompiling after each edit to a file. This made it difficult to use interactively with such projects.

This post describes the progress we have made in improving performance in GHC and HLS for projects using Template Haskell. The work has been primarily completed by Zubin Duggal with help from Matthew Pickering. It involves three key changes:

Speeding up recompilation by reducing how often GHC and HLS need to recompile modules after their dependencies are changed.
Reducing start-up times by serializing Core code to interface files.
Taking advantage of the serialized Core to generate bytecode only when really needed.

We will discuss each of these in turn, followed by benchmarks illustrating the impact of the changes.

Speeding up recompilation given Template Haskell

For interactive development with HLS, a project is fully checked once on startup. Subsequently, when the developer edits a file, that module and any modules that depend on it may then need to be re-checked. In the absence of Template Haskell, HLS can type-check a project and then stop without actually doing code generation, thereby avoiding lots of computation. However, when Template Haskell is used, functions used in TH splices need to be compiled to bytecode or object code in order to run the splices and continue type-checking.

GHC implements a “recompilation check” that decides whether a given module needs to be recompiled when one of its dependencies has changed. For example, if the only function that has changed is never used in the module, there is no need to recompile. This can save substantial amounts of work.

However, in GHC 9.2 and previous versions, where a module enables Template Haskell, that module will be recompiled whenever any of its dependencies are modified at all. The key requirement is that if the definition of any identifier used in a splice is modified, then we need to recompile. By eagerly recompiling if anything changes at all, GHC will always catch the case when something used in a splice is changed, but it’s not a very precise solution, and leads to many unnecessary recompiles.

The crux of the improved recompilation performance is to recompile only if a module which defines a symbol used in a splice is changed, rather than any module at all. We have implemented this scheme in both GHC and HLS:

The HLS patch will be available in the next HLS release, and will apply when using HLS with any supported GHC version.
The GHC patch will be part of the GHC 9.4 release, and will apply when compiling outside of HLS with GHC or GHCi.

Fine-grained recompilation avoidance in HLS

HLS uses its own build system, hls-graph, based on Shake but adapted to suit the needs of HLS (e.g. storing everything graph in memory rather than on disk, and minimizing rebuilds). The build system defines rules that specify when changes to one build product may affect another. However, these rules are necessarily somewhat coarser-grained than the fine-grained recompilation avoidance check implemented by GHC.

Previously, the build system could reuse old build products only if the rule wasn’t triggered at all, or if old build products were already available on disk. This meant that even if GHC’s fine-grained recompilation avoidance check knew that recompilation was not required, HLS would still have to recompile in order to have the build products around for subsequently running splices.

Thus another crucial improvement to speed up recompilation is to allow the build rules to have access to the value computed on a previous run. Then when a rule is rebuilt, it can first invoke GHC’s fine-grained recompilation check, and if that indicates no recompilation is necessary, immediately return the previous value.

Speeding up start-up with serialized Core

We have discussed recompilation performance during editing, but what about when HLS starts up?

If the project been compiled previously, then in the absence of Template Haskell, HLS will load interface (.hi) files from disk in order to recreate the internal state without much additional work.

However, for modules which use Template Haskell, the compiled bytecode or object code is needed to run splices, in addition to the contents of the interface files. Prior to our work, HLS could not make use of existing compiled code, so it had to recompile everything from scratch, leading to long start-up times every time the project was opened.

How can we avoid this? It turns out to be difficult to use either bytecode¹ or object code² in this context, so we chose a third option: after compiling modules, serialize the Core to disk as part of a “fat interface file”, and use this to generate bytecode for the modules we need on startup. Core is GHC’s intermediate language, so generating bytecode from Core is not a particularly expensive operation. Moreover, the machinery needed to serialize Core already exists in GHC, because GHC serializes some Core for unfoldings into interface files. Thus it was easy to add fat interface file support into HLS while maintaining support for all the GHC versions it currently compiles with.

This significantly reduces start-up times on codebases that use a lot of Template Haskell (provided the code has been compiled once before). This change has draft implementations in progress for HLS, and in GHC itself.

Generating bytecode on demand

Usually, before typechecking a module with TH and running its splices, HLS has to populate the environment it passes to GHC with the compiled bytecode or object code for all the modules that may be used in a splice (i.e. all the imports of the module).

However, by installing a GHC hook, we can intercept splices just before they are compiled and run, and inspect them to discover precisely which modules are actually used in a particular splice. Once we have the set of modules that a splice actually depends on, we can populate the GHC environment by compiling the bytecode of exactly that set of modules and no more. The set of modules used in TH splices is usually much smaller than the total set of imports, so this avoids having to compile many modules to bytecode.

This optimisation is only made possible with the ability to serialize Core, as we can generate bytecode on demand using the Core file we have written out. Without the serialized Core, we would have to keep the intermediate typechecked or Core ASTs of all modules in memory, which is not feasible due to the heap bloat this would cause. HLS has invariants that try really hard to ensure that intermediate ASTs are only kept in memory for files that the user has opened in their editor, as doing so for every single file in the project is infeasible for projects that span over a few dozen modules.

We have a work-in-progress implementation of this change in HLS, but a few issues remain to be ironed out before it is ready to be reviewed and merged. Doing this in GHC would be more difficult; a possible alternative would be to allow the user to control which identifiers can be used in splices via Explicit Splice Imports.

Benchmarks

We conducted some benchmarks on a large commercial Haskell repository, which consists of around 2000 modules and approximately 200,000 lines of code. These benchmarks used the existing HLS benchmarking infrastructure to start HLS, edit files and perform HLS requests (e.g. to look up a definition or get information about a source code location).

The HLS versions being compared are:

HLS prior to these changes (baseline),
the new recompilation avoidance scheme (avoid-recompile),
the strategy to serialize Core along with the new recompilation avoidance scheme (serialize-core), and
the optimisation to serialize-core to allow us to generate bytecode only on demand (on-demand-bytecode).

All benchmarks are conducted with a warm cache (i.e. the project has been compiled at least once in the past, so its .hi files are present on disk).

Set 1: Comparison against baseline

This benchmark consists of edits to 3 files, followed by a “get definition” request in each of the files, repeated 20 times. The 3 files chosen consisted of one file relatively deep in the module hierarchy, one in the upper two-thirds, and another one near the very top. These were chosen as to provide a realistic accounting of the kinds of edits users might perform in an actual IDE session, taking into account a diverse set of files from across the entire module hierarchy.

version	baseline	avoid-recompile	serialize-core	on-demand-bytecode
time waiting for responses	448.5s	423.8s	80.1s	36.2s
time after responding	10852.4s	503.3s	315.0s	98.3s
total time	11300.9s	927.3s	395.2s	134.4s
initial load time	447.6	429.2s	84.8s	46.9s
average time per response	0.019s	0.010s	0.011s	0.008s
GHC rebuilds	21680	2238	339	339
max residency	7093MiB	4903MiB	3937MiB	3078MiB
bytes allocated	3817GiB	676GiB	533GiB	254GiB

The time measurements are wall-clock times:

“time waiting for responses” is the sum of the times between requests being issued and the user seeing responses to their requests (including the initial load in order to respond to the first request).
“time after responding” is the sum of the time spent after responses have been returned but HLS is still busy (e.g. recompiling after an edit).
“total time” is the time taken to run the entire benchmark.
“initial load time” is the time taken for HLS to load the project and come back to idle.
“average time per response” is the average amount of time HLS took to respond to a request (not counting the initial load).
“GHC rebuilds” counts the total number of times HLS called into the GHC API to rebuild a module.

The total time and number of GHC rebuilds for baseline is vastly higher than the others, because without the recompilation changes we are essentially compiling the entire project on every edit to the file deep in the module hierarchy, while avoid-recompile or serialize-core do a comparatively minimal amount of recompilation work on every edit. The actual factor of improvement is not very meaningful (it could be made arbitrarily high by increasing the number of iterations). However, it does show a very significant improvement to the user experience of the IDE: getting up to date information, warnings and errors much faster compared to the status quo.

Between avoid-recompile and serialize-core the total time decreases further, because on a fresh start, the files can be loaded directly from disk rather than recompiling. Looking at the “GHC rebuilds” column, avoid-recompile needs to rebuild every file once at startup, and then do a minimal amount of recompilation on edits. serialize-core and on-demand-bytecode have to do some much smaller amount of recompilation on startup due to editing a file deep in the hierarchy, and do the same amount of recompilation due to edits as avoid-recompile.

With on-demand-bytecode, we see another dramatic improvement to initial load times as we we can avoid compiling many files to bytecode. We also see a dramatic improvement in the total time as we avoid all this work even on recompiles.

Looking at a graph of heap usage against time shows the dramatic impact of the changes:

“getDefinition after edit” - live bytes over time

Set 2: Impact of serializing Core and generating bytecode on demand

This set of varied benchmarks compares just the avoid-recompile, serialize-core and on-demand-bytecode changes, since they are all substantially better than baseline. These benchmarks consisted of 50 repetitions each.

This time, only the two files near the top were edited as part of the benchmark, because the third file is relatively deep in the module hierarchy, and editing it invalidates a large amount of build products we have on disk, somewhat damping the effect of faster startups due to serialized Core.

Full results are given in the Appendix below. The heap usage vs time graph for some of these benchmarks is also included, for example here is the graph for the first benchmark in this set:

“hover after edit” - live bytes over time

The initial upward sloping segment shows us the initial load, and the flat, spiky segment is the section where we are editing, performing requests and recompiling. The vertical lines show the times at which the initial load is complete. It is clear that the initial load times are vastly improved with the serialize-core changes, going from consistently over 300s to under 100s, and further improved by the on-demand-bytecode changes, further reducing to around 60s.

Of course, in practice, the impact of these improvements differs quite heavily depending on usage. If the first thing done by the user is to open and edit something deep in the module hierarchy upon which a lot of things depend, the improvements can be quite minimal. If on the other hand something at the top of the module hierarchy is opened first, startup times would be greatly improved because we wouldn’t need to recompile anything below it.

Conclusion

We are grateful to Mercury for funding this work, as it will benefit the whole Haskell community. We have already made significant progress improving the performance of HLS, and are continuing to identify further opportunities for performance improvements.

Well-Typed are actively looking for funding to continue maintaining and enhancing HLS and GHC. If your company relies on GHC or HLS, and you could support this work, or would like help improving the developer experience for your Haskell engineers, please get in touch with us via info@well-typed.com!

Appendix: benchmark results

Hover after edit

version	avoid-recompile	serialize-core	on-demand-bytecode
time waiting for responses	324.6s	94.3s	48.4s
time after responding	295.0s	336.2s	102.5s
total time	619.5s	430.5s	151.0s
initial load time	324.2s	98.5s	52.2s
average time per response	0.005s	0.005s	0.006s
GHC rebuilds	1861	339	339
max residency	4697MiB	3777MiB	2787MiB
bytes allocated	595GiB	533GiB	178GiB

getDefinition after edit

version	avoid-recompile	serialize-core	on-demand-bytecode
time waiting for responses	456.0s	92.9s	93.1s
time after responding	387.8s	315.3s	60.4s
total time	843.8s	408.1s	153.5s
initial load time	450.2s	89.5s	63.2s
average time per response	0.05s	0.04s	0.04s
GHC rebuilds	1861	339	339
max residency	4728MiB	3762MiB	2887MiB
bytes allocated	589GiB	385GiB	178GiB

Completions after edit

version	avoid-recompile	serialize-core	on-demand-bytecode
time waiting for responses	785.2s	440.5s	145.3s
time after responding	247.7s	233.3s	24.2s
total time	1032.9s	673.8s	169.5s
initial load time	443.9s	92.0s	62.0s
average time per response	1.25s	1.24s	0.94s
GHC rebuilds	1861	339	339
max residency	4982MiB	4012MiB	3008MiB
bytes allocated	816GiB	598GiB	198iB

Code actions after edit

version	avoid-recompile	serialize-core	on-demand-bytecode
time waiting for responses	479.8s	426.1s	139.3s
time after responding	317.8s	72.3s	0.017s
total time	794.6s	494.4s	139.3s
initial load time	314.9s	72.9s	60.7s
average time per response	4.05s	4.34s	0.80s
GHC rebuilds	1861	339	339
max residency	4990MiB	3983MiB	2981MiB
bytes allocated	685GiB	468GiB	241GiB

Footnotes

Directly serializing bytecode is difficult to implement because the bytecode format hasn’t been very stable across the GHC versions HLS supports, and isn’t designed for easy serialization as it contains many references to things that exist only in memory.↩︎
Unlike bytecode, object code already comes with a serializable format, but has several other drawbacks:
1. Dynamically loaded code is difficult to unload and practically impossible on some platforms. This can make a big difference with interactive use, as memory usage can grow linearly with edits (older object code is never truly unloaded, newer code just shadows the old code). In addition to this, bugs with code unloading on some GHC versions also posed issues.
2. Generating and emitting native code is usually much slower than bytecode, and even though the emitted code may be faster, the cost of the former usually dominates the latter when it comes to the small, usually quite simple types of programs that are executed due to Template Haskell splices
3. Linking object code can also be quite expensive.
↩︎