Hackage reliability via mirroring

Duncan Coutts – Friday, 30 September 2016

TL;DR: Hackage now has multiple secure mirrors which can be used fully automatically by clients such as cabal.

In the last several years, as a community, we’ve come to greatly rely on services like Hackage and Stackage being available 24/7. There is always enormous frustration when either of these services goes down.

I think as a community we’ve also been raising our expectations. We’re all used to services like Google which appear to be completely reliable. Of course these are developed and operated by huge teams of professionals, whereas our community services are developed, maintained and operated by comparatively tiny teams on shoestring budgets.

A path to greater reliability

Nevertheless, reliability is important to us all, and so there has been a fair bit of effort put in over the last few years to improve reliability. I’ll talk primarily about Hackage since that is what I am familiar with.

Firstly, a couple years ago Hackage and haskell.org were moved from super-cheap VM hosting (where our machines tended to go down several times a year) to actually rather good quality hosting provided by Rackspace. Thanks to Rackspace for donating that, and the haskell.org infrastructure team for getting that organised and implemented. That in itself has made a huge difference: we’ve had far fewer incidents of downtime since then.

Obviously even with good quality hosting we’re still only one step away from unscheduled downtime, because the architecture is too centralised.

There were two approaches that people proposed. One was classic mirroring: spread things out over multiple mirrors for redundancy. The other proposal was to adjust the Hackage architecture somewhat so that while the main active Hackage server runs on some host, the the core Hackage archive would be placed on an ultra-reliable 3rd party service like AWS S3, so that this would stay available even if the main server was unavailable.

The approach we decided to take was the classic mirroring one. In some ways this is the harder path, but I think ultimately it gives the best results. This approach also tied in with the new security architecture (The Update Framework – TUF) that we were implementing. The TUF design includes mirrors and works in such a way that mirrors do not need to be trusted. If we (or rather end users) do not have to trust the operators of all the mirrors then this makes a mirroring approach much more secure and much easier to deploy.

Where we are today

The new system has been in beta for some time and we’re just short of flipping the switch for end users. The new Hackage security system in place on the server side, while on the client side, the latest release of cabal-install can be configured to use it, and the development version uses it by default.

There is lots to say about the security system, but that has (1, 2, 3) and will be covered elsewhere. This post is about mirroring.

For mirrors, we currently have two official public mirrors, and a third in the works. One mirror is operated by FP Complete and the other by Herbert Valerio Riedel. For now, Herbert and I manage the list of mirrors and we will be accepting contributions of further public mirrors. It is also possible to run private mirrors.

Once you are using a release of cabal-install that uses the new system then no further configuration is required to make use of the mirrors (or indeed the security). The list of public mirrors is published by the Hackage server (along with the security metadata) and cabal-install (and other clients using hackage-security) will automatically make use of them.

Reliability in the new system

Both of the initial mirrors are individually using rather reliable hosting. One is on AWS S3 and one on DreamHost S3. Indeed the weak point in the system is no longer the hosting. It is other factors like reliability of the hosts running the agents that do the mirroring, and the ever present possibility of human error.

The fact that the mirrors are hosted and operated independently is the key to improved reliability. We want to reduce the correlation of failures.

Failures in hosting can be mitigated by using multiple providers. Even AWS S3 goes down occasionally. Failures in the machines driving the mirroring are mitigated by using a normal decentralised pull design (rather than pushing out from the centre) and hosting the mirroring agents separately. Failures due to misconfiguration and other human errors are mitigated by having different mirrors operated independently by different people.

So all these failures can and will happen, but if they are not correlated and we have enough mirrors then the system overall can be quite reliable.

There is of course still the possibility that the upstream server goes down. It is annoying not to be able to upload new packages, but it is far more important that people be able to download packages. The mirrors mean there should be no interruption in the download service, and it gives the upstream server operators the breathing space to fix things.