Does Erlang/OTP need a new package management solution?

Since mid 2011 I’ve been thinking on and off about this question. There are some package management solutions available for Erlang/OTP already, but none of them really seem to meet my needs. I had been considering writing a new solution from the ground up, but decided to take a pause and engage with members of the open source community first. I reasoned that it’s better to build something that benefits the whole community and supports a wide range of user experiences, rather than just hack something together for my own use. Since the turn of the year, I’ve had some very constructive conversations with the Erlware developers, as well as some recent discussions on erlang-questions about this topic, with Joe Armstrong contributing to the pool of ideas. This post looks at the origin of these conversations and some of the driving forces behind them, and concludes with a review of the direction the Erlware developers and I think we ought to consider taking.

TL;DR

Really, you mean you don’t want to read my essay of a blog post? I don’t blame you – I generally have the attention span of a small puppy, so here goes with the overview for those other ADHD folks out there:

  • we need/want a dependency manager, not a package manager
  • everything works on the command line
  • there is also an Erlang API for everything (with no assumptions about the runtime environment) so that tools integration is easy
  • packages need to be identified by name, version and publisher (e.g., basho/rebar-1.0 is not the same as hyperthunk/rebar-1.0)
  • multiple versions of packages (plus maybe multiple originators) may exist within a local machine so….
  • so we need some kind of custom local repository
    • that understands publisher as an additional concept
    • with a means of getting the right code path set up for any given task
  • storing package meta-data indexes in git is smart and easy
    • one git repository per publisher/organisation is best, supporting index meta-data for any package they’re publishing
    • users simply white-list publishers/organisations and get access to all their packages thereafter
    • creating your own index (of packages) is easy and secure (via your github repository access control settings)
  • we prefer pushing built, binary artefacts rather than having the index(es) point to source code only repositories
  • the binary (+ mandatory stuff like includes and/or optional things like source code) artefacts should probably be bundled as .ez files
  • published binary (.ez) artefacts can and should just live inside the index, but pulling the index should not mean pulling the whole remote repo
    • you never actually clone the whole repository unless you’re the publisher who owns it and is working on and publishing to it
    • the master branch is generally empty and contains only a README for the benefit of github browsing
    • the index itself lives in a special ‘index’ branch and nothing but index metadata ever goes in here
    • when a binary artefact is added, it is put on a new branch and tagged – all the index metadata is deleted from the branch/tag so only the artefact remains
    • when pulling the index or a specific artefact that has been located by examining a local copy of the index, you fetch only that specific subset of the repository, using git’s fetch-pack plumbing command.

And for those who, rather than read the following summation, would prefer trawling through the (very interesting!) conversations on the mailing list, you can search the erlware-questions group archives for anything around package management. Here are some specific conversations to browse at your leisure:

– original idea from Eric:
https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/GtFBTQtgeng
– overview: https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/omunsj8pfs4
– some of Joe’s questions from the erlang-questions package management thread that we visited:
https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/ZbRdDAkFQPo
– repository design questions:
https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/vNHjrvIScGE
– erlang/repository namespace issues:
https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/cav3oK_D8sw
– code signing:
https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/1esqRJU11EE

The idea of using git fetch-pack is illustrated towards the bottom of this post: https://groups.google.com/forum/?fromgroups#!topic/erlware-questions/js06abXa8Mk.

And so to the little details…

A quick aside…

In this post, I repeatedly refer to the packaging tool, as I’m very lazy. In fact, in the Erlware discussions, we’ve generally agreed that each of the following tasks could potentially be solved by a different program, with the tool chain working in an integrated fashion to provide a complete workflow (a rough sketch of how such a chain might hang together follows the list):

  1. managing local and/or remote repositories
  2. solving dependencies
  3. fetching dependencies/indexes
  4. building
  5. packaging/assembling
  6. publishing
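
To give a feel for how that integrated tool chain might hang together, here is a purely speculative sketch of an Erlang-facing API, one module per task. None of these modules exist today; every name below is invented solely to illustrate the separation of concerns:

%% Entirely hypothetical: none of these modules or functions exist; the names
%% are invented only to illustrate the separation of concerns listed above.
workflow(ProjectDir, DeclaredDeps, LocalRepo, Publishers) ->
    ok       = repo:update(LocalRepo, Publishers),     %% (1) manage/refresh repositories
    Deps     = solver:solve(DeclaredDeps, LocalRepo),  %% (2) solve to a concrete dependency set
    ok       = fetcher:ensure(Deps, LocalRepo),        %% (3) fetch any missing indexes/artefacts
    ok       = builder:build(ProjectDir, Deps),        %% (4) build against the resolved deps
    {ok, Ez} = assembler:package(ProjectDir),          %% (5) assemble a .ez archive
    publisher:publish(Ez, LocalRepo).                  %% (6) sign and install into the local repo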

The user/developer experience

The first tenet of this proposed tool chain is that every aspect of the workflow – whether you’re consuming software, publishing it, or just trying to build someone else’s kit – must be automated. The command line should suffice for everything, and the level of configuration required to make the tool(s) work should be minimal.

If I’m going to consume an OTP library/application, then I’d like this process to be *really* simple. If it’s a matter of fetching the software to my local machine, then I want the command to be something really simple, like ${tool} fetch {thing} or whatever. If I’m building a project and want to have this ‘dependency management tool’ integrated into my build, then I basically want a simple sequence like

  1. specify the things I depend on
  2. call the tool to fetch/install everything
  3. compile/build my project (step ‘2’ might be integrated here instead)

Anybody who wants to build my project from source should perform (2) and (3), or just (3) if the build tool is integrated nicely. That’s how it should be – simple. For stage (1), most people will happily settle for one of

  1. writing a file containing erlang terms, a la rebar/sinan config – or adding to an existing build config of this ilk
  2. running a command in the shell that lists the potential choices, and given a choice generates/updates the config for me

Either way, the file in which dependencies are declared must be human readable, and should not be hard to write by hand, which pretty much rules out JSON or XML or anything like that.

Declaring Dependencies and Managing Repositories

Obviously gathering dependencies requires that I know the application/library name and version. Some tools (like Cabal and RubyGems) support specifying a version based on some kind of range – that’s nice, but for now let’s put it to one side and assume the version number is going to be fixed. So to get hold of the lager logging framework, what should I declare?

%% dependencies.cfg
{lager, "0.9.4"}.

That’s nice and simple – no URL to worry about or anything like that. So assuming that lager-0.9.4 is not already available to me locally – we’ll cover what this means in practice later – how should a dependency manager resolve it to a viable list of packages? This is where the next assumption comes in – it shouldn’t. Perhaps OOTB some default source might be available – possibly provided by Erlware, or ProcessOne, Erlang Solutions or even Ericsson!? – but assuming it isn’t, the dependency manager should puke. You need to configure at least one source repository, so that the index of available packages can be downloaded/updated and searched for candidates.

One thing that many package management solutions have got right is providing the ability to source software from multiple places. For the developer of a packaging solution, it is better not to have to maintain a canonical repository of source code (or built artefacts), simply because verifying their origin requires some degree of manual intervention. This leads us to the second kind of user: those who’re packaging and publishing their software. Their experience ought to look something like this:

  1. project is tagged, built and ready for publication
  2. publisher runs the packaging tool [this is the end of the manual intervention required]
  3. packaging tool generates appropriate meta-data for a repository index based on the project
  4. packaging tool bundles everything up into an archive – probably a compressed .ez archive
  5. packaging tool hashes the data in the archive and digitally signs the result using the publisher’s private key
  6. packaging tool inserts the package meta-data into the local index
  7. packaging tool places the package artefact into the local repository

The package is now ‘released’ into the local repository and ready to be uploaded so that others can benefit from it. This process should work with any project that uses an OTP layout (and with the help of some command line flags, also those that don’t) and therefore can be used by a developer to install any source code package or pre-packaged artefact into their local repository, even if they originally had to download it from bittorrent because it was never published anywhere else (properly).
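
To make step (5) a little more concrete, here is a minimal sketch of what signing and verification might look like using OTP’s public_key application, assuming an RSA key pair stored in PEM format. The module and function names are mine, not part of any existing tool, and a real implementation would also have to decide how signatures are encoded into the index meta-data:

%% Minimal sketch only: assumes an RSA private key in PEM format on disk.
-module(artefact_sign).
-export([sign/2, verify/3]).

sign(EzFile, PrivKeyPemFile) ->
    {ok, Data}   = file:read_file(EzFile),
    {ok, PemBin} = file:read_file(PrivKeyPemFile),
    [KeyEntry|_] = public_key:pem_decode(PemBin),
    PrivKey      = public_key:pem_entry_decode(KeyEntry),
    %% hash the archive contents and sign the digest with the publisher's key
    public_key:sign(Data, sha, PrivKey).

verify(EzFile, Signature, PubKey) ->
    {ok, Data} = file:read_file(EzFile),
    %% returns true if the signature matches the publisher's public key
    public_key:verify(Data, sha, Signature, PubKey).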

So how does the consumer get hold of this package which is now installed on the publisher’s local machine? This part ought to be relatively easy. The packaging tool will obviously support a kind of `publish` operation, and the implementation of this can be incredibly simple.

Let’s assume that the local repository is implemented as a git repository. Let’s also assume that the index meta-data is stored in a simple file system layout (the details of which we’ll revisit later), and only this index meta-data – which describes the artefact, version, digital signature, MD5 for comparison post download, etc – is present in the ‘master’ branch. No other data lives in the main branch of the repository.
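
Just to illustrate what might go into such a file, the meta-data could be stored as plain Erlang terms. The exact fields and their names here are conjecture on my part:

%% index/lager/0.9.4/index.meta (conjectural layout)
{artefact,  lager}.
{version,   "0.9.4"}.
{publisher, hyperthunk}.
{erts,      "5.9.1"}.                       %% erts version the artefact was built with
{md5,       "<md5-of-the-ez-archive>"}.
{signature, "<base64-encoded-signature>"}.
{location,  {branch, "lager_0.9.4", tag, "lager-0.9.4"}}.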

Assuming this is true, and given an artefact lager-0.9.4, writing the index meta-data might be as simple as:

$ cd $REPO
$ ls -A  # nothing here but the index directory and a readme
.git index README.txt
$ mkdir -p index/lager/0.9.4
$ cat $LAGER_METADATA >> index/lager/0.9.4/index.meta
$ git add index/lager/0.9.4
$ git commit -m "$LAGER_COMMIT_MSG"

Clearly if we did a git push origin master now, we’d have an index that ‘claimed’ a lager-0.9.4 was present in the repository, when this isn’t the case. Now the underlying work the packaging tool must do looks something like:

$ git checkout -b lager_0.9.4
$ git rm -r index
$ cp $LAGER_EZ_ARTEFACT .
$ git add $(basename $LAGER_EZ_ARTEFACT)
$ git commit -m "$LAGER_COMMIT_MSG"
$ git tag -a lager-0.9.4 -m "$LAGER_COMMIT_MSG"
$ git push origin lager_0.9.4 --tags
$ git push origin master

And now there is both an index and an artefact uploaded to the remote repository.

Only Downloading What You Need

If you’ve been following this carefully, you’ll now be thinking about how the consumer does not want to download an entire repository full of applications/libraries, and many potential versions of each! It turns out that fetching only the parts you want is easy enough.

$ cd $REPO_INSTALL_AREA
$ mkdir -p $REQUIRED_ARTEFACT   # lager-0.9.4 in our case
$ cd $REQUIRED_ARTEFACT
$ git init
Initialized empty Git repository in /private/tmp/scratch-dir/.git/
$ git fetch-pack --include-tag -v git@github.com:hyperthunk/gitfoo.git refs/tags/lager-0.9.4
Server supports multi_ack_detailed
Server supports side-band-64k
Server supports ofs-delta
want c1bba117cc28e3c839a21d69e56af5768856930b (refs/tags/lager-0.9.4)
done
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 7 (delta 0), reused 7 (delta 0)
Unpacking objects: 100% (7/7), done.
c1bba117cc28e3c839a21d69e56af5768856930b refs/tags/lager-0.9.4
$ git archive c1bba117cc28e3c839a21d69e56af5768856930b >> lager-0.9.4.ez
$ ls -la
total 280
drwxr-xr-x   4 t4    st     136 28 May 21:55 .
drwxrwxrwt  11 root  st     374 28 May 21:52 ..
drwxr-xr-x  10 t4    st     340 28 May 21:54 .git
-rw-r--r--   1 t4    st  143360 28 May 21:55 lager-0.9.4.ez
$ unzip -l lager-0.9.4.ez 
Archive:  lager-0.9.4.ez
warning [lager-0.9.4.ez]:  1536 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length     Date   Time    Name
 --------    ----   ----    ----
        0  05-28-12 21:51   ebin/
    16168  05-28-12 21:51   ebin/error_logger_lager_h.beam
      937  05-28-12 21:51   ebin/lager.app
    10220  05-28-12 21:51   ebin/lager.beam
     3500  05-28-12 21:51   ebin/lager_app.beam
     3704  05-28-12 21:51   ebin/lager_console_backend.beam
    11228  05-28-12 21:51   ebin/lager_crash_log.beam
    14600  05-28-12 21:51   ebin/lager_file_backend.beam
    23060  05-28-12 21:51   ebin/lager_format.beam
     3384  05-28-12 21:51   ebin/lager_handler_watcher.beam
     1284  05-28-12 21:51   ebin/lager_handler_watcher_sup.beam
     3720  05-28-12 21:51   ebin/lager_mochiglobal.beam
    22096  05-28-12 21:51   ebin/lager_stdlib.beam
     2928  05-28-12 21:51   ebin/lager_sup.beam
     8244  05-28-12 21:51   ebin/lager_transform.beam
    12580  05-28-12 21:51   ebin/lager_trunc_io.beam
    13920  05-28-12 21:51   ebin/lager_util.beam
        0  11-07-11 11:20   include/
     3048  11-07-11 11:20   include/lager.hrl
    10175  11-07-11 11:20   LICENSE
     7639  11-07-11 11:20   README.org
 --------                   -------
   172435                   21 files
$

We can clearly see that in the scratch directory where we ran git init, we have acquired only the data from the tag the packaging tool created, which lives in its own artefact/version specific branch and is thus isolated from everything else in the repository. Each artefact version can be kept separate by the tool in this way, and all of them held apart from the repository’s searchable index, which is stored and maintained in the master branch.
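
A tool could reproduce those manual steps by shelling out to git. The sketch below does exactly what the transcript above shows (fetch the tag’s objects, then archive them into a local file); the module and function names are invented for illustration, and it assumes fetch-pack reports the fetched ref on stdout as in the transcript:

%% Hypothetical wrapper around the manual git session shown above.
-module(artefact_fetch).
-export([fetch/3]).

fetch(RemoteUrl, Tag, DestDir) ->
    ok  = filelib:ensure_dir(filename:join(DestDir, "x")),
    _   = sh(DestDir, "git init"),
    Out = sh(DestDir, ["git fetch-pack --include-tag ", RemoteUrl,
                       " refs/tags/", Tag]),
    %% fetch-pack reports "<sha> refs/tags/<tag>" for each ref it brought down
    {match, [Sha]} = re:run(Out, "([0-9a-f]{40})\\s+refs/tags/",
                            [{capture, [1], list}]),
    EzFile = filename:join(DestDir, Tag ++ ".ez"),
    _ = sh(DestDir, ["git archive ", Sha, " > ", EzFile]),
    {ok, EzFile}.

sh(Dir, Cmd) ->
    os:cmd(lists:flatten(["cd ", Dir, " && ", Cmd])).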

This approach also solves the security conundrum, because only people with ssh access to your github repository will be able to push changes. Those who wish to consume the data (i.e., check out the master-index and potentially fetch-pack some of the branches to obtain artefacts) may do so at will, but they cannot write back to the repository unless you’ve authorised them yourself.

Packaging Namespaces

One key issue we wanted to address was the tendency of open source projects – particularly those hosted on github – to be forked by multiple authors/maintainers. When consuming an OTP library or application, you may not care about this, but in order for more than one source repository to exist, we need to be able to distinguish between publishers!

If I have forked the lager application to my hyperthunk git account, and you have both my index/repository and the Basho Technologies repository listed as potential sources for resolving dependencies, then you’re going to have to get specific about which published version of lager-0.9.4 you actually want. We assume that you will do this by specifying the publisher/organisation along with each dependency, like so:

%% dependencies.cfg
{esl, parse_trans, latest}.
%{basho, lager, "0.9.4"}.
{hyperthunk, lager, "0.9.4"}.
{hyperthunk, annotations, "0.0.2"}.

The great thing about this approach is that I (hyperthunk) do not need to actually fork the lager repository in order to publish it under my name! All I need to do is build the artefact and publish + push it to my repository. The code in my repository is signed with my private key, so if you trust me (and git’s ssh based security) then you can use my version of lager-0.9.4 if you wish. You can always obtain my public key (from the repository or elsewhere) in order to verify the integrity of the signed package, which is of course what the packaging tool will do for you when you fetch and install something.

Why is this so great? Well if some author decides not to publish a repository of their own, you can still rely on their code for your project, and treat it just like any other dependency. The mechanism for this kind of 3rd party signing is simple, and works like this:

  1. fetch the unpublished code
  2. build it yourself
  3. publish it into your repo (self signing)
  4. declare {your_organisation, dependency, vsn} in your config and you’re good to go!

So obviously the local repository needs to be split in two: one part containing your own published stuff, another containing dependencies you’ve fetched from other publishers and installed onto your local machine. If you never publish anything of your own, that part of the repo will simply be ignored.

The other reason why this is necessary is that you might want to build Package-A, which depends on hyperthunk/lager, for one project, and Package-B, which depends on basho/lager, for another. When these projects are built (individually) they must have an isolated (clean) environment, so the following constraints need to be handled:

  1. the packaging tool must choose the right org/artefact/vsn and make it available on the code path for any relevant operations (compile, xref, dialyzer, eunit, etc)
  2. the assembly tool must choose the right org/artefact/vsn items when generating a release – reltool is not going to understand how to do this
  3. once the complete dependency DAG is generated, the solver must crash if two items with matching artefact names and versions exist – only one version of an OTP application will ever make it into the runtime via the code server, but having two clashing dependencies that differ only by publisher/organisation name is an error (a rough sketch of this check follows the list)
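
Point (3) is easy enough to express. Here is a rough sketch, assuming the resolved dependencies arrive as {Publisher, Name, Vsn} tuples; the module name and data shapes are illustrative only:

%% Illustrative only: crash if two resolved dependencies share a name and
%% version but come from different publishers/organisations.
-module(dep_check).
-export([assert_no_clashes/1]).

assert_no_clashes(Deps) ->
    Grouped = lists:foldl(
                fun({Publisher, Name, Vsn}, Acc) ->
                        dict:append({Name, Vsn}, Publisher, Acc)
                end, dict:new(), Deps),
    Clashes = [{NameVsn, Publishers} ||
                  {NameVsn, Publishers} <- dict:to_list(Grouped),
                  length(lists:usort(Publishers)) > 1],
    case Clashes of
        []    -> ok;
        [_|_] -> erlang:error({publisher_clash, Clashes})
    end.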

Publishing Source Code or Binary Artefacts

I have a strong preference for publishing binary artefacts instead of source code that must be built. In a development environment, where there may well be multiple versions of Erlang/OTP installed, there is no getting around the fact that the beam emulator offers only two versions of forward compatibility, and none backwards. If you’ve compiled with R13, you’re probably ok up to R15, but you cannot use code compiled with Rn with any version Rx where x < n.

Because of this constraint, fetching source packages and building them locally doesn’t do you much good in practice, because you still have to track what erts version they’ve been built against to ensure they’re compatible. Sure you can rebuild a package once you realize that you need compatibility with an earlier or later runtime, but once you’ve dealt with this issue then you’re quite a way towards handling binary version compatibility (between erts releases) anyway.

Adding the erts version you built a binary package with to the publication meta-data is easy, and once that information is in the index, the solver can notify a user of R13 that the only published packages available are built for >= R14. The user can then contact the publisher and ask for R13 packages to be provided, or resort to building the sources themselves and 3rd party signing them (or looking to see if someone else they trust has already done so). Either way, if the package meta-data carries information about the source repository and build command(s), this can easily be automated.
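
As a rough illustration of that check, a solver could compare the erts version recorded in the index against the running emulator’s version. Assume for the sake of the sketch that each index entry is a {Name, Vsn, BuiltWithErts} tuple; the real rules (the two-release window, patch levels and so on) would need rather more care than this:

%% Rough illustration only: keep entries built with an erts no newer than ours.
compatible(IndexEntries) ->
    Local = parse_vsn(erlang:system_info(version)),   %% e.g. "5.9.1" on R15B01
    [Entry || {_Name, _Vsn, BuiltWithErts} = Entry <- IndexEntries,
              parse_vsn(BuiltWithErts) =< Local].

parse_vsn(Vsn) ->
    [list_to_integer(Part) || Part <- string:tokens(Vsn, ".")].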

Having binary artefacts gets more involved when dealing with dependencies that contain native source code (for ports, linked-in drivers, NIFs and BIFs, etc). This has been discussed at length on erlware-questions and basically it boils down to the same issue as supporting multiple erts versions, except that you’ve got OS plus different kinds of architecture to deal with, leaving you with a more complex scheme for searching indexes and/or putting items into repositories:

erts-vsn/os/arch/{32|64}

In practice this does add complexity, and it’s highly likely many publishers will not bother to produce artefacts for various platforms/architectures. This is, of course, where 3rd party signing really shines once again.

Why not use [X] instead?

Finally, I’d like to address the kinds of ‘package management’ tools we’ve discussed on the Erlware mailing list. The conversation(s) inevitably started out with people expressing dissatisfaction about the current solutions that are available to solve the problem of getting code onto your machine. We quickly noted that most of the discussion centered around activities that take place at build/development time, the focal point being how to obtain working dependencies when building a complex software project in Erlang/OTP. To my mind, this immediately takes us out of the traditional ‘package management’ territory, where the primary concern has to do with installing version X of package Y into the local environment. This also puts us outside of the ideas Joe Armstrong has put forward about remote code loading and importing code from URLs and the like. These are very good ideas – just go look at Smalltalk to see how well they can work – but they’re out of scope for most of today’s tools and probably not going to surface in the near/immediate future.

The two major players at the moment appear to be rebar and agner, although rebar is of course a build tool at heart and not a package manager. The approach that rebar takes is definitely closer to what I’d call a ‘dependency manager’, in that it supports the declaration of software component dependencies in static configuration, and provides a command line interface for fetching, updating and/or removing these from the local source tree of the project in which the declaring config resides. Once these dependencies have been fetched, they are thereafter treated like part of the project’s own source tree and are built (i.e., compiled, tested, cleaned, etc) along with the project itself. As rebar is a tool for building OTP compliant software packages, any dependency must also be a valid OTP application (or library application) in order for this mechanism to work.

The approach that agner takes is similar, fetching, building and installing OTP applications/libraries into the code:lib_dir of the current Erlang/OTP installation, or an alternative site. It [agner] also supports upgrading and removing them if required. The mechanism agner uses is more complete than rebar’s simple approach based on VCS URL and optional commit/tag/branch, allowing the publisher to specify details about the package that ease the pain of consumption. In order to support applications that aren’t rebar compatible, agner allows an explicit build command to be specified, which is executed in an external shell process.

What else is out there right now? There is Jacob Vorreuter’s Erlang Package Manager. This is actually a great bit of kit, but suffers from problems with rate limiting due to its use of the github search API instead of an explicit package index. Jake published another package management solution later on, in the form of sutro, which is inspired by Homebrew (and works in a similar vein).

The key issue we saw with rebar’s dependency handling approach was that it only works for rebar – so it is of no use to projects using sinan or some other build system such as waf, fw-erlang or more traditional autotools and/or make based projects such as Ejabberd and RabbitMQ. The use of a local directory to store dependencies also makes this approach a menace when you’ve got dozens of little projects which all depend on the same libraries, as they end up littering the machine. This approach, however, was put in place for a good reason, and it does avoid running into problems where globally installed components can lead to unexpected clashes on the code path, incorrect versions being resolved or other problems inherent in shared/global environments. This idea of isolation is very important to maintaining a clean development environment for each build of each project you work on, as evidenced by the excellent virtualenv tool for Python and similar tools for Ruby (such as rvm and rbenv) and Haskell’s virtualenv clone.

Clean, isolated build environments are essential to maintaining a productive development life-cycle and even more vital for things like CI.

The main reason we found ourselves not using agner was the dual cost of maintaining indexes and searching them. The former is a relatively minor pain, but the latter is excruciating due to general slowness.

  1. #1 by Yuri on March 16, 2013 - 7:56 pm

    Hi,

    These are great ideas! I really like how deep you go into the details.
    Have you implemented some prototypes? Is there some work in progress regarding this or hasn’t this gone any further than just discussing?

  2. #2 by Hyperthunk on March 18, 2013 - 12:11 pm

    Hi Yuri, I spent a little time prototyping some of this, but I quickly parked the ideas as I wasn’t convinced at the time that there was sufficient interest in the community to make it work. It seems that Erlang Solutions are now putting some time and effort into similar ideas, so in the first instance, I’d suggest reaching out on the erlang-questions mailing list to see if anything is happening. I remain loosely involved in rebar (from a distance atm) but I’ve not been actively pursuing this particular area for the last year.
