I decided to take some time giving Erlang another hard look, and do some Bitcoin blockchain parsing in the process. Originally developed by Ericsson for use in their telephony products, Erlang is a functional programming language that is fundamentally built around concurrency. Designed for building large distributed and fault tolerant systems, and capable of taking advantage of multi-core architecture. I’ve spent most of my career in imperative and object-oriented languages, with only brief dalliances in the functional realm including a college level lisp assignment that went wrong quickly. Writing concurrent software is hard, anything that makes it easier definitely deserves a look.

Parsing the blockchain seemed like a good project, hairy enough to offer some challenge, but with enough leeway to let me dive into the unique strengths of Erlang, explore its I/O subsystem and wrap my head around the deeper structure. You can check out my code on Github. I’ve done some high level parsing, I’ve left the script and hex output, and real csv output as an exercise for the reader, send me patches. In many ways the language is structured more like an operating system than the more mainstream alternatives. Code modules run in isolates processes and communicate via explicit message passing with mailbox queues and selective receive. Similar to Unix style mailbox or piped communication where the receive side is a structured expression. This is a simple and powerful model for communication, the Unix way, I like it already.

If you want to try out the code, just clone the repository. Check out the README for more detailed install instructions, copy a block file or ten from ~/.bitcoin/blocks/dat?????.dat into the same directory and run make all; make shell.

make shell will drop you into an Erlang shell with the necessary modules loaded:

make shell
/usr/bin/rebar skip_deps=true compile
==> blockparser (compile)
==> blockparser (eunit)
Compiled src/blockparser_sup.erl
Compiled src/blockparser.erl
Compiled src/blockparser_worker.erl
  There were no tests to run.
Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.0  (abort with ^G)
1> application:start(poolboy).
ok
2> application:start(blockparser).
ok
3> blockparser:parse("blk00000.dat").

This will built a raw but parses blockchain text file at blk00000.dat.csv in the current working directory. For fun lets check out the process control tree and other runtime information:

4> observer:start().

System Overview Application view Process table

Application components are structured into modules with explicit interfaces, and all of these processes have limited mutable state. This has interesting ramifications with respect to performance and limiting potential side effects. Reproducibility given a set of inputs is an important concept, and in stark contrast to languages with mutable state where you can inadvertently generate different outputs to a defined set of inputs. The code is a little more complicated than it has to be because I decided to write the entire thing within the constraints of the Open Telecom Framework.

OTP is a major component of Erlang and provides a generic set of behaviours that model general functionality. The gen_server framework facilitates bi-direction process communication, the gen_fsm framework is all about building finite-state-machines, etc. Ultimately, these processes are grouped together into a supervision tree hierarchy. I am a huge fan of supervision trees. The DJB Daemon-tools framework is my go-to for building fault-tolerant processes under Linux, and these concepts are alive and well in Erlang. This allows portions of the system to be restarted on failure, and provide fault-isolation. The typical object-oriented program will fail if there is an error in a sub-module that isn’t caught. In OTP there is a way to model and recover from even serious faults in a distributed architecture.

Somehow I got this far without mentioning pattern matching, the idea that the right and left side of an equation are matched, and that variables are single-assignment. It sounds terrible, but in practice is rather elegant and easy to deal with. The built-in bit syntax is an example of the power of expression matching when applied to binary streams, an excerpt from blockparser_worker.erl:

extract(<< >>) -> ok;
extract(<<?MAGICBYTE:32/little, 
    HeaderLength:32/little,
    VersionNumber:32/little, 
    PreviousHash:256/bitstring, 
    MerkleRoot:256/bitstring, 
    TimeStamp:32/little, 
    TargetDifficulty:32/little, 
    Nonce:32/little,
    BinRest/binary>>) ->
   [TXCount, Tbin] = getVarInt(BinRest),
   [Tdata, _Rest] = getTransactions(TXCount, Tbin),
   {ok, {?MAGICBYTE, HeaderLength,
    VersionNumber, PreviousHash,
    MerkleRoot, TimeStamp,
    TargetDifficulty, Nonce,
    TXCount, Tdata}, _Rest};
extract(<<R:8, _Bin/binary>>) when R > 0 ->
    io:format("Problem: ~w~n", [binary:bin_to_list(_Bin, {0, 10})]),
    {scan, _Bin};
extract(Data) -> {scan, Data}.

The binary being read from the the on-disk block file matches a complex expression leading with a 32bit little endian encoded MagicByte as a start token. If the pattern doesn’t match a tuple {scan, Data} is returned which causes the parent function to bit-shift forward and try to match again. A very expressive piece of code that would be difficult to match in many languages. This focus on writing binary protocols. I’m also a fan of my implementation of variable-length integers:

getVarInt(<< TXCount:8, BinRest/binary >>) when TXCount < 253 -> [TXCount, BinRest];
getVarInt(<< 253:8, TXCount:16/little, BinRest/binary >>) -> [TXCount, BinRest];
getVarInt(<< 254:8, TXCount:32/little, BinRest/binary >>) -> [TXCount, BinRest];
getVarInt(<< 255:8, TXCount:64/little, BinRest/binary >>) -> [TXCount, BinRest];
getVarInt(_) -> error.

A series of functions with guard expressions and multiple returns, but also very concise. I probably should be returning a tuple, but that is a philosophical flame-war for another day. These are the functional roots of Erlang layered with a heavy dose of pragmatism. I think that is the design feature that sticks out more than anything else, Erlang was designed primarily to solve real world problems in a specific space, and those decisions have permeated the base libraries, frameworks, and general structure.

The astute reader will note that I am using poolboy to create a worker pool that handles long running parsing requests. This may seem right, but is actually wrong. I was under the impression that poolboy was both a worker pool and a work queue, it is decidedly the former. That means in order to leverage it properly I would be roped in to building a proper queuing solution. The right way to do this would be to create a dispatch process that loads the work queue, and then have a series of processes that read from the queue as appropriate. Erlang makes that type of interaction very easy. As it stands the code is fairly efficient, I spawn a process for each core and give each a sequential list of files to work through. The I/O in that situation was fairly fast. Earlier I had experimented with sending block updates to a logging process, but that queued a tremendous number of messages in the receive queue and ultimately resulted in poor write performance.

The long and short of it is that Erlang’s concurrency and bit syntax made short work of what would otherwise have been a fairly hairy experience. It has an uncanny structure to Unix itself, and is remarkably natural if you have a strong familiarity with those process and IPC mechanisms. Definitely give it a shot, after a few hours of utter frustration you will be glad you did.

Comments