Disco, using Erlang to Implement MapReduce

Categories:

cufp2011, Erlang

Jared Flatow, Prashanth Mundkur, Ville Tuulos We will describe our experiences using Erlang within Nokia to build Disco [1], a lean and flexible MapReduce framework for large-scale data analysis that scales to large clusters and is used in production at Nokia.

Disco is an open-source project that was started in 2008 when attempts to use Hadoop to analyze data proved to be a painful experience. The MapReduce step formed only a portion of the analytics stack, and it was felt that it would be faster to write a custom implementation that would integrate well, than adapt Hadoop with the amount of internal Hadoop expertise available. Among the crucial tasks of such an implementation would be to deal with cluster monitoring, fault- tolerance, and the management and scheduling of a large number of concurrent and distributed jobs. To keep the implementation simple, the use of a platform that provided first-class support for distribution and concurrency was imperative. This motivated the choice of Erlang/OTP to implement the core control plane of Disco. It bears stressing that this choice was driven primarily by pragmatic concerns, as opposed to any beliefs about the superiority of functional programming languages in general or Erlang in particular. The fact that Erlang has a well-known track record in the implementation of production- quality reliable server systems with minimal downtime made it an easier selection to justify internally. The choice of language to implement computations in Disco was Python, which also emerged from pragmatic priorities. Python is a popular language in the scientific computing community, with numerous readily- available libraries. It is simple to write Python programs using the extensive Python libraries to analyze relatively unstructured log and data files, an important initial use case for Disco. Once the core of Disco was implemented in Erlang, a need was felt for a custom distributed filesystem, since experiences with off-the-shelf solutions for Linux proved unsatisfactory. Many of the interfaces and abstractions offered by POSIX-compatible filesystems were thought unnecessary in the MapReduce set- ting; instead, a minimal set of requirements and interfaces was chosen, and these were implemented as the Disco Distributed Filesystem (DDFS). Again, the choice of Erlang for the implementation proved sound: the tag-based replicating DDFS has been implemented in 2.8K lines of Erlang (including comments). The Erlang portion of Disco itself is 6.3K lines.

We’ve recently implemented a support library [2] for writing Disco jobs in OCaml. This takes advantage of the support for workers in arbitrary languages added to Disco 0.4. The talk will not discuss the architecture of Disco [3]. Instead, we’ll discuss our experiences using Erlang, Python and OCaml. Although Erlang is a dynamically-typed functional programming language, there are type-checking tools available like Typer and Dialyzer whose use make it possible to apply some amount of static analysis to an Erlang codebase. In the case of Disco, the use of these tools was introduced after the system had been developed; we will talk about some issues we faced in introducing these tools post-facto, and what new Erlang projects can do to avoid these issues. We will present the results of using these tools, some of the issues and bugs they found, and some of the features we wished they had. We will also discuss our experiences in using property-testing Quickcheck-like tools for Erlang like Triq and Proper. One problem that projects in unusual programming languages face is getting even interested parties to contribute to the codebase. For instance, even though one of the new contributors to Disco had a functional programming background, it was not straightforward for him to quickly understand the Erlang code of Disco, even though Erlang is a fairly simple functional language. We will explore the reasons for this, and what we think may help in reducing such hurdles[4]. Lastly, we will discuss some of the Nokia-internal consequences of using Erlang for Disco.

[1] The Disco Project is at http://discoproject.org. [2] ODisco, available at https://github.com/pmundkur/odisco. [3] A paper on the architecture has been submitted to the Erlang Workshop 2011. [4] In particular, we think this is because Erlang programs have a highly dynamic process and linkage structure that is not easily understood by an Erlang newcomer by just looking at source code. Experience needs to be gained with running systems via logs and interaction, more so than with projects in most other languages, which often use relatively static threading structures. Process tree diagrams indicating parent-child and linkage relationships in the documentation are invaluable to get oriented.