T10: Scalding - The Scala Tool for Data Analytics in Hadoop Systems

  • Dean Wampler, Concurrent Thought
September 24, 2013 2:00 - 5:30 PM


Scalding is a Scala domain-specific language (DSL) for Cascading, a widely used Java API for data analysis in Hadoop clusters. Cascading provides higher-level data flow abstractions that hide many of the low-level complexities of Hadoop’s Java API, thereby accelerating application development. Scalding exploits the functional-programming features and elegant DSL support in Scala to let developers write concise Cascading programs with a syntax that fits data analysis and transformation naturally. Scalding has the same purpose-built feel for general data analysis and transformation that SQL has for queries. It is comparable to Cascalog, a Clojure-based DSL for Cascading.

This hands-on workshop introduces Scalding using examples of typical data analysis problems. Scala syntax is explained as needed. We’ll briefly compare Scalding to other high-level language options in common use, such as Hive and Pig. We’ll also see how functional-programming idioms are a natural fit for working with data. Indeed, you can argue that SQL is a limited form of functional programming. In my view, analytics is an underappreciated “killer app” for the mainstream adoption of functional programming.
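As a rough illustration of why functional idioms suit data work, here is a minimal word-count sketch. It uses only the standard Scala collections, not the Scalding API, and the object name, method name, and sample text are hypothetical. The comments note the SQL clause that each step loosely resembles.

```scala
// A sketch with plain Scala collections; this is NOT Scalding code.
// WordCountSketch, wordCount, and the sample lines are illustrative assumptions.
object WordCountSketch {

  // Roughly: SELECT word, COUNT(*) FROM words WHERE word <> '' GROUP BY word
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("""\W+"""))  // one record per word
      .filter(_.nonEmpty)                        // WHERE word <> ''
      .groupBy(identity)                         // GROUP BY word
      .map { case (word, occurrences) =>
        word -> occurrences.size                 // COUNT(*)
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "that is the question")
    // Sort by word only to make the printed order stable.
    wordCount(lines).toSeq.sortBy(_._1).foreach {
      case (word, n) => println(s"$word\t$n")
    }
  }
}
```

Each step is a small transformation over a collection of records. The workshop presents Scalding as bringing this pipeline style to Cascading flows on Hadoop.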


Go to the GitHub page for this tutorial and follow the instructions shown in the README, which tells you how to install the required tools: Git, Java, Scala, and the Scala build tool, sbt. (Even if you already have them installed, make sure you have the recent versions described in the README.) Please complete these steps in advance so we don't have to spend class time doing them.

Dean Wampler

Dean Wampler specializes in “Big Data” application development, using Hadoop and alternative technologies. Dean is a contributor to several open-source projects and the founder of the Chicago-Area Scala Enthusiasts. He is the author of “Functional Programming for Java Developers”, the co-author of “Programming Scala”, and the co-author of “Programming Hive”, all from O’Reilly. He pontificates on Twitter, @deanwampler, and at polyglotprogramming.com.