When I first started working with Hadoop and Spark, it felt like walking through a jungle: each project depends on other projects, or at least references many other projects with which it can interoperate or to which it is the “better” alternative. In this article, I'll try to clear this up a bit; it is meant to serve as a glossary.

The descriptions are intentionally superficial. This also means that I will not explain in detail how a project can be used or which systems it has interfaces to, but rather how it's usually used. The list contains many competing or redundant technologies. Projects or technologies annotated with a star (★) are my personal favorites that I would recommend to a colleague.

Management and infrastructure

Apache Hadoop
Open-source framework to handle large amounts of data in a distributed way.
YARN
Hadoop's native cluster manager. The cluster manager arbitrates the cluster's resources among the applications submitted to it. To some extent, it is also responsible for job scheduling.
Apache ZooKeeper
A coordination service that helps build reliable, distributed applications. Hadoop depends on ZooKeeper.
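A minimal sketch of coordinating through ZooKeeper with the third-party kazoo Python client; the ensemble address, paths, and payload are placeholder assumptions:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (placeholder address).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Publish a piece of shared state as a znode, e.g. that a
# worker is alive (placeholder path and payload).
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-1", b"alive", ephemeral=True)

# Ephemeral znodes vanish when the session ends; this is the
# building block for leader election and failure detection.
print(zk.get_children("/app/workers"))
zk.stop()
```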
Google File System (GFS)
Google's proprietary scalable, distributed, and fault-tolerant file system for data-intensive applications.
Hadoop Distributed File System (HDFS)
Hadoop's open-source scalable, distributed, and fault-tolerant file system, modeled after GFS.

Processing frameworks and processing models

Map reduce pattern
A programming model often used to process big data sets. The term refers to two operations applied to the data set (basically a large list): first a map stage that transforms the initial list items into derived values, then a reduce stage that computes "statistics" over the whole list. This programming model makes it *easy* to process the data set in a distributed way, because the map stage can run on each partition of the data independently.
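A single-machine word-count sketch of the pattern in plain Python (my own illustration, not tied to a specific framework):

```python
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# Map stage: turn each line into (word, 1) pairs. Each line is
# handled independently, so this stage parallelizes trivially.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce stage: aggregate the counts per word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # {'to': 4, 'be': 3, 'or': 1, ...}
```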
MapReduce
The name of Google's proprietary implementation of the map reduce pattern.
Hadoop MapReduce
The open-source implementation of the map reduce pattern for the Hadoop system. Usually, a MapReduce job runs as an "application" on a YARN cluster.
Apache Spark
Analytics engine for large data sets, powered by a directed-acyclic-graph (DAG) scheduler and in-memory caching.
Spark SQL
Spark module for structured, SQL-like data processing. The module also exposes the data via DataFrame objects.
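For illustration, a minimal PySpark sketch (the column names and rows are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A DataFrame is a distributed table with named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36)], ["name", "age"]
)

# The same data can be queried through the DataFrame API ...
df.filter(df.age > 35).show()

# ... or through plain SQL after registering a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()
```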
Spark MLlib
Spark module providing scalable machine learning algorithms.
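A hedged sketch of the DataFrame-based API (pyspark.ml); the toy training set is invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: a label plus a feature vector per row.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.3)),
     (1.0, Vectors.dense(1.9, 0.8))],
    ["label", "features"],
)

# Training runs distributed across the cluster.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
```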
Apache Pig
General-purpose data processing platform that raises the level of abstraction compared to plain MapReduce applications. Scripts are written in a custom language called Pig Latin.
Apache Crunch
General-purpose data processing platform that raises the level of abstraction compared to plain MapReduce applications.
Apache Tez
Executes a complex directed-acyclic-graph of tasks (for example from Pig/Hive) as a single job instead of a chain of separate MapReduce jobs.
Apache Mahout
Project to build scalable machine learning applications. Uses Apache Spark as a back-end.
Nutch
Extensible and scalable web crawler. Early versions of Nutch implemented MapReduce and a distributed file system, which were later spun out to form Hadoop.

SQL-like Databases

Google BigQuery
Scalable datastore that supports SQL, powered by Google Dremel.
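A hedged sketch using the google-cloud-bigquery client library; the project, dataset, and query are placeholder assumptions, and real use requires GCP credentials:

```python
from google.cloud import bigquery

# Placeholder project; authentication comes from the environment.
client = bigquery.Client(project="my-project")

query = """
    SELECT name, COUNT(*) AS n
    FROM `my-project.my_dataset.users`
    GROUP BY name
"""

# Queries run server-side; results stream back as rows.
for row in client.query(query).result():
    print(row.name, row.n)
```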
Presto
Low-latency SQL query engine. Can read data from multiple sources and formats. Can run with or without Hadoop.
Apache Impala
Low-latency SQL query engine for data stored in HDFS and HBase.
Apache Cassandra
Distributed, wide-column database engine with a SQL-like query language (CQL).
Apache Hive
SQL database engine for batch processing built on MapReduce.

No-SQL Databases

Apache HBase
Non-relational, columnar database implementation modeled after Google's BigTable. Uses HDFS but does not rely on YARN or MapReduce.
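As an illustration, a sketch using the third-party happybase client, which talks to HBase through its Thrift gateway; the host, table, and column names are placeholder assumptions:

```python
import happybase

# Connect via HBase's Thrift gateway (placeholder host).
connection = happybase.Connection("localhost")
table = connection.table("users")  # assumes the table exists

# HBase stores cells under (row key, column family:qualifier).
table.put(b"row-1", {b"info:name": b"alice", b"info:age": b"34"})

# Point reads by row key are cheap; scans iterate key ranges.
print(table.row(b"row-1"))
for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)

connection.close()
```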
Apache Drill
Low-latency, schema-free SQL query engine for No-SQL databases and files. Can read data from multiple sources and formats.
Apache Druid
Columnar, distributed data store for real-time analytics.
Google BigTable
Google's proprietary non-relational database; HBase is modeled after it.

Data import and data streaming

Apache Sqoop
Tool to transfer bulk data between relational databases and Hadoop.
Apache Flume
Service for efficiently collecting and aggregating log data.
Apache Kafka
Distributed event-streaming platform.
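A minimal producer/consumer sketch with the third-party kafka-python client; the broker address, topic, and group id are placeholder assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append events to a topic (placeholder broker/topic).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: read the topic from the beginning as part of a
# consumer group; Kafka tracks the group's read position.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # just demonstrate a single message
consumer.close()
```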

Data formats and serialization frameworks

Avro
Serialization framework whose types and protocols are defined in JSON; primarily used in Hadoop. Does not require code generation when the schema changes.
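A small sketch with the third-party fastavro library; the schema and records are invented for illustration:

```python
import io
from fastavro import writer, reader

# The schema is plain JSON-style data; no code generation needed.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

records = [{"name": "alice", "age": 34}, {"name": "bob", "age": 36}]

# Write and read an Avro container; the schema travels with the data.
buffer = io.BytesIO()
writer(buffer, schema, records)
buffer.seek(0)
for record in reader(buffer):
    print(record)
```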
Parquet
Widespread, fast columnar data storage format.
Protocol Buffers
Serialization framework using code generation.
SequenceFile
Hadoop file consisting of a list of binary key-value pairs.
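As a sketch, PySpark can round-trip SequenceFiles from key-value RDDs; the output path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-sketch").getOrCreate()
sc = spark.sparkContext

# A SequenceFile stores binary key-value pairs, so we start
# from a pair RDD.
pairs = sc.parallelize([("alice", 34), ("bob", 36)])
pairs.saveAsSequenceFile("/tmp/users-seq")  # placeholder path

# Reading yields the key-value pairs back.
print(sc.sequenceFile("/tmp/users-seq").collect())
```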
Resilient Distributed Dataset (RDD)
Spark's primary data structure, a partitioned list of arbitrary elements.
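A minimal PySpark sketch; the data and computation are an invented example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a local list across the cluster as a partitioned RDD.
rdd = sc.parallelize(range(1, 101), numSlices=4)

# Transformations (map, filter) are lazy; the action (reduce)
# triggers the actual distributed computation.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # sum of squares of 1..100 = 338350
```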