When I first started working with Hadoop and Spark, it felt like walking through a jungle: each project depends on other projects, or at least references many other projects with which it can interoperate or to which it is the “better” alternative. In this article, I'll try to clear this up a bit; it is meant to serve as a glossary.

The descriptions are intentionally superficial. This also means that I will not explain in detail how a project can be used or which systems it has interfaces to, but rather how it's usually used. The list contains many competing or redundant technologies. Projects or technologies annotated with a star (★) are my personal favorites that I would recommend to a colleague.

Management and infrastructure

Apache Hadoop
Open-source framework to handle large amounts of data in a distributed way.
YARN
Hadoop's native cluster manager. The cluster manager arbitrates the cluster's resources among the applications submitted to it. To some extent, it is also responsible for job scheduling.
Apache ZooKeeper
A coordination service that helps build reliable, distributed applications. Hadoop depends on ZooKeeper.
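A minimal sketch of coordinating through ZooKeeper with the third-party kazoo Python client; the ensemble address, paths, and payload are placeholder assumptions:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (placeholder address).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Publish a piece of shared state as a znode, e.g. that a
# worker is alive (placeholder path and payload).
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-1", b"alive", ephemeral=True)

# Ephemeral znodes vanish when the session ends; this is the
# building block for leader election and failure detection.
print(zk.get_children("/app/workers"))
zk.stop()
```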
Google File System (GFS)
Google's proprietary scalable, distributed, and fault-tolerant file system for data-intensive applications.
Hadoop Distributed File System (HDFS)
Hadoop's open-source scalable, distributed, and fault-tolerant file system, modeled after GFS.

Processing frameworks and processing models

Map reduce pattern
A programming model often used to process big data sets. The term refers to two operations applied to the data set (basically a large list): first a map stage that transforms the initial list items into derived values, then a reduce stage that computes "statistics" over the whole list. This programming model makes it *easy* to process the data set in a distributed way, because the map stage can run on each partition of the data independently.
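A single-machine word-count sketch of the pattern in plain Python (my own illustration, not tied to a specific framework):

```python
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# Map stage: turn each line into (word, 1) pairs. Each line is
# handled independently, so this stage parallelizes trivially.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce stage: aggregate the counts per word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # {'to': 4, 'be': 3, 'or': 1, ...}
```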
MapReduce
The name of Google's proprietary implementation of the map reduce pattern.
Hadoop MapReduce
The open-source implementation of the map reduce pattern for the Hadoop system. Usually, a MapReduce job runs as an "application" on a YARN cluster.
Apache Spark
Analytics engine for large data sets, powered by a directed-acyclic-graph (DAG) scheduler and in-memory caching.
Spark SQL
Spark module for structured, SQL-like data processing. The module also exposes the data via DataFrame objects.
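For illustration, a minimal PySpark sketch (the column names and rows are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A DataFrame is a distributed table with named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36)], ["name", "age"]
)

# The same data can be queried through the DataFrame API ...
df.filter(df.age > 35).show()

# ... or through plain SQL after registering a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()
```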
Spark MLlib
Spark module providing scalable machine learning algorithms.
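A hedged sketch of the DataFrame-based API (pyspark.ml); the toy training set is invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: a label plus a feature vector per row.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.3)),
     (1.0, Vectors.dense(1.9, 0.8))],
    ["label", "features"],
)

# Training runs distributed across the cluster.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
```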
Apache Pig
General-purpose data processing platform that raises the level of abstraction compared to plain MapReduce applications. Scripts are written in a custom language called Pig Latin.
Apache Crunch
General-purpose data processing platform that raises the level of abstraction compared to plain MapReduce applications.
Apache Tez
Executes a complex directed-acyclic-graph of tasks (for example from Pig/Hive) as a single job instead of a chain of separate MapReduce jobs.
Apache Mahout
Project to build scalable machine learning applications. Uses Apache Spark as a back-end.
Nutch
Extensible and scalable web crawler. Early versions of Nutch implemented MapReduce and a distributed file system, which were later spun out to form Hadoop.

SQL-like Databases

Google BigQuery
Scalable datastore that supports SQL, powered by Google Dremel.
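A hedged sketch using the google-cloud-bigquery client library; the project, dataset, and query are placeholder assumptions, and real use requires GCP credentials:

```python
from google.cloud import bigquery

# Placeholder project; authentication comes from the environment.
client = bigquery.Client(project="my-project")

query = """
    SELECT name, COUNT(*) AS n
    FROM `my-project.my_dataset.users`
    GROUP BY name
"""

# Queries run server-side; results stream back as rows.
for row in client.query(query).result():
    print(row.name, row.n)
```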
Presto
Low-latency SQL query engine. Can read data from multiple sources and formats. Can run with or without Hadoop.
Apache Impala
Low-latency SQL query engine for data stored in HDFS and HBase.
Apache Cassandra
Distributed, wide-column database engine with a SQL-like query language (CQL).
Apache Hive
SQL database engine for batch processing built on MapReduce.

No-SQL Databases

Apache HBase
Non-relational, columnar database implementation modeled after Google's BigTable. Uses HDFS but does not rely on YARN or MapReduce.
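As an illustration, a sketch using the third-party happybase client, which talks to HBase through its Thrift gateway; the host, table, and column names are placeholder assumptions:

```python
import happybase

# Connect via HBase's Thrift gateway (placeholder host).
connection = happybase.Connection("localhost")
table = connection.table("users")  # assumes the table exists

# HBase stores cells under (row key, column family:qualifier).
table.put(b"row-1", {b"info:name": b"alice", b"info:age": b"34"})

# Point reads by row key are cheap; scans iterate key ranges.
print(table.row(b"row-1"))
for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)

connection.close()
```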
Apache Drill
Low-latency, schema-free SQL query engine for No-SQL databases and files. Can read data from multiple sources and formats.
Apache Druid
Columnar, distributed data store for real-time analytics.
Google BigTable
Google's proprietary non-relational database; HBase is modeled after it.

Data import and data streaming

Apache Sqoop
Tool to transfer bulk data between relational databases and Hadoop.
Apache Flume
Service for efficiently collecting and aggregating log data.
Apache Kafka
Distributed event-streaming platform.
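A minimal producer/consumer sketch with the third-party kafka-python client; the broker address, topic, and group id are placeholder assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append events to a topic (placeholder broker/topic).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: read the topic from the beginning as part of a
# consumer group; Kafka tracks the group's read position.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # just demonstrate a single message
consumer.close()
```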

Data formats and serialization frameworks

Avro
Serialization framework whose types and protocols are defined in JSON; primarily used in Hadoop. Does not require code generation when the schema changes.
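A small sketch with the third-party fastavro library; the schema and records are invented for illustration:

```python
import io
from fastavro import writer, reader

# The schema is plain JSON-style data; no code generation needed.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

records = [{"name": "alice", "age": 34}, {"name": "bob", "age": 36}]

# Write and read an Avro container; the schema travels with the data.
buffer = io.BytesIO()
writer(buffer, schema, records)
buffer.seek(0)
for record in reader(buffer):
    print(record)
```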
Parquet
Widespread, fast columnar data storage format.
Protocol Buffers
Serialization framework using code generation.
SequenceFile
Hadoop file consisting of a list of binary key-value pairs.
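As a sketch, PySpark can round-trip SequenceFiles from key-value RDDs; the output path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-sketch").getOrCreate()
sc = spark.sparkContext

# A SequenceFile stores binary key-value pairs, so we start
# from a pair RDD.
pairs = sc.parallelize([("alice", 34), ("bob", 36)])
pairs.saveAsSequenceFile("/tmp/users-seq")  # placeholder path

# Reading yields the key-value pairs back.
print(sc.sequenceFile("/tmp/users-seq").collect())
```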
Resilient Distributed Dataset (RDD)
Spark's primary data structure, a partitioned list of arbitrary elements.
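A minimal PySpark sketch; the data and computation are an invented example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a local list across the cluster as a partitioned RDD.
rdd = sc.parallelize(range(1, 101), numSlices=4)

# Transformations (map, filter) are lazy; the action (reduce)
# triggers the actual distributed computation.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # sum of squares of 1..100 = 338350
```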