The book Hadoop: The Definitive Guide by Tom White builds countless examples around NCDC’s sizable Integrated Surface Database. The dataset contains hourly temperature readings from thousands of weather stations around the globe, dating back to 1901. Each measurement is represented by a cryptic-looking, fixed-width string.
Processing the dataset requires parsing each record to extract the information of interest. For the examples in the book, usually only the year and temperature values are needed, not the additional fields encoded in the string such as the weather station identifier or its location. As a reader, one might be tempted to pre-process the dataset and store the year ↦ temperature mappings in a smaller dataset; subsequent analyses of this smaller dataset should then be quite fast.
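As a minimal sketch of that pre-processing step, the mapper below emits one (year, temperature) pair per record. The fixed-width offsets (year at columns 15–19, temperature at 87–92, quality flag at 92) follow the parsing examples in the book; the class name, key type, and `MISSING` constant are my own choices, not part of the article’s code.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reduces each NCDC record to a (year, temperature) pair.
// Offsets are assumptions taken from the book's parsing examples.
public class TemperatureMapper
    extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int year = Integer.parseInt(line.substring(15, 19));
    // The sign occupies column 87; the four temperature digits follow.
    int airTemperature;
    if (line.charAt(87) == '+') {
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Skip missing readings and records with a bad quality flag.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new IntWritable(year), new IntWritable(airTemperature));
    }
  }
}
```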
For this article, I’ve produced such a reduced dataset and stored the results in four different output formats:

- plain text,
- uncompressed SequenceFile,
- record-level compressed SequenceFile,
- block-level compressed SequenceFile.

SequenceFile is a binary serialization format frequently used as the input or output format of MapReduce tasks. The storage requirements of these four variants were quite surprising. The MapReduce code to generate the slimmed-down dataset is provided as a Maven package and a compiled jar.
The following table summarizes the sizes of the reduced dataset consisting only of year ↦ temperature pairs. The original dataset (1901–2000) is 267.5 GB.
Why is the record-compressed version so large, even larger than the plain text and the uncompressed SequenceFile? With record-level compression, every temperature value is compressed independently. In our case, that is just a single serialized IntWritable, so the compression overhead is paid for every single value. Any storage efficiency gained by the binary format is offset by the per-record overhead, leading to an overall size increase. Compression generally works better on larger chunks of data.
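To make the per-record overhead concrete, here is a minimal sketch that writes record-compressed pairs directly through the SequenceFile API. The writer options and codec are from the standard Hadoop API; the file path and sample values are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DeflateCodec;
import org.apache.hadoop.util.ReflectionUtils;

// With CompressionType.RECORD, each append() compresses its tiny
// IntWritable value on its own -- the source of the per-record overhead.
public class RecordCompressedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("max-temp.record.seq"); // hypothetical output path
    CompressionCodec codec = ReflectionUtils.newInstance(DeflateCodec.class, conf);
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(IntWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, codec))) {
      writer.append(new IntWritable(1901), new IntWritable(-128)); // sample pair
    }
  }
}
```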
In contrast, the block-level compressed SequenceFile applies the compression algorithm to a whole block of records, keys included. The block size is controlled by the Hadoop property io.seqfile.compress.blocksize, which defaults to 1 MB. With the default block size, the compression pays off in terms of storage efficiency, making the block-level compressed SequenceFile the smallest of the four choices.
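For reference, here is a sketch of how a MapReduce job could request block-level compression and set the block size explicitly. The property name is the one discussed above and the format classes are from the standard Hadoop MapReduce API; the driver class and job name are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Configures block-level compressed SequenceFile output; the block size
// is set to its 1 MB default here just to make the knob visible.
public class BlockCompressedDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.seqfile.compress.blocksize", 1024 * 1024); // 1 MB default
    Job job = Job.getInstance(conf, "ncdc-year-temperature"); // hypothetical name
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
    // ... mapper, reducer, and input/output paths configured elsewhere ...
  }
}
```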
Depending on the MapReduce job, compressing the input can also improve processing speed, provided the bottleneck is disk I/O rather than CPU.
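Reading the reduced dataset back requires no extra configuration: Hadoop detects the codec from the SequenceFile header and decompresses records transparently, and block-compressed SequenceFiles remain splittable across map tasks. A sketch of an identity job that dumps the pairs back out as text, with class names from the standard API and made-up paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Identity job: reads the (year, temperature) SequenceFile and writes the
// pairs as text. Decompression of the input is handled by Hadoop itself.
public class ReadReducedDataset {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-reduced"); // hypothetical name
    job.setJarByClass(ReadReducedDataset.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```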