The book Hadoop: The Definitive Guide by Tom White builds many of its examples around NCDC’s sizable Integrated Surface Database. The dataset contains hourly temperature readings from thousands of weather stations around the globe, dating back to 1901. Each measurement is represented by a cryptic-looking string:

0057032620999991927011506004+55017-001417FM-12+003399999V0202301N00101004501CN0100001N9+00281+99999098261ADDAY161999GF108991081061004501999999MD1310061+9999MW1161

Processing the dataset requires parsing the string to extract the information of interest. To follow the examples in the book, one often needs only the temperature values, not the additional information encoded in the string such as the weather station identifier or its location. As a reader, one might be tempted to pre-process the dataset and store the year → temperature mappings in a smaller dataset. Subsequent analysis of this smaller dataset should be quite fast.
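
For reference, here is a minimal parsing sketch that follows the fixed-width field offsets used in the book's MaxTemperature example; the class and method names are mine, not the book's:

```java
public class NcdcRecordParser {

    /** The four-digit year, e.g. "1927" for the sample record above. */
    public static String parseYear(String record) {
        return record.substring(15, 19);
    }

    /**
     * The air temperature in tenths of a degree Celsius,
     * e.g. "+0028" becomes 28, i.e. 2.8 °C.
     */
    public static int parseTemperature(String record) {
        // Strip a leading '+', which Integer.parseInt rejects on older JDKs.
        if (record.charAt(87) == '+') {
            return Integer.parseInt(record.substring(88, 92));
        }
        return Integer.parseInt(record.substring(87, 92));
    }
}
```

The book's version additionally discards missing readings (9999) and records with bad quality codes; those checks are omitted here for brevity.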

For this article, I’ve produced such a reduced dataset and stored the results in four different output formats:

  • plain text,
  • uncompressed SequenceFile,
  • compressed SequenceFile (record-level compression), and
  • block-level compressed SequenceFile.

SequenceFile is a binary serialization format frequently used as the input or output format of MapReduce jobs. The storage requirements of the four cases turned out to be quite surprising. The MapReduce code to generate the slimmed-down dataset is provided as a Maven package and a compiled JAR.
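
The linked package contains the actual implementation; the sketch below only illustrates how such a job might be wired up (the class names are mine, and it reuses the NcdcRecordParser sketch from above). Switching between the four output formats comes down to a few driver settings:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ReduceDatasetDriver {

    /** Map-only step: turns each NCDC record into a (year, temperature) pair. */
    public static class YearTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String year = record.substring(15, 19);
            int temperature = NcdcRecordParser.parseTemperature(record);
            context.write(new Text(year), new IntWritable(temperature));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ncdc-year-temperature");
        job.setJarByClass(ReduceDatasetDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(YearTemperatureMapper.class);
        job.setNumReduceTasks(0);                    // map-only job
        job.setOutputKeyClass(Text.class);           // year
        job.setOutputValueClass(IntWritable.class);  // temperature in tenths of °C

        // Case 1 (plain text) is simply the default TextOutputFormat.
        // Cases 2-4 switch to SequenceFile output:
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // (a) uncompressed SequenceFile
        FileOutputFormat.setCompressOutput(job, false);

        // (b) record-compressed: uncomment to compress each value on its own
        // FileOutputFormat.setCompressOutput(job, true);
        // FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        // SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.RECORD);

        // (c) block-compressed: as (b), but with CompressionType.BLOCK
        // SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that if output compression is enabled without specifying a type, Hadoop defaults to record-level compression.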

The following table summarizes the sizes of the reduced dataset, which consists only of (year, temperature) pairs. The original dataset (1901–2000) is 267.5 GB.

Format                                   Size
plain text                                9.4 GB
uncompressed SequenceFile                17.1 GB
compressed SequenceFile                  18.0 GB
block-level compressed SequenceFile       1.0 GB

Why is the compressed version so large, even larger than the plain text and the uncompressed SequenceFile? With record-level compression, every temperature value is compressed independently, and in our case a value is just a single serialized IntWritable. Compressing such a tiny payload introduces overhead for every single record, so any storage efficiency gained by the binary format is more than offset by the compression overhead, leading to an overall size increase. Compression generally works better on larger chunks of data.
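
The effect can be reproduced outside MapReduce by writing the same pairs once per compression type with the SequenceFile API directly. A minimal sketch, using the default deflate codec; the file names and record count are arbitrary, and the exact byte counts will differ from the table above:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class CompressionOverheadDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Write the same (year, temperature) pairs once per compression type
        // (NONE, RECORD, BLOCK) and print the resulting file sizes.
        for (CompressionType type : CompressionType.values()) {
            Path path = new Path("years-" + type.name().toLowerCase() + ".seq");
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class),
                    SequenceFile.Writer.compression(type))) {
                for (int i = 0; i < 1_000_000; i++) {
                    writer.append(new Text("1927"), new IntWritable(28));
                }
            }
            FileSystem fs = path.getFileSystem(conf);
            System.out.printf("%-6s %,d bytes%n", type, fs.getFileStatus(path).getLen());
        }
    }
}
```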

In contrast, the block-level compressed SequenceFile applies the compression algorithm to a whole block of records, keys included. The block size is controlled by the Hadoop configuration parameter io.seqfile.compress.blocksize, which defaults to 1 MB. Even at the default block size, the codec sees enough data at once for the compression to pay off, making the block-level compressed SequenceFile by far the smallest of the four choices.
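
If 1 MB buffers are too small for a particular workload, the block size can be raised in the job configuration. The 4 MB value below is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeTuning {
    /** Hypothetical tuning: buffer ~4 MB of records per block instead of ~1 MB. */
    public static Configuration withLargerBlocks() {
        Configuration conf = new Configuration();
        // Uncompressed bytes buffered before a compressed block is written out;
        // larger blocks give the codec more context at the cost of writer memory.
        conf.setInt("io.seqfile.compress.blocksize", 4 * 1024 * 1024);
        return conf;
    }
}
```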

Depending on the MapReduce job, compressing the input can also improve processing speed when the bottleneck is disk I/O rather than CPU.