Big Data Processing Systems, Oversimplified

There are many 'names' thrown around in the big data space. This post tries to describe each of them in a few sentences. Most of the information is taken from each system's research paper. I'll add new entries as I read more papers.

Missing something? Found inaccuracies? Is a description downright bad? Suggestions and corrections are much appreciated: ongrios.wijaya@gmail.com.

Hadoop

Or rather, the Hadoop Distributed File System (HDFS). Too much data to fit on a single computer? Distribute it across multiple computers instead. Also, keep track of where each piece lives and replicate it for durability.
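
A toy sketch of that idea (not the real HDFS API; the block size, node names, and placement policy below are made up for illustration): split the data into blocks, and keep a namenode-like table of which machines hold each replica.

```python
# Toy sketch of the HDFS idea: split data into blocks, place each block
# on several machines. Not the real HDFS API; numbers are illustrative.
BLOCK_SIZE = 4          # real HDFS uses ~128 MB blocks; 4 bytes keeps the demo readable
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node-a", "node-b", "node-c", "node-d"]

def place_blocks(data: bytes) -> dict:
    """Split data into blocks and record which nodes hold each replica."""
    namenode_metadata = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # naive round-robin placement; the real NameNode is rack-aware
        replicas = [NODES[(block_id + r) % len(NODES)] for r in range(REPLICATION)]
        namenode_metadata[block_id] = {"bytes": block, "replicas": replicas}
    return namenode_metadata

print(place_blocks(b"hello big data world"))
```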

MapReduce

Google engineers wrote lots of programs to process large datasets, and kept rewriting the same components: partitioning, fault tolerance, load balancing, etc. The MapReduce framework abstracts all of that away, and the user writes data processing programs using the map and reduce pattern.
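
A minimal, single-process sketch of the pattern, using word count as the canonical example (the driver and function names here are mine; the real framework supplies the distributed partitioning, shuffling, and retries):

```python
from collections import defaultdict

# Toy word count in the map/reduce pattern, single-process.

def map_fn(line: str):
    for word in line.split():
        yield word, 1                      # emit (key, value) pairs

def reduce_fn(word: str, counts: list) -> int:
    return sum(counts)                     # combine all values for one key

def run_mapreduce(lines: list) -> dict:
    shuffled = defaultdict(list)
    for line in lines:                     # "map" phase
        for key, value in map_fn(line):
            shuffled[key].append(value)    # "shuffle": group values by key
    return {k: reduce_fn(k, v) for k, v in shuffled.items()}  # "reduce" phase

print(run_mapreduce(["big data is big", "data is data"]))
# -> {'big': 2, 'data': 3, 'is': 2}
```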

Pig

Pig Latin, its language, gives an imperative twist to declarative SQL, i.e. we write what we want to do line by line. Compiles down to MapReduce jobs. Also provides a debugging environment for each step of the MapReduce subtasks by sampling the dataset.
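
To get a feel for that line-by-line style, here is a rough analogy in plain Python rather than actual Pig Latin (the data and field names are made up): each statement names an intermediate dataset, instead of one big declarative query.

```python
# Pig-Latin-ish dataflow expressed as plain Python over a list of dicts.
# Each line produces a named intermediate relation, which could be
# inspected or sampled on its own (Pig's debugging pitch).
records = [
    {"user": "ana", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/search", "ms": 450},
    {"user": "ana", "page": "/search", "ms": 300},
]

slow = [r for r in records if r["ms"] > 200]      # like: FILTER records BY ms > 200
by_user = {}                                      # like: GROUP slow BY user
for r in slow:
    by_user.setdefault(r["user"], []).append(r)
avg_ms = {u: sum(r["ms"] for r in rs) / len(rs)   # like: FOREACH ... GENERATE AVG(ms)
          for u, rs in by_user.items()}

print(avg_ms)   # {'bob': 450.0, 'ana': 300.0}
```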

Hive

An SQL interface which compiles down into a DAG of MapReduce jobs. Adds some SQL optimizations, enabled by the Hive Metastore, which stores table statistics.

Impala

Provides an efficient query execution layer with low-level optimizations in C++. Each node can receive user queries and distribute the query subtasks to other nodes (P2P rather than MapReduce's typical master-follower architecture).

Lucene / Solr

Lucene processes text into a more useful form through various text analysis and cleaning algorithms (tokenization, lowercasing, stemming, etc.). It turns the results into indexes too, the typical structure for robust and performant text search.
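
The index in question is an inverted index: a map from each term to the documents that contain it. A toy version (nothing like Lucene's actual Java API, and skipping real analysis like stemming) looks like this:

```python
from collections import defaultdict

# Toy inverted index: term -> set of document ids. Real Lucene also stores
# positions and frequencies, and runs proper analyzers over the text.
docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick quick dog",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():      # extremely naive "analysis"
        index[term].add(doc_id)

def search(*terms: str) -> set:
    """Return ids of documents containing all query terms."""
    return set.intersection(*(index.get(t, set()) for t in terms))

print(search("quick", "dog"))   # -> {2}
```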

Solr sits on top of Lucene. For what? Left as an exercise for the reader.

Spark

MapReduce-based systems were too slow because the framework requires the input and output of map/reduce tasks to be persisted on disk. This especially hurts ML workloads, which keep iterating on the same datasets. Spark said: let's just keep the data in memory and shuffle it around as needed. They named that in-memory dataset abstraction the RDD (resilient distributed dataset).
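
Roughly, in PySpark (assuming a local Spark installation; cluster setup omitted), you build an RDD once, cache it in memory, and keep reusing it across iterations without touching disk again:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")   # local mode; a real cluster would use a cluster URL

# Build an RDD and keep it in memory across iterations.
points = sc.parallelize([(1.0, 2.0), (2.0, 3.5), (3.0, 7.0)]).cache()

# Iterative workloads (e.g. ML training loops) reuse the cached RDD
# instead of re-reading the input on every pass.
for _ in range(5):
    mean_y = points.map(lambda p: p[1]).mean()

print(mean_y)
sc.stop()
```

The .cache() call is the whole point here: without it, each iteration would recompute the RDD from its source.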

Presto/Trino

A performant distributed SQL execution engine that doesn't require a Hadoop environment. Versatile: it accepts many data sources without requiring format unification before execution (i.e. no need to LOAD the data into the system's own format first).

DuckDB

A full-fledged analytical system for a single-node / embedded environment. Basically SQLite, but optimized for OLAP through its SQL optimizer, execution engine, and data format. Still allows row-based modifications though.
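
With the duckdb Python package, that embedded, SQLite-like feel looks roughly like this (the tiny table below is made up; DuckDB can also query Parquet/CSV files directly with SQL):

```python
import duckdb

# In-process like SQLite: no server; the database lives in a file or in memory.
con = duckdb.connect()                      # default: an in-memory database

con.execute("CREATE TABLE events (kind VARCHAR, amount INTEGER)")
con.execute("INSERT INTO events VALUES ('click', 3), ('view', 10), ('click', 7)")

rows = con.execute(
    "SELECT kind, SUM(amount) FROM events GROUP BY kind ORDER BY kind"
).fetchall()
print(rows)    # [('click', 10), ('view', 10)]
```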

Dryad

A distributed execution engine with a graph programming model, where each vertex defines an operation and each edge defines the data movement (shuffle) between them.

(This one seems to precede MapReduce, but I put it down here because I don’t think I can describe this one well.)


Extra notes for further discussion

  • Is X fast enough? In the MapReduce era (early 2000s), hours or tens of minutes seem to be what counted as 'fast'.

The later 2010s era (Presto, Spark) considers seconds, or even sub-seconds, as fast.

Some systems specialize in certain domains (e.g. Lucene in textual data; others in time series or graph data, which I will add soon).

Of course, sometimes anything works well enough for your use case (Postgres as an OLAP DB, writing Python scripts, some kind of distributed processing engine, a cloud vendor's systems). So the key is to benchmark for your use case and consider the amount of user and operator headache. If it fits, it fits.

  • Language

SQL everywhere. Or, sometimes, a programming-language library interface like dataframes.

  • Execution environment

To JVM or not to JVM? LLVM JIT? SIMD? GPU? Many interesting things to write about.

  • Operation optimizations

Many interesting things here, namely predicate pushdown, join reordering, column pruning, and subquery optimizations. Will write more about these in the future; a tiny sketch of predicate pushdown follows below.
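
As a taste, here is a toy illustration of predicate pushdown in plain Python (no real optimizer involved; the data is made up): applying the filter before the join means the join only touches rows that can survive the WHERE clause, which is exactly the rewrite an optimizer does for you.

```python
# Toy predicate pushdown: same result, but filtering *before* the join
# means the join only ever sees rows that can survive the WHERE clause.
orders = [{"id": i, "user": i % 100, "amount": i % 7} for i in range(10_000)]
users  = [{"user": u, "country": "ID" if u % 2 else "SG"} for u in range(100)]

# Naive plan: join everything, then filter.
joined = [(o, u) for o in orders for u in users if o["user"] == u["user"]]
naive  = [(o, u) for o, u in joined if u["country"] == "ID" and o["amount"] > 5]

# Pushed-down plan: filter each side first, then join the small remainders.
small_u = [u for u in users  if u["country"] == "ID"]
small_o = [o for o in orders if o["amount"] > 5]
pushed  = [(o, u) for o in small_o for u in small_u if o["user"] == u["user"]]

assert len(naive) == len(pushed)
print(len(pushed), "matching rows either way; the second plan joins far fewer rows")
```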