So you wanna learn you some Hadoop, eh? Well, get ready to drink from the firehose, because the Hadoop ecosystem is crammed full of software, much of which duplicates efforts, and much of which is named so similarly that it’s very confusing for newcomers.
The software I work on at my job provisions and manages Hadoop clusters on the cloud. There’s one problem, though: I know next to nothing about Hadoop! I’ve spent the last month or so trying to learn as much as I can as quickly as I can so I can provide useful input as to the direction of our product. What follows is what I’ve gleaned so far.
Due to the large amount of information involved here, I’ve split the post into two parts. The first one covers the most common and core pieces of functionality, while the second will cover more peripheral or optional features. First, I’ve provided a chart categorizing the numerous projects, then I’ll explain in a little more detail, but still high level, what each piece of the ecosystem is for. I hope this will help guide other newbies who are ramping up to have some idea of what’s going on until they can dig in deeper on specific subjects. This is not meant to be exhaustive, and certainly won’t stay 100 percent up-to-date since new software comes into the ecosystem fairly frequently, but it should give you a good grounding to be able to follow conversations.
NOTE: I apologize in advance for any mistakes or omissions. I’m new to this and so my understanding might be slightly off due to the sheer volume of information I’ve absorbed to prepare this. Please let me know in the comments and I’ll fix things up.
Data storage is what it sounds like. It’s where your data lives. The default storage backend for Hadoop is HDFS (Hadoop Distributed File System), which provides a singular logical filesystem across a cluster of machines. It can grow or shrink on demand, has built-in replication so data is safe from failure, and is fault-tolerant. However, Hadoop is flexible. It lets you use other filesystems like GlusterFS, and it has a number of plugins for a number of other data storage layers. MapR wrote its own distributed filesystem that uses an NFS interface and relies on local filesystems rather than using Namenodes, and they claim some impressive performance numbers with their system. You can also plug it in to object stores like Amazon S3 and OpenStack Swift, as well as distributed databases like MongoDB and Cassandra. HDFS will be your primary concern, but you should be aware that these other options exist.
The primary data processing framework for big data on Hadoop is MapReduce, which is based on the paper published by Google in 2004. The basic idea is that you process data in parallel by passing the processing code along to the location where the data lives (since the data is larger than the code in these cases). You map your function across all the data in parallel, and then reduce those results into a singular resultset. I’m oversimplifying, but that’s the basic idea. This is the predominant way to process data in Hadoop, and for a long time was the only way to process data.
Pig provides a scripting interface to running MapReduce jobs by exposing a DSL (domain-specific language) that abstracts away the parallelism of the system for you. You simply write a script that does what you need with the data, and Pig figures out the specifics of running it for you and returns the results. Newer versions of Pig use Tez rather than MapReduce to do the heavy lifting.
Spark is a high-performance parallel data processing engine. It is an alternative to Tez in many respects, as it also provides a way to process data using a DAG (directed acyclic graph), but it’s also an alternative to Pig as it can be run natively within Java, Scala, or Python programs to execute functions across the data in your cluster with ease.
Data streaming provides the ability to manipulate data as it enters or exits the system. Think of it as the big data version of an RSS reader, except you can programmatically manipulate the feed as it’s generated. The two big players in this arena are Spark Streaming and Storm. Spark Streaming obviously works with Spark, and Storm uses its own processing engine called Nimbus (although there is now support for Storm on YARN (Yet Another Resource Negotiator)).
Samza is another option here. It’s fairly similar to Storm, but uses YARN directly rather than having its own processing engine.
Columnar databases provide a column-oriented view of data in Hadoop. This is a popular form of NoSQL solution based on Google’s BigTable paper. If you aren’t familiar with this sort of database, please consult the Wikipedia entry to learn more, as I can’t possibly explain it better. Cassandra is a popular player in this space outside of the Hadoop ecosystem, and while it can be used within Hadoop as well, the most common player in this space for Hadoop is Hbase.
Hypertable is fairly similar, but with better performance (largely due to being written in C++ versus Java).
Accumulo focuses on cell-level security and is otherwise very similar to Hbase, meaning that you can prevent unauthorized access to individual pieces of data. The NSA developed it originally and they promise it doesn’t provide a backdoor to your data.
Cassandra is another columnar database that can be integrated into the Hadoop ecosystem, but it can be and usually is used as a standalone product.
Despite all the NoSQL hype these days, many people are still most comfortable with a SQL interface, and there are plenty of projects that provide such an interface to Hadoop. Hive is the oldest and most popular, but there are plenty of other players entering this space all the time.
Most of the other SQL engines differ from Hive in that they attempt to provide a more real-time query interface, where Hive is more built around batch processing of data via SQL.
Shark is basically Hive on Spark. It uses Spark rather than Tez/YARN to do its dirty work, but otherwise is very similar to Hive.
Impala is a high performance SQL engine that works directly with Hbase or HDFS.
Phoenix provides a SQL interface to Hbase, optimizing the query plan for you to utilize Hbase effectively.
Presto can amalgamate data from many data sources such as Hive, Hbase, and relational databases in a singular SQL interface.
BlinkDB is particularly promising, as it only reads a subset of the data (sampling) to provide an approximate result with a statistical representation of the error rate and confidence level. This lets you get results blindingly fast, and tune the queries to get the best performance/accuracy trade-off for your use case.
Hcatalog provides a tabular view of underlying data that can be used by SQL engines and is not really a SQL engine in and of itself. It has since been merged into Hive, but you still see references to it as a separate piece.
There are a lot of frameworks in the ecosystem, but the two you need to know most about are YARN and Tez. YARN is a framework for running distributed processing jobs. It underlies all modern Hadoop stacks, sitting between software and the data storage. Hive, Pig, Storm, etc., all utilize YARN to manage the resources needed to execute their jobs.
Tez converts jobs into directed acyclic graphs and determines the optimal way to run through those graphs on top of YARN. Current implementations of MapReduce, Pig, and Hive use Tez to do the planning and execution work for them. Prior to Tez, other software had to use MapReduce directly, but with Tez they get the underlying execution engine without requiring them to translate their jobs into full MapReduce jobs, thus providing a significant performance increase. It was developed primarily to benefit Hive, but is generic enough to be used by other projects as well.
That’s it for now. That covers most of the major pieces of the ecosystem, as well as all the core features. Up next we’ll explore some additional pieces that will likely come in handy.