So you managed to survive the first post and are still hungering for more? Don’t worry, I got you covered. This time around, we’ll get into more of the peripheral, optional components that might be useful to you. The format will largely be the same as the first post, so let’s get right to it.
Oozie is the staple workflow management software for Hadoop. It translates a series of tasks into a directed acyclic graph and runs them on Hadoop. It serves the same purpose as Tez, but predates Tez and runs on top of other Hadoop constructs rather than below them. Tez is basically the optimization of the Oozie idea, so I imagine Oozie will be less relevant in the future.
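To make the "series of tasks as a directed acyclic graph" idea concrete, here is a minimal Python sketch. It is not Oozie's API or its workflow format; the task names and the `execution_order` helper are made up for illustration, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "enrich": {"ingest"},
    "report": {"clean", "enrich"},
}


def execution_order(tasks):
    """Return one valid run order for the tasks.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    that is, if the workflow is not actually a DAG.
    """
    return list(TopologicalSorter(tasks).static_order())


order = execution_order(workflow)
print(order)
```

A real workflow engine would also run independent tasks (here, "clean" and "enrich") in parallel and handle retries and failures; the sketch only shows how the dependency graph constrains the order.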
Falcon provides a declarative DSL for defining workflows to run atop Hadoop clusters. It also includes the ability to define how to get data into the system and what to do with the data coming out of that system. It uses Oozie under the hood to handle the scheduling details. Additionally, Falcon handles features such as replication and failover so you can manage jobs across multiple Hadoop clusters.
A lot of work is in progress to provide security mechanisms in Hadoop. Knox provides perimeter security by acting as a gateway into the cluster: a secured, access-controlled reverse proxy in front of the various pieces of software running in the cluster.
Sentry provides table-level access controls when used in combination with Hive and HBase.
Many projects support some level of authentication and access control via Kerberos, and you can enable encryption on all network traffic in HDFS. This negatively impacts performance, but it prevents snooping on sensitive data to some extent.
There is a lot of work going on in this arena to give better, more granular access controls and data security to underlying data structures, so stay tuned.
Ambari is a UI/API/CLI used for managing clusters, and as such is more focused on operators. There is some work to provide more user-centric views within Ambari, but it’s not the primary focus.
Hue is a UI for writing MapReduce jobs, Hive queries, Pig scripts, and so on, and for visualizing monitoring data from a cluster, so it is focused more on users.
Cloudera also has a proprietary cluster manager in lieu of Ambari, but because I am sticking with the open source projects for now, I know little else about it.
Graph databases are useful when the important parts of the data are the relationships between the data. Companies like Facebook and LinkedIn use these heavily to figure out who you might know based on common relationships. Giraph is a graph processing engine that works with graph data in the Hadoop ecosystem.
GraphX is a graph processing library that runs atop Spark. It promises not only graph processing, but graph construction as well.
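The "who you might know" idea can be sketched in a few lines of plain Python. This is only an illustration of working with relationships; it does not use Giraph or GraphX, and the friendship data and the `suggest` function are hypothetical.

```python
from collections import Counter

# Hypothetical undirected friendship graph as an adjacency map.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}


def suggest(graph, person):
    """Rank people `person` is not yet friends with by mutual friend count."""
    mutual = Counter()
    for friend in graph[person]:
        for candidate in graph[friend]:
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in graph[person]:
                mutual[candidate] += 1
    return mutual.most_common()


print(suggest(friends, "ann"))  # [('dan', 2)]
```

On a graph with billions of edges this two-hop walk no longer fits on one machine, which is where a distributed graph processing engine earns its keep.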
There are several vendors who bundle a set of Hadoop software pieces and provide a distribution that is well-tested and has support available. Hortonworks and Cloudera are the two big players in this area, and MapR is probably number three. I believe there are other smaller players, but I don’t recall who they are. Let me know in the comments which ones I missed.
Apache Bigtop is a vendor-agnostic, open source project to package all Hadoop-related software in order to make operators’ jobs easier. The distributions rely on Bigtop for upstream packages, picking and choosing the pieces they wish to support, and validating that they work together.
Hadoop software uses a few protocols to communicate with other pieces of Hadoop software. Some of this is over REST APIs, but other common choices are Thrift and Avro, which both use binary encodings that provide performance advantages over the text-based payloads typical of REST APIs. The nitty-gritty details are out there should you need to know them, but just knowing that they are binary protocols is probably enough to follow along.
There is growing support for Google’s open-sourced binary protocol called Protocol Buffers, so that’s something to be aware of.
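If the binary versus text distinction feels abstract, this small Python sketch may help. It is not the Thrift, Avro, or Protocol Buffers wire format; it just encodes one made-up record as JSON text and as a fixed binary layout using the standard-library `struct` module, then compares the sizes.

```python
import json
import struct

# Hypothetical record: (user_id, timestamp, score).
record = (123456, 1700000000, 98.6)

# Text encoding: field names and digits are all spelled out as characters.
text_bytes = json.dumps(
    {"user_id": record[0], "timestamp": record[1], "score": record[2]}
).encode("utf-8")

# Binary encoding: little-endian uint32, uint64, float64 with no padding,
# so 4 + 8 + 8 = 20 bytes. Both sides must agree on this layout up front.
LAYOUT = "<IQd"
binary_bytes = struct.pack(LAYOUT, *record)

# Decoding reverses the packing using the same agreed layout.
decoded = struct.unpack(LAYOUT, binary_bytes)

print(len(text_bytes), len(binary_bytes))
print(decoded == record)
```

The trade-off is that the binary form is unreadable without the layout, which is why these frameworks ship a schema or interface definition alongside the data.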
Machine learning is a common use case for big data systems, and Hadoop is no exception. Mahout is the most commonly mentioned machine learning system that works with Hadoop. Although it was originally designed to work with MapReduce, and still supports the old algorithms that work with it, development has shifted focus to Spark-based algorithms. All new development work will use Spark.
MLlib is similar to Mahout but works with Spark instead of MapReduce. Since Mahout has shifted focus to Spark as well, there is now not a lot of difference from a high-level perspective.
So, you have a lot of data in existing data stores, and you want to use all this wonderful software to analyze it? You can either use one of the filesystem abstractions, if your data is in a format that Hadoop can speak to; write your own; or use software designed to translate your data into a format that HDFS can consume. Sqoop is software for taking data from relational databases and populating it into HDFS for you, and Flume is designed to aggregate log data and populate it into HDFS.
Kafka is used for streaming log data into Hadoop. It seems to be used most commonly to stream data into systems like Spark or Storm, which then process it and pass it along to HDFS and/or other storage layers.
ZooKeeper is a distributed key/value store that provides locking semantics and can be used for both configuration management and message queuing. A lot of Hadoop-related software takes advantage of it for cluster coordination.
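To give a feel for how a key/value store can provide locking, here is a toy, single-process Python sketch. It is not the ZooKeeper API and it is not distributed; the `ToyCoordinator` class, its method names, and the key path are all hypothetical. The point is only that an atomic "create this key if it does not exist" operation is enough to build a lock.

```python
import threading


class ToyCoordinator:
    """In-memory stand-in for a coordination service (illustration only)."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()  # makes each operation atomic

    def create_if_absent(self, key, value):
        """Store value under key only if key is unused; return True on success."""
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete_if_owner(self, key, value):
        """Remove key only if it still holds value; return True on success."""
        with self._guard:
            if self._data.get(key) == value:
                del self._data[key]
                return True
            return False


store = ToyCoordinator()
lock_key = "/locks/nightly-job"  # hypothetical key name

print(store.create_if_absent(lock_key, "worker-1"))  # True: worker-1 holds the lock
print(store.create_if_absent(lock_key, "worker-2"))  # False: already taken
print(store.delete_if_owner(lock_key, "worker-1"))   # True: lock released
print(store.create_if_absent(lock_key, "worker-2"))  # True: worker-2 holds it now
```

A real coordination service also has to cope with the lock holder crashing, which this sketch ignores entirely.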
Solr is a search engine that was built on top of Lucene. It is seeing increased usage in the Hadoop ecosystem, as it can natively use HDFS for storing its indexes. This lets you use other Hadoop tools to generate and manipulate the index data, but then have a powerful search interface into that data as well.
And there you have it; Hadoop in a nutshell. While I feel like it’s a fairly thorough primer, there is definitely more software out there in this ecosystem. Hopefully I’ve gotten you a little farther down the rabbit hole and you can take it from here.