Big Data and OpenStack are probably the two most mentioned terms in the data center today. It has become increasingly apparent that organizations must deal with such massive amounts of data that traditional methods of scaling up — a bigger server, more memory, or more disk — are no longer feasible.
Nature of the problem
The Big Data problem is typically characterized by the “three Vs” — Volume, Velocity, and Variety — each of which is self-evident and somewhat interrelated.
The non-disruptive solution to this problem is to “scale up.” However, as the following graph illustrates, the time and cost to process the data grow exponentially as data volumes increase.

To process the data within reasonable resource and time budgets, a “scale out” approach offers superior results, as illustrated below. The goal is to flatten both the cost curve and the application response time.
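The shape of the two curves can be sketched with a toy cost model. This is purely illustrative — the quadratic growth rate, the terabytes-per-node figure, and the per-node cost are assumptions chosen to mimic the graphs, not measured data.

```python
# Toy model contrasting scale-up and scale-out (illustrative numbers only).
# Assumption: scale-up cost grows superlinearly with data size, while
# scale-out cost grows roughly linearly, since work is spread across
# interchangeable commodity nodes.

def scale_up_cost(data_tb):
    """Cost of one ever-bigger server: modeled as quadratic in data size."""
    return data_tb ** 2

def scale_out_cost(data_tb, tb_per_node=2, cost_per_node=1.0):
    """Cost of adding commodity nodes: linear in the number of nodes."""
    nodes = -(-data_tb // tb_per_node)  # ceiling division
    return nodes * cost_per_node

if __name__ == "__main__":
    for tb in (2, 8, 32):
        print(f"{tb} TB: scale-up={scale_up_cost(tb)}, "
              f"scale-out={scale_out_cost(tb)}")
```

Under any model of this shape, the scale-out curve eventually falls far below the scale-up curve, which is the “flattening” the graphs depict.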
Hadoop takes this approach to Big Data by relying on commodity servers and providing an infrastructure that replicates data and tolerates the component faults that are inherent in distributed systems.
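The programming model behind this is Map/Reduce, which can be sketched in plain Python with the classic word-count example — no Hadoop cluster involved. This is a minimal sketch of the pattern, not Hadoop's actual API; in a real job, the framework distributes the map tasks, performs the shuffle, and runs the reducers across nodes.

```python
# A minimal, single-process sketch of the Map/Reduce pattern that
# Hadoop popularized: the classic word count.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, as independent records."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

Because each map record is independent and each reduce key is independent, both phases parallelize naturally across commodity nodes — which is exactly why the model fits the scale-out approach.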
Other Requirements and Ecosystem
Processing data in a timely fashion is a foremost requirement. Although batch processing will meet the needs of many organizations, there are other critical requirements as well. For instance:
- Minimize data movement
- Ability to encrypt data (for compliance reasons)
- Ability to work with a variety of data loaders (ETL)
- Data archival
- Ability to do real-time queries
- Ability for business (non-technical) folks to write queries
- Ability to integrate with existing database/data warehouses
This has led to an explosion in the ecosystem of vendors and products that embrace the horizontal scaling approach to try to meet the requirements outlined above. The downside of these approaches is that they eschew time-tested development techniques such as data normalization, joins, and transactions (the ACID properties).
The taxonomy of these approaches includes (but is not limited to):
- Hadoop: Based on a file system called Hadoop Distributed File System (HDFS) and related technologies such as Map/Reduce
- NoSQL: MongoDB, Cassandra, CouchDB, Couchbase and so on
- NewSQL: InnoDB, Scalebase and newer technologies like NuoDB and so on
Hadoop is by far the most popular, but not the only elephant in the room. These approaches all scale out and are therefore governed by the CAP theorem (or Brewer’s conjecture). Unlike relational databases, which until recently were the universal data platform, none of these technologies or products meets all of the business requirements outlined above.
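The CAP theorem says a distributed store cannot simultaneously guarantee consistency, availability, and partition tolerance. The trade-off can be illustrated with a toy two-replica store — every name here is hypothetical and the model is deliberately simplistic, but it shows why a system must choose between refusing writes (consistency) and accepting divergence (availability) once a partition occurs.

```python
# A toy illustration of the CAP trade-off: two replicas of one key space.
# During a network partition, a "CP" store refuses writes it cannot
# replicate, while an "AP" store accepts them and lets replicas diverge.

class Replica:
    def __init__(self):
        self.data = {}

class TwoReplicaStore:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (hypothetical labels)
        self.partitioned = False    # True simulates a network partition
        self.a, self.b = Replica(), Replica()

    def write(self, key, value):
        self.a.data[key] = value
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: undo and fail rather than diverge.
                del self.a.data[key]
                return False
            # Availability first: accept locally; replicas now disagree.
            return True
        self.b.data[key] = value
        return True

    def consistent(self):
        return self.a.data == self.b.data
```

Real systems sit at different points on this spectrum (and offer repair mechanisms the sketch omits), which is one reason no single product meets every requirement on the list above.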
Irrespective of the data platform, there is a need to stand these systems up in a cloud, which is where OpenStack and the private clouds derived from it (such as the Rackspace Private Cloud) can help standardize operations in the data center and speed up development. For example, data security and privacy might be a critical requirement that only a private cloud can meet.
There are open questions about virtualization and about keeping data local to the compute, but most approaches involve multiple technologies installed on an elastic infrastructure running on OpenStack or a hybrid cloud.
An example of multiple products and nodes running on a Rackspace Hybrid Cloud is illustrated below and discussed in a talk on this topic presented at the OpenStack Design Summit.
Multiple products and technologies running in a hybrid cloud can mine information from the Rackspace public cloud and use that intelligence to monitor and tune the public cloud itself.
The Big Data problem is multi-faceted. The traditional relational database approach has given way to a scale-out approach. To manage this elastic infrastructure and meet the varying needs of data, a multi-technology, hybrid approach based on OpenStack may be the best fit — with an eye to the future.