Big Data: BIG and Bigger

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why?

More accurate analyses may lead to more confident decision making. And better decisions can mean better financial management, operational efficiencies, better healthcare diagnosis, more efficient energy consumption, and reduced risk.

Big data defined

As far back as 2001, industry analyst Doug Laney articulated what is now the mainstream definition of big data: the three Vs of volume, velocity and variety.

Volume. Many factors contribute to the increase in data volume.

  • Transaction-based data stored through the years.
  • Unstructured data streaming in from social media.
  • Increasing amounts of sensor and machine-to-machine data being collected.

In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

Variety. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is hard.

There are two additional dimensions to consider as well:

Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved.

Complexity. Today’s data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control.
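To make the linking and matching problem concrete, here is a minimal sketch in Python, using made-up customer records rather than any particular tool: the same customer appears in two systems under slightly different spellings, and cleansing the key is what makes the match possible.

    # Toy example: one customer recorded in two systems with slightly
    # different formatting. Linking them requires normalizing keys first.
    crm_records = [{"customer": " Acme Corp ", "city": "Berlin"}]
    billing_records = [{"customer": "ACME CORP", "balance": 1250.0}]

    def normalize(name: str) -> str:
        # Cleanse: trim whitespace and ignore case before matching.
        return name.strip().lower()

    # Match and merge records across the two systems on the cleansed key.
    merged = {}
    for rec in crm_records:
        merged[normalize(rec["customer"])] = dict(rec)
    for rec in billing_records:
        merged.setdefault(normalize(rec["customer"]), {}).update(
            {k: v for k, v in rec.items() if k != "customer"}
        )

    print(merged)  # one linked record keyed by 'acme corp'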

In 2002, Peter Lyman and Hal R. Varian at UC Berkeley published “How Much Information?”, the first comprehensive study to quantify, in computer-storage terms, the total amount of new and original information created in the world annually. The study found that in 1999 the world produced about 1.5 exabytes of unique information, or about 250 megabytes for every man, woman and child on earth.

Five years ago, few people had heard the phrase “Big Data.” Now, it’s hard to go a day without seeing it in newspapers or publications. Meanwhile, nobody seems quite sure exactly what the phrase means, beyond a general impression of the storage and analysis of unfathomable amounts of information.

In February 2011, Martin Hilbert and Priscila Lopez published “The World’s Technological Capacity to Store, Communicate, and Compute Information” in Science. They estimated that the world’s information storage capacity grew at a compound annual growth rate of 25% between 1986 and 2007. They also estimated that in 1986, 99.2% of all storage capacity was analog, but that by 2007, 94% of storage capacity was digital, a complete reversal of roles.
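To put the 25% figure in perspective, a quick back-of-the-envelope calculation (a sketch, not taken from the paper itself) shows what that growth rate compounds to over the 1986-2007 window:

    # A 25% compound annual growth rate sustained for 21 years implies
    # roughly a hundred-fold increase in total storage capacity.
    years = 2007 - 1986          # 21 years of growth
    cagr = 0.25                  # 25% compound annual growth rate
    growth_factor = (1 + cagr) ** years
    print(f"Implied growth over {years} years: about {growth_factor:.0f}x")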

In December 2012, IDC and EMC estimated the size of the digital universe to be 2,837 exabytes (EB). By 2020, it is forecast to grow to 40,000 EB, effectively doubling every two years. One exabyte (EB) equals a thousand petabytes (PB), or a million terabytes (TB), or a billion gigabytes (GB). So according to IDC and EMC, the digital universe will amount to over 5,200 GB per person on the planet by 2020.
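The per-person figure follows from simple arithmetic; here is a rough check in Python, assuming a world population of roughly 7.6 billion in 2020 (the population figure is an assumption, not part of the IDC/EMC report):

    # Rough arithmetic behind the per-person estimate.
    digital_universe_eb = 40_000   # forecast size of the digital universe in EB
    gb_per_eb = 1_000_000_000      # 1 EB = a billion GB
    world_population = 7.6e9       # assumed 2020 world population
    per_person_gb = digital_universe_eb * gb_per_eb / world_population
    print(f"About {per_person_gb:,.0f} GB per person")  # roughly 5,300 GB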

SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP and SAP partners such as Group Basis (www.groupbasis.com). HANA’s architecture is designed to handle both high transaction rates and complex query processing on the same platform. SAP HANA has transformed the database industry by combining database, data processing, and application platform capabilities in a single in-memory platform. The platform also provides libraries for predictive, planning, text processing, spatial, and business analytics, all on the same architecture.
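To illustrate what “column-oriented” means in practice, here is a minimal sketch in Python rather than actual HANA code: the same small table laid out row-wise and column-wise, where the columnar layout lets an aggregate read only the single column it needs.

    # Illustrative only: one table stored row-wise and column-wise.
    # A column store keeps the values of each column together, so an
    # aggregate such as a sum over 'amount' scans a single array.
    rows = [
        {"id": 1, "region": "EMEA", "amount": 120.0},
        {"id": 2, "region": "APAC", "amount": 75.5},
        {"id": 3, "region": "EMEA", "amount": 210.0},
    ]

    columns = {
        "id":     [1, 2, 3],
        "region": ["EMEA", "APAC", "EMEA"],
        "amount": [120.0, 75.5, 210.0],
    }

    # Row store: every row is visited even though only 'amount' is needed.
    total_row_store = sum(r["amount"] for r in rows)

    # Column store: only the 'amount' column is read.
    total_column_store = sum(columns["amount"])

    assert total_row_store == total_column_store
    print(total_row_store)  # 405.5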

One thing is for sure: it’s going to be big.