By Fernando Doglio

What is the “big data problem”?

 

“On the night of July 9, 1958 an earthquake along the Fairweather Fault in the Alaska Panhandle loosened about 40 million cubic yards (30.6 million cubic meters) of rock high above the northeastern shore of Lituya Bay. This mass of rock plunged from an altitude of approximately 3000 feet (914 meters) down into the waters of Gilbert Inlet (see map below). The impact generated a local tsunami that crashed against the southwest shoreline of Gilbert Inlet. The wave hit with such power that it swept completely over the spur of land that separates Gilbert Inlet from the main body of Lituya Bay. The wave then continued down the entire length of Lituya Bay, over La Chaussee Spit and into the Gulf of Alaska. The force of the wave removed all trees and vegetation from elevations as high as 1720 feet (524 meters) above sea level. Millions of trees were uprooted and swept away by the wave. This is the highest wave that has ever been known.“ (quoted from http://geology.com/records/biggest-tsunami.shtml)

Now let’s use our imagination a bit and pretend we’re in a digital world, and that an even bigger wave can be seen on the horizon, only this one is made up of 1s and 0s. That’s the current state of information on the net right now.

A huge wave of data is being generated every second, ranging from user-generated information such as tweets, status updates, uploaded pictures, blog posts, comments, text messages, e-mails and so on, to machine-generated data like server access logs, error logs, transaction logs, etc.

And that’s not even the problem. The problem is that we need to start thinking in terms of TB or even PB of information, and billions of rows instead of millions of them, in order to handle this big wave that’s coming.

Normally, when you have to store information in your application, you ask yourself one basic question:

What do I need this information for?

And from the answer you get, you plan your storage and you start saving that specific information.

Let’s look at an example, from two different perspectives:

Traditional way of thinking:

Say, for example, you’re a web development company and you’re asked to create a basic web analytics app for your company’s site. So you ask yourself:

What do I need the information for?

As an answer, you might get something like:

  • To get the number of visits to each page.
  • To get a list of referrer sites.
  • To get the number of unique visits.
  • To get a list of web browsers used on the site.


It’s a short list, I know, but this is a basic example.

Back to the problem:

You have your answer: all that information can be fetched from the server’s access log, so you configure your log files to store that information. Great! You’re done!

Yes, you’re done: your system is ready and it shows the information you were asked to show, but you also closed the door on other potential analytics that could come out of the information stored in those access logs (like the request method used, the response code given, the size of the object returned and so on) and out of other sources of information.

Thinking in “big data” terms:

Thinking in “big data” terms means (at least to me) saving all the information you’re working with on your project, and then finding new and exciting ways to interpret that information and get results out of it.

Back to the problem, with the “big data” way of thinking this time:

This time around, you think in “big data” terms, so you already have lots of data being saved for every visit, such as:

  • Access log information.
  • Error log information.
  • User input (if there is any).
  • User behavior data (such as clicking patterns and similar).
  • And so on.


That’s because when you created your website, you asked yourself a different question:

What is all the information I can get from my website?

And since you changed your question, you significantly changed the answer to your problem.

You now have a vast amount of information to analyze and get insight from.

This is great, but where do we store all this log information? It could easily become too much for a single machine, and we don’t want to lose any information by rotating logs or using other such techniques.

So another valid question would be:

What kind of hardware do I need to store and process all that information in a timely manner?

What kind of hardware do we need then?

We need some kind of setup that will allow us to:

  • Store vast amounts of data
  • Process this data in a timely manner
  • Be able to grow as much as we want (storage- and processing-power-wise)
  • Be fault tolerant (storage- and processing-power-wise)
  • Be affordable


That is a lot to ask of a single computer (especially if we consider the last point), isn’t it? So the answer will probably come in the form of a distributed system.

Enter Hadoop

What is Hadoop? In a nutshell, Hadoop is a solution to our problems, one of many (mind you), but a pretty powerful one at that.

In more detail, Hadoop is an open source Apache project dedicated to solving two major problems related to big data:

  1. Where to store all of the information?
  2. How to process that information at an affordable cost and in a reasonable amount of time?


To answer these questions, Hadoop provides the following solutions:

HDFS

This is the Hadoop Distributed File System; it allows us to store all the information we need in a reliable way.
It works by interconnecting commodity machines (affordable) and using the resulting shared storage (store vast amounts of data).
HDFS takes whatever we throw at it, splits the files into evenly sized blocks of data and then spreads them throughout the cluster. At this stage, it also replicates those blocks, providing data redundancy and fault tolerance.

Thanks to HDFS we can have as much storage capacity as we need, simply by adding new machines to the cluster (be able to grow as much as we want).
We also gain a very important asset: fault tolerance. Since we’re replicating the information across several nodes of the cluster, our commodity machines are free to fail, and the only thing that will suffer is performance (no data loss or incomplete information).
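
To make this a bit more concrete, here is a minimal sketch (in Java, using Hadoop’s FileSystem API) of how you might push one of those access logs into HDFS. The NameNode address, the paths and the replication factor are assumptions made for the example; adjust them to your own cluster.

    // Minimal sketch: copy a local access log into HDFS.
    // The NameNode address, paths and replication factor below are example
    // values for this article, not anything your cluster necessarily uses.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadLogs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
            conf.set("dfs.replication", "3");                 // keep three copies of every block

            FileSystem fs = FileSystem.get(conf);
            // HDFS splits the file into blocks and replicates them across the cluster for us.
            fs.copyFromLocalFile(new Path("/var/log/apache2/access.log"),
                                 new Path("/logs/access.log"));
            fs.close();
        }
    }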

MapReduce

This is the other “leg” of Hadoop, an implementation of the MapReduce algorithm proposed by Google in 2004.

The MapReduce algorithm allows us to process large amounts of information (terabytes of it) in a distributed (thousands of nodes) and fault-tolerant manner (process this data in a timely manner). And if we consider that we already have a cluster of computers working for us with HDFS, MapReduce is the perfect match to take advantage of the computational power sitting there on every node of the cluster.

This algorithm has two basic steps:

  1. Map step: In this stage, the input data is split into smaller chunks to be analyzed and transformed by processes called “mappers”. Thanks to the integration with HDFS, the master node will schedule map tasks on the nodes that already hold the data they need, allowing the system to use very little bandwidth. The output of these map tasks is a set of (key, value) tuples.
  2. Reduce step: The output of the mappers is sent to the reduce tasks. These tasks process the information with the added benefit of knowing that their input will be handed to them sorted by the system. Their main purpose is to aggregate the information given by the mappers and output only what is needed.


There is an implicit step between 1 and 2: the shuffle & sort step, done automatically by the system. In this step, the system sorts the output of all mapper nodes by the key of each tuple and sends these sorted results to the reduce nodes, ensuring that all tuples with the same key end up in the same reducer.
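
To ground those steps, here is a minimal sketch of a mapper and a reducer written against the Hadoop Java API, applied to our analytics example: counting visits per page out of the access logs stored in HDFS. The log layout (the requested URL as the seventh space-separated field, as in a typical Apache access log) and the class names are assumptions made for illustration.

    // Each class would live in its own .java file in a real project.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: emit one (page, 1) tuple per access log line.
    public class PageVisitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            if (fields.length > 6) {
                // Assumption: field 6 holds the requested URL (typical Apache log layout).
                context.write(new Text(fields[6]), ONE);
            }
        }
    }

    // Reduce step: the shuffle & sort has already grouped the tuples by page,
    // so all that is left is adding up the 1's for each key.
    public class PageVisitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text page, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(page, new IntWritable(total));
        }
    }

The mapper never sees the whole data set and the reducer never sees unsorted input; that division of labor is what lets the same two small classes scale from one node to thousands.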

Thinking the Hadoop way

So, we have our data, we have our questions to ask of that data, we have our needs and we have our solution. What now?

Your next steps could include:

  1. Installing and configuring your Hadoop cluster: For this step, the company Cloudera has a standardized distribution of Hadoop, which they call Cloudera’s Distribution Including Apache Hadoop (CDH). You can download it for free and it comes with several other projects from the Hadoop ecosystem (such as Pig, Hive, HBase, and so on). And for managing your cluster, you could use their Cloudera Manager, which allows you to manage up to 50 nodes for free.
  2. Upload the information to your HDFS.
  3. Transform your information using a MapReduce job: I consider this step to be optional. I would use a “hand-written” MapReduce job if I had to transform my data set in a specific way in order to query it later on (a possible driver for the job sketched earlier is shown right after this list).
  4. Query your data set: There are several ways to do this. Tools like Pig or Hive let you write MapReduce jobs (for data transformation) in a higher-level language (Pig Latin or SQL). Others, like HBase and Cassandra, work better for quick queries to that data; HBase in particular works directly on top of HDFS, skipping the MapReduce framework, but you’re a bit more limited in what you can do with the information.
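
Picking up the forward reference from step 3, here is a possible driver for the mapper and reducer sketched in the MapReduce section: it wires them into a job, points it at the logs already sitting in HDFS and writes the page counts back into HDFS. The class names and paths are the hypothetical ones used earlier, not anything Hadoop imposes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageVisitJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "page visit counts");
            job.setJarByClass(PageVisitJob.class);
            job.setMapperClass(PageVisitMapper.class);   // the map step sketched earlier
            job.setReducerClass(PageVisitReducer.class); // the reduce step sketched earlier
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input: the access logs uploaded to HDFS; output: a new HDFS directory.
            FileInputFormat.addInputPath(job, new Path("/logs"));
            FileOutputFormat.setOutputPath(job, new Path("/analytics/page-visits"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }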

 

And finally, a pretty common question:

Is Hadoop the best solution for big data analysis out there?

Probably not, since “the best” is always relative to your needs, but it’s a pretty darn good one, so give it a try.

Besides, all the cool kids are doing it:

  • Facebook – 15 PB of information last time they revealed the number.
  • eBay – 5.3 PB of information on their clusters.
  • LinkedIn
  • Twitter


And many others; check out the complete list here.