With the advent of social media, microblogging sites, and apps such as Facebook, Instagram, and Twitter, along with the Internet of Things (IoT), there has been a massive explosion of data. The data produced in a single day is measured in terabytes to petabytes. All this data holds a lot of useful information, which needs to be harnessed, and that requires processing the data.
As you can imagine, processing this much data on a single computing system is a real pain, even if the computer is very powerful and has a large disk. The obvious solution is to split the data and process it on multiple computing devices, and this idea was the genesis of Hadoop. Here the data is split and spread across a cluster of machines connected in a network. The code that processes the data is copied to each machine that holds a piece of the data and is run on that machine over its local data. Finally, the processed results from all of these machines are consolidated in one place. In other words, we move the code to where the data resides instead of moving the data to the code that processes it.
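The split-process-consolidate idea above can be sketched in a few lines of plain Python. This is only a simulation, not Hadoop itself: the hypothetical `partitions` list stands in for data spread across machines, and the `process` function stands in for the code that is shipped to each machine.

```python
from collections import Counter

# Hypothetical data set, split into partitions as if spread across machines.
partitions = [
    ["spark is fast", "hadoop stores data"],
    ["hadoop splits data", "spark is general purpose"],
]

def process(lines):
    """The 'code moved to the data': count words in one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Each "machine" runs the same code on its local partition...
partial_results = [process(p) for p in partitions]

# ...and the partial results are consolidated in one place.
total = Counter()
for partial in partial_results:
    total.update(partial)

print(total["hadoop"])  # 2
print(total["spark"])   # 2
```

Notice that only the small `partial_results` travel over the network in this model, not the raw data itself, which is the whole point of moving computation to the data.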
Hadoop is built on this idea. Present-day Hadoop (Hadoop 2.0) provides three important utilities to tackle the problem of processing big data.

- Distributed storage system: HDFS
- Distributed computing framework: MapReduce
- Resource manager: YARN
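To make the MapReduce model concrete, here is a minimal sketch of its two phases in Python. This mimics the shape of a Hadoop Streaming job (mapper emits key-value pairs, the framework sorts them, the reducer aggregates per key); the input lines are made-up sample data, not anything from a real cluster.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key after the shuffle/sort."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

# Hypothetical input lines standing in for blocks stored in HDFS.
lines = ["hadoop moves code to data", "spark keeps data in memory"]

shuffled = sorted(mapper(lines))   # the framework's shuffle/sort phase
result = dict(reducer(shuffled))
print(result["data"])  # 2
```

In a real Hadoop job the map and reduce functions run on different machines and YARN schedules them next to the HDFS blocks; the sort between the two phases is what makes `groupby` valid here.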
Though this working model was highly appreciated across the industry, Hadoop had its cons and drew criticism from some computer scientists.
- MapReduce had a very steep learning curve and was hard to master.
- Knowledge of Java programming was a prerequisite.
- MapReduce had a lot of performance issues; it was slow.
We needed a framework that solved these issues without compromising the results of distributed data processing.
A compelling alternative, or even a replacement, for the MapReduce framework is "Apache Spark", which was born at UC Berkeley and came to be known for its lightning-fast cluster computing.
As defined on the Apache website, Spark is a fast and general-purpose engine for large-scale data processing.