
Big Data, Hadoop, and the birth of Spark

With the advent of social media, microblogging sites and apps such as Facebook, Instagram, and Twitter, and the Internet of Things (IoT), there has been a massive explosion of data. The data produced in a single day now runs into terabytes, even petabytes. All of this data holds a lot of useful information that needs to be extracted, and extracting it requires processing the data.

As you can imagine, processing this much data on a single computing system becomes a real pain, even if that computer is very powerful and has a large disk. The obvious solution was to split the data and process it on multiple computing devices, and this idea was the basis for the genesis of Hadoop. The data is split and spread across a cluster of machines connected over a network. The code that processes the data is copied to each machine that holds a piece of the data and run there against that local piece. Finally, the processed information from each of these machines is consolidated in one go. In other words, we move the code to where the data resides instead of moving the data to the code that processes it.
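To make the idea concrete, here is a minimal, illustrative sketch (not Hadoop itself) that mimics the split-process-consolidate pattern on a single machine: the input is partitioned, each worker process counts words in its own partition, and the partial counts are merged at the end. The input file name and the partition count are placeholders for this example.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(lines):
    """Process one partition locally: count the words in the lines this worker holds."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def split_into_partitions(lines, n):
    """Split the data into n roughly equal partitions, one per worker."""
    return [lines[i::n] for i in range(n)]

if __name__ == "__main__":
    # In Hadoop the data would already live on different machines (HDFS blocks);
    # here we simulate that by partitioning a local file across worker processes.
    with open("input.txt") as f:          # placeholder input file
        lines = f.readlines()

    partitions = split_into_partitions(lines, n=4)

    # "Send the code to the data": each worker runs count_words on its own partition.
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_counts = pool.map(count_words, partitions)

    # Consolidate the processed information from every worker in one go.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)

    print(total.most_common(10))
```

In a real cluster the partitions would sit on different machines and the network, scheduling, and failure handling would be the hard part; that is exactly what Hadoop takes care of.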

Hadoop is built on this idea. Present-day Hadoop (Hadoop 2.0) provides three important components to tackle the problem of processing big data:
  1. Distributed Storage system: HDFS
  2. Distributed computing framework: MapReduce
  3. Resource Manager: YARN

Though this working model was highly appreciated across the industry, Hadoop had its cons, and some computer scientists criticized it:
  • MapReduce had a very steep learning curve and was hard to master.
  • Knowledge of Java programming was a prerequisite.
  • MapReduce had a lot of performance issues and was simply slow.
We needed a framework that solved these issues without compromising the results of distributed data processing.

A compelling alternative to, or replacement for, the MapReduce framework is Apache Spark, which was born at UC Berkeley and came to be known for its lightning-fast cluster computing.
As defined on the Apache website, Spark is a fast and general-purpose engine for large-scale data processing.
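For a first taste of what that looks like in practice, here is a minimal word-count sketch using PySpark; the input path and application name are placeholders. The same split-process-consolidate pattern from earlier is expressed in a few lines, with Spark handling the distribution of work across the cluster.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session; the app name is just a label for this example.
spark = SparkSession.builder.appName("word-count-example").getOrCreate()
sc = spark.sparkContext

# Read the input as an RDD of lines; the path is a placeholder (local file, HDFS, S3, ...).
lines = sc.textFile("hdfs:///data/input.txt")

# Classic word count: split lines into words, pair each word with 1, sum per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Bring a small sample of results back to the driver and print them.
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```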
