As we have introduced ourselves to Spark in the previous post (check here), let us look at the Spark architecture to get a deeper understanding of how Spark works. Simply put, Spark is a distributed computing platform: the programs we write are executed on a cluster of machines.

Now, how do we execute programs (jobs) on a Spark cluster? There are two ways to do this:

1. Interactive client (a Spark shell such as the Scala shell, the PySpark shell, or notebooks) - best suited for exploration and experimental purposes.
2. Submit operation (submitting jobs via the submit utility) - all full-fledged programs and projects that need to run in production submit their jobs using the submit utility provided by Spark. A minimal code sketch of both modes is shown below.

Spark runs the submitted jobs using a master-slave architecture. Every Spark application has one master process and multiple slave processes. In Spark, the master is called the Driver, and the slaves are called the Executor processes. The Driver is responsible for ...
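To make this concrete, here is a minimal sketch of a tiny Spark application in Scala. It is illustrative only: the object name, app name and jar path are placeholders, not taken from this post. The same statements could be typed line by line into an interactive shell such as spark-shell (in which case the shell process acts as the Driver), or compiled into a jar and handed to spark-submit for a production run.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example application; names and paths are placeholders.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The Driver process starts here: it creates the SparkSession,
    // plans the work and schedules tasks on the Executors.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .getOrCreate()

    import spark.implicits._

    // A tiny in-memory dataset keeps the sketch self-contained;
    // a real job would normally read from distributed storage.
    val lines = Seq("spark runs on a cluster", "the driver schedules work").toDS()

    // The actual processing is carried out in parallel by the Executors.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.show()
    spark.stop()
  }
}

// Packaged and submitted to a cluster roughly like this (the jar name and
// master are placeholders; the flags are standard spark-submit options):
//   spark-submit --class WordCountApp --master yarn target/word-count.jar
```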
Having established the problems with traditional MapReduce and how Spark was born (you can check it out here if you have missed it), let us dive a little deeper into understanding the Spark project. Apache Spark is a fast, general-purpose engine for large-scale data processing; under the hood, it works on a cluster of computers. The ecosystem of Apache Spark can be summarized in the following image.

As you can see, the Spark ecosystem consists of two major constituents:

1. Spark Core
2. A set of libraries and APIs

Notice that Spark does not have any inbuilt cluster management system or distributed storage system. Spark depends on a third-party cluster management system; it supports a variety of cluster managers such as 'YARN', 'Mesos' and 'Kubernetes'. Similarly, it depends on third-party distributed storage systems such as 'HDFS', 'Amazon S3', 'Google Cloud Storage (GCS)' or the 'Cassandra File System'.
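Because the cluster manager and the storage layer are pluggable, they show up in an application only as configuration and as path schemes. The sketch below illustrates this with placeholder master URLs, bucket names and file paths, and it assumes the relevant Hadoop/cloud connector jars are on the classpath; it is not from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: master URLs, bucket names and paths are placeholders.
object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ecosystem-sketch")
      // The cluster manager is chosen via the master setting (or the
      // --master flag of spark-submit): e.g. "yarn", "k8s://<api-server>",
      // or "local[*]" for a single-machine run.
      .master("local[*]")
      .getOrCreate()

    // The distributed storage system is addressed through the scheme in the
    // path: hdfs:// for HDFS, s3a:// for Amazon S3, gs:// for GCS, and so on.
    val fromHdfs = spark.read.text("hdfs:///data/events.log")      // placeholder HDFS path
    val fromS3   = spark.read.text("s3a://some-bucket/events.log") // placeholder S3 bucket

    println(fromHdfs.count() + fromS3.count())
    spark.stop()
  }
}
```

Swapping YARN for Kubernetes, or HDFS for S3, changes only this configuration; the application code itself stays the same.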