Having established the problems with traditional MapReduce and how Spark was born (you can check it out here if you missed it), let us dive a little deeper into the Spark project.
Apache Spark is a fast, general-purpose engine for large-scale data processing; under the hood it runs on a cluster of computers.
The ecosystem of Apache Spark can be summarized in the following image:
As you can see, the Spark ecosystem consists of two major constituents:
- Spark Core
- Set of libraries and APIs.
Notice that Spark does not have any built-in cluster management system or distributed storage system. Spark depends on a third-party cluster manager and supports a variety of them, such as YARN, Mesos, and Kubernetes. Similarly, it depends on third-party distributed storage systems, such as HDFS, Amazon S3, Google Cloud Storage (GCS), or the Cassandra File System.
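To make this concrete, here is a minimal PySpark sketch (my own illustration, not from the original post) of how the cluster manager and the storage system are both plugged in from outside Spark; the master URL and the input paths are hypothetical placeholders, and the HDFS/S3 connectors are assumed to be available.

```python
from pyspark.sql import SparkSession

# The cluster manager is chosen by the master URL ("yarn", "k8s://...",
# "mesos://...", or "local[*]" for a single machine); the storage system is
# chosen simply by the URI scheme of the path you read from.
spark = (
    SparkSession.builder
    .appName("external-cluster-and-storage")
    .master("local[*]")  # swap for "yarn" or "k8s://..." on a real cluster
    .getOrCreate()
)

# Hypothetical paths: HDFS and S3 both work, provided the connectors are installed.
df_hdfs = spark.read.csv("hdfs:///data/events.csv", header=True)
df_s3 = spark.read.parquet("s3a://my-bucket/events/")
```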
Spark Core
Spark Core in turn has two major parts: 1. the Spark Core APIs, and 2. a compute engine.
- Spark's speciality lies in its compute engine, which takes care of memory management, task scheduling, interaction with the cluster manager, and fault recovery.
- The Spark Core APIs consist of two sets of APIs, available in a variety of languages such as Scala, Java, R, and Python.
- Structured APIs consist of DataFrames and Datasets. They deal with structured data.
- Unstructured APIs are lower-level APIs, which include RDDs, accumulators, and broadcast variables (a short sketch contrasting the two API levels follows below).
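As a rough illustration (an assumed sketch, not code from the original post), here is the same word count expressed once with the lower-level RDD API and once with the structured DataFrame API; the local SparkSession, sample lines, and column name are all made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-vs-unstructured").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = ["spark core has two parts", "spark core apis and a compute engine"]

# Unstructured (lower-level) API: a word count built from explicit RDD transformations.
rdd_counts = (
    sc.parallelize(lines)
      .flatMap(lambda line: line.split(" "))
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.collect())

# Structured API: the same computation expressed over a DataFrame.
df = spark.createDataFrame([(l,) for l in lines], ["line"])
df.select(explode(split("line", " ")).alias("word")).groupBy("word").count().show()
```

Both versions produce the same counts; the structured version describes *what* to compute, leaving Spark more room to optimize the execution.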
Set of libraries and APIs
Spark supplies multiple library packages, such as:
- Spark SQL, which allows you to run SQL queries over structured data.
- Spark Streaming, which provides mechanisms for dealing with real-time streams of data.
- MLlib, which provides libraries for machine learning.
- GraphX, which provides libraries for graph data processing.
All these packages provide APIs, algorithms, and DSLs which use Spark's compute engine to achieve distributed processing.
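For instance, here is a small Spark SQL sketch (assumed for illustration, not from the post): a DataFrame is registered as a temporary view and queried with plain SQL; the table, column names, and sample rows are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").master("local[*]").getOrCreate()

# Hypothetical data; in practice this would be read from HDFS, S3, etc.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```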
Why has Spark gained so much attention?
Apache Spark attracted the big data community for the following reasons:
- Abstraction over parallel programming: programmers deal with high-level SQL or collection-like APIs and do not have to worry about the internal parallel programming constructs.
- Unified platform: Spark provides support for batch processing, structured data processing (SQL-like data), streaming and real-time data processing, machine learning, and graph data processing, all under one roof.
- Ease of use: compared to MapReduce, Spark programming is easier to learn and understand. The Spark community has a lot of tools to work with and keeps adding new ones with every release.