As we have introduced ourselves to Spark in the previous post (check here), let us look at the Spark architecture to get a deeper understanding of how Spark works. Simply put, Spark is a distributed computing platform: the programs we write are executed on a cluster of machines.

Now, how do we execute programs (jobs) on a Spark cluster? There are two ways to do this:

1. Interactive client (a Spark shell such as the Scala shell, the PySpark shell, or notebooks) - best suited for exploration and experimental purposes.
2. Submit operation (submitting jobs via the submit utility) - all full-fledged programs and projects that need to run in production submit their jobs using the submit utility provided by Spark. A minimal code sketch of both modes is shown below.

Spark runs the submitted jobs using a master-slave architecture. Every Spark application has one master process and multiple slave processes. In Spark, the master is called the Driver, and the slaves are called the Executor processes. The Driver is responsible for ...
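To make this concrete, here is a minimal sketch of a tiny Spark application in Scala. It is illustrative only: the object name, app name and jar path are placeholders, not taken from this post. The same statements could be typed line by line into an interactive shell such as spark-shell (in which case the shell process acts as the Driver), or compiled into a jar and handed to spark-submit for a production run.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example application; names and paths are placeholders.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The Driver process starts here: it creates the SparkSession,
    // plans the work and schedules tasks on the Executors.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .getOrCreate()

    import spark.implicits._

    // A tiny in-memory dataset keeps the sketch self-contained;
    // a real job would normally read from distributed storage.
    val lines = Seq("spark runs on a cluster", "the driver schedules work").toDS()

    // The actual processing is carried out in parallel by the Executors.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.show()
    spark.stop()
  }
}

// Packaged and submitted to a cluster roughly like this (the jar name and
// master are placeholders; the flags are standard spark-submit options):
//   spark-submit --class WordCountApp --master yarn target/word-count.jar
```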
Having established the problems with traditional MapReduce and how Spark was born (you can check it out here if you have missed it), let us dive a little deeper into understanding the Spark project. Apache Spark is a fast, general-purpose engine for large-scale data processing; under the hood, it works on a cluster of computers. The ecosystem of Apache Spark can be summarized in the following image.

As you can see, the Spark ecosystem consists of two major constituents:

1. Spark Core
2. A set of libraries and APIs

Notice that Spark does not have any inbuilt cluster management system or distributed storage system. Spark depends on a third-party cluster management system; it supports a variety of cluster managers such as 'YARN', 'Mesos' and 'Kubernetes'. Similarly, it depends on third-party distributed storage systems such as 'HDFS', 'Amazon S3', 'Google Cloud Storage (GCS)' or the 'Cassandra File System'.
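Because the cluster manager and the storage layer are pluggable, they show up in an application only as configuration and as path schemes. The sketch below illustrates this with placeholder master URLs, bucket names and file paths, and it assumes the relevant Hadoop/cloud connector jars are on the classpath; it is not from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: master URLs, bucket names and paths are placeholders.
object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ecosystem-sketch")
      // The cluster manager is chosen via the master setting (or the
      // --master flag of spark-submit): e.g. "yarn", "k8s://<api-server>",
      // or "local[*]" for a single-machine run.
      .master("local[*]")
      .getOrCreate()

    // The distributed storage system is addressed through the scheme in the
    // path: hdfs:// for HDFS, s3a:// for Amazon S3, gs:// for GCS, and so on.
    val fromHdfs = spark.read.text("hdfs:///data/events.log")      // placeholder HDFS path
    val fromS3   = spark.read.text("s3a://some-bucket/events.log") // placeholder S3 bucket

    println(fromHdfs.count() + fromS3.count())
    spark.stop()
  }
}
```

Swapping YARN for Kubernetes, or HDFS for S3, changes only this configuration; the application code itself stays the same.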