Having introduced ourselves to Spark in the previous post (check here), let's look at the Spark architecture to get a deeper understanding of how Spark works.
Simply put, Spark is a distributed computing platform: we write our programs so that they can run on a cluster of machines.
Now let us see: how do we execute programs (jobs) on a Spark cluster?
There are 2 ways to do this!
- Interactive client (a Spark shell such as the Scala shell, the PySpark shell, or a notebook)
- Best suited for exploration and experimentation.
- Submit operation (submitting jobs via spark-submit)
- Full-fledged programs and projects that need to run in production submit their jobs using the spark-submit utility provided by Spark, as sketched below.
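Here is a rough sketch of what each path looks like in practice. The application name, object name, and input path below are just placeholders, not anything prescribed by Spark. In spark-shell, a SparkSession named `spark` is already created for you, so you would only type the body of `main`; a packaged application builds the session itself.

```scala
// Minimal sketch of a standalone Spark application (names and paths are illustrative).
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // In a packaged app we create the session ourselves; spark-shell does this for us.
    val spark = SparkSession.builder()
      .appName("word-count-example")
      .getOrCreate()

    // A simple job: count words in a text file (the path is a placeholder).
    val counts = spark.read.textFile("hdfs:///tmp/input.txt")
      .selectExpr("explode(split(value, ' ')) as word")
      .groupBy("word")
      .count()

    counts.show()
    spark.stop()
  }
}
```

To run the packaged version on a cluster you would build a jar and hand it to spark-submit, e.g. `spark-submit --class WordCount word-count.jar` (again, the names are placeholders).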
Spark runs the submitted jobs using a master-slave architecture: every Spark application has one master process and multiple slave processes.
In Spark, the master is called the Driver and the slaves are called the Executors.
The Driver is responsible for analyzing, monitoring, distributing, and scheduling the submitted work across the machines in the cluster. It also maintains all the information needed during the lifetime of the application.
The Executors, on the other hand, run the tasks they receive from the driver and report the results back to it. The short sketch below illustrates this split.
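To make the driver/executor split concrete, here is a minimal, illustrative sketch (the job and the numbers are arbitrary): the transformation is merely recorded by the driver, and only the action at the end causes tasks to run on the executors.

```scala
import org.apache.spark.sql.SparkSession

object DriverVsExecutors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-vs-executors").getOrCreate()
    import spark.implicits._

    // Driver side: this only builds a plan; nothing runs yet (transformations are lazy).
    val squares = spark.range(0, 1000000).map(x => x * x)

    // The action below makes the driver break the job into tasks (one per partition);
    // the executors run those tasks and send their partial results back to the driver.
    val total = squares.reduce(_ + _)

    println(s"Sum of squares: $total") // printed on the driver machine
    spark.stop()
  }
}
```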
There are two ways this can happen, depending on where the driver process runs:
- Client mode: the driver process runs on the client machine. This is good for debugging and learning purposes. Spark-shell runs in this mode (the shell process itself is the driver): when you launch spark-shell, the driver starts on that very machine, but when you process something, the executors are launched on the cluster nodes.
- Cluster mode: the driver process and the executor processes are all launched on nodes of the cluster. This is what is used in production environments.
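If you ever want to confirm which mode an application is actually running in, Spark records the deploy mode in its configuration under the `spark.submit.deployMode` key. A minimal sketch, assuming you are inside spark-shell where the `spark` session already exists:

```scala
// Sketch: check where the driver is running (assumes an existing `spark` session,
// e.g. inside spark-shell; falls back to "client" if the key is not set).
val deployMode = spark.conf.get("spark.submit.deployMode", "client")
println(s"Deploy mode: $deployMode") // prints "client" or "cluster"
```

When you submit through spark-submit, the mode is selected with the `--deploy-mode client` or `--deploy-mode cluster` flag.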
Before going any further into how the driver and the cluster interact, and how they are fed with resources, we need to answer one important question:
Who controls the Cluster?
The resource manager (or cluster manager). Currently, Spark supports 4 different kinds of resource managers:
- Apache YARN
- Apache Mesos
- Kubernetes
- Standalone
Apache YARN is the cluster manager of Hadoop and one of the most widely used cluster resource managers in the industry. I will discuss how YARN works, along with its architecture, in detail in a different post.
Apache Mesos is another resource manager option that can be used to manage the cluster.
Kubernetes is a general-purpose container orchestration platform by Google, and it is currently getting very popular.
The Standalone resource manager is built into Apache Spark, which makes setting up a cluster easy. It is mainly used for testing and learning purposes.
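The choice of resource manager shows up in the master URL that the application (or spark-submit, via `--master`) is given. A rough sketch of the common URL forms, with placeholder host names and ports:

```scala
import org.apache.spark.sql.SparkSession

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // The master URL tells Spark which resource manager to talk to.
    // Host names and ports below are placeholders.
    val spark = SparkSession.builder()
      .appName("master-url-example")
      .master("yarn")                                // Apache YARN
      // .master("mesos://mesos-master:5050")        // Apache Mesos
      // .master("k8s://https://k8s-apiserver:6443") // Kubernetes
      // .master("spark://spark-master:7077")        // Spark Standalone
      // .master("local[*]")                         // no cluster at all, local testing
      .getOrCreate()

    println(spark.sparkContext.master)
    spark.stop()
  }
}
```

In practice the master is usually passed to spark-submit with `--master` rather than hard-coded, and the YARN and Kubernetes options additionally need the right client configuration (e.g. HADOOP_CONF_DIR for YARN) on the submitting machine.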
Now, let's get back to how the driver and the cluster interact and how they are fed with resources.
As discussed earlier, there are two ways a job can be submitted to the cluster: first in client mode, where the driver runs on the client side (e.g. spark-shell), and second in cluster mode, where the driver is launched on a cluster node. In the second case, jobs are submitted from the client using Spark's spark-submit utility. More on spark-submit in a different post.
Submitting a job in client mode:
The following diagram shows the job submission architecture when spark-shell is used.
As you can see here, spark-shell runs on the client machine. A Spark session is launched in the shell; this is our Spark driver process.
The YARN resource manager (RM) runs on the cluster. When the Spark session receives a request to run a job, it contacts the RM. The YARN RM immediately launches a container and starts an Application Master (AM) in it. The AM receives the configuration and job details and, based on them, asks the RM for more resources (containers for the executors) to run the job.
The YARN RM launches the executor containers according to the request and availability, and reports them to the AM. The tasks are then distributed to the executors, which run them and report the results back to the driver process.
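How many executor containers the AM asks the RM for, and how big each one is, comes from the configuration supplied with the job. A hedged sketch of the relevant settings (the values are arbitrary examples, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

object ResourceRequestExample {
  def main(args: Array[String]): Unit = {
    // These settings drive the Application Master's resource requests to the YARN RM.
    // The values below are arbitrary examples.
    val spark = SparkSession.builder()
      .appName("resource-request-example")
      .config("spark.executor.instances", "4") // how many executor containers to request
      .config("spark.executor.memory", "2g")   // memory per executor container
      .config("spark.executor.cores", "2")     // CPU cores per executor
      .getOrCreate()

    // ... run the actual job here ...
    spark.stop()
  }
}
```

The same values can be passed on the spark-submit command line via `--num-executors`, `--executor-memory` and `--executor-cores`.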
Submitting a job in cluster mode:
The following diagram shows the job submission architecture when the driver process also runs on a cluster node.
You can clearly see the difference here. The job is submitted from the client machine via spark-submit to the YARN RM, and the AM process launches the Spark driver process inside its own container.
The rest of the process of executor creation and task processing is the same as explained above.
I hope this post clarifies the basic idea of how Spark works. In my next post, let's dig deeper and learn the internal mechanics of Spark.