Having established the problems with traditional MapReduce and how Spark was born (you can check it out here if you missed it), let us dive a little deeper into the Spark project.

Apache Spark is a fast, general-purpose engine for large-scale data processing; under the hood, it runs on a cluster of computers. The Spark ecosystem can be summarized in the following image.

As you can see, the Spark ecosystem consists of two major constituents:

1. Spark Core
2. A set of libraries and APIs

Notice that Spark does not have any built-in cluster management system or distributed storage system. Spark depends on a third-party cluster management system, and it supports a variety of cluster managers such as YARN, Mesos, and Kubernetes. Similarly, it depends on third-party distributed storage systems such as HDFS, Amazon S3, Google Cloud Storage (GCS), or the Cassandra File System.
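This pluggability shows up directly in application code. Here is a minimal sketch in Scala illustrating the idea: the master URL selects the cluster manager, and the URI scheme of a path selects the storage backend. The app name and file paths below are placeholders, not real datasets.

```scala
import org.apache.spark.sql.SparkSession

object SparkEcosystemDemo {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager Spark runs on:
    // "yarn", "mesos://...", "k8s://...", or "local[*]" for a
    // single-machine test run.
    val spark = SparkSession.builder()
      .appName("spark-ecosystem-demo") // placeholder app name
      .master("local[*]")              // swap for "yarn" when submitting to a YARN cluster
      .getOrCreate()

    // The storage layer is equally pluggable: the same read call works
    // against HDFS, Amazon S3, or GCS just by changing the URI scheme.
    // These paths are hypothetical examples.
    val lines = spark.read.textFile("hdfs:///data/sample.txt")
    // val lines = spark.read.textFile("s3a://my-bucket/sample.txt")
    // val lines = spark.read.textFile("gs://my-bucket/sample.txt")

    println(s"Line count: ${lines.count()}")
    spark.stop()
  }
}
```

In practice, the same compiled application can move from a laptop to a YARN or Kubernetes cluster without code changes, since the cluster manager and storage paths are typically supplied at submit time (for example, via spark-submit's --master flag) rather than hard-coded as above.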