Apache Spark exposes parallelism through three sets of APIs: RDDs (Resilient Distributed Datasets), DataFrames, and Datasets. Originally, Spark was designed to read and write data from and to the Hadoop Distributed File System (HDFS). A Hadoop cluster is composed of a network of master, worker, and client nodes that orchestrate and execute jobs against data stored in HDFS.
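A minimal sketch of the three APIs side by side in Java, assuming a local SparkSession and a small in-memory sample (the class name SparkApiDemo and the data are illustrative; a real job in this series would read its input from HDFS and run on a cluster):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkApiDemo {
    public static void main(String[] args) {
        // Local session for illustration only; a production job would be
        // submitted to a cluster instead of running with master "local[*]"
        SparkSession spark = SparkSession.builder()
                .appName("SparkApiDemo")
                .master("local[*]")
                .getOrCreate();

        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // RDD: the low-level distributed collection; transformations are plain functions
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> rdd = jsc.parallelize(numbers);
        System.out.println("RDD even count: " + rdd.filter(n -> n % 2 == 0).count());

        // Dataset: the typed API, optimized by Catalyst; requires an Encoder,
        // and the lambda is cast to disambiguate the Java/Scala overloads
        Dataset<Integer> ds = spark.createDataset(numbers, Encoders.INT());
        long above = ds.filter((FilterFunction<Integer>) n -> n > 2).count();
        System.out.println("Dataset count above 2: " + above);

        // DataFrame: a Dataset<Row> with named columns, queried declaratively
        Dataset<Row> df = ds.toDF("value");
        df.filter("value % 2 = 0").show();

        spark.stop();
    }
}
```

The RDD call keeps full control in user code, while the Dataset and DataFrame versions hand the filter expression to Spark's optimizer, which is why the higher-level APIs are usually preferred when the data has a schema.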