Spark Streaming is an extension of the core Apache Spark platform that enables scalable, high-throughput, fault-tolerant processing of live data streams. It is written in Scala and offers Scala, Java, Python, and R APIs. Data can be ingested from sources such as Kafka, Flume, Kinesis, HDFS, S3, or Twitter, and then processed using complex algorithms. The final, processed output can be pushed out to destinations such as HDFS, databases, and live dashboards. Spark Streaming also lets you apply machine learning algorithms to data streams for advanced processing. Spark uses Hadoop's client libraries for distributed storage (HDFS) and resource management (YARN).
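Conceptually, Spark Streaming discretizes the live stream into small micro-batches and runs a Spark job on each one, optionally keeping state across batches. The sketch below illustrates that micro-batch idea in plain Python (no Spark installation assumed); the `micro_batches` data and the `count_words` helper are hypothetical stand-ins for a real DStream and its transformations:

```python
from collections import Counter

def count_words(batch):
    # Analogous to flatMap(split) followed by reduceByKey(add)
    # on a single micro-batch of text lines.
    words = (word for line in batch for word in line.split())
    return Counter(words)

# A hypothetical stream, already discretized into micro-batches.
# Spark Streaming would do this slicing automatically at a fixed
# batch interval (e.g. every 1 second).
micro_batches = [
    ["spark streaming", "spark mllib"],
    ["kafka to spark"],
]

running_total = Counter()
for batch in micro_batches:
    batch_counts = count_words(batch)   # one small job per batch
    running_total.update(batch_counts)  # stateful aggregation across batches

print(dict(running_total))
# {'spark': 3, 'streaming': 1, 'mllib': 1, 'kafka': 1, 'to': 1}
```

In real Spark Streaming the same shape appears as a word count over a `DStream`, with the framework handling batching, distribution, and fault tolerance instead of the explicit loop above.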