Spark Streaming is an extension of the core Apache Spark platform that enables scalable, high-throughput, fault-tolerant processing of live data streams. It is written in Scala and offers Scala, Java, Python, and R APIs. Data can be ingested from sources such as Kafka, Flume, Kinesis, HDFS, S3, or Twitter, and then processed using complex algorithms. The final, processed output can be pushed out to destinations such as HDFS, databases, and live dashboards. Spark Streaming also lets you apply machine learning algorithms to data streams for advanced processing. Spark uses Hadoop's client libraries for distributed storage (HDFS) and resource management (YARN).
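Conceptually, Spark Streaming discretizes the live stream into small micro-batches and runs a Spark job on each one, optionally keeping state across batches. The sketch below illustrates that micro-batch idea in plain Python (no Spark installation assumed); the `micro_batches` data and the `count_words` helper are hypothetical stand-ins for a real DStream and its transformations:

```python
from collections import Counter

def count_words(batch):
    # Analogous to flatMap(split) followed by reduceByKey(add)
    # on a single micro-batch of text lines.
    words = (word for line in batch for word in line.split())
    return Counter(words)

# A hypothetical stream, already discretized into micro-batches.
# Spark Streaming would do this slicing automatically at a fixed
# batch interval (e.g. every 1 second).
micro_batches = [
    ["spark streaming", "spark mllib"],
    ["kafka to spark"],
]

running_total = Counter()
for batch in micro_batches:
    batch_counts = count_words(batch)   # one small job per batch
    running_total.update(batch_counts)  # stateful aggregation across batches

print(dict(running_total))
# {'spark': 3, 'streaming': 1, 'mllib': 1, 'kafka': 1, 'to': 1}
```

In real Spark Streaming the same shape appears as a word count over a `DStream`, with the framework handling batching, distribution, and fault tolerance instead of the explicit loop above.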