Bigdata - Genomics | Data Science

Building a real-time big data pipeline 5 : NoSQL, Java

Apache Cassandra is a distributed NoSQL database used for handling big data and real-time web applications. NoSQL stands for “Not Only SQL” or “Not SQL”. A NoSQL database is a non-relational data management system that does not require a fixed schema. >>>

Building a real-time big data pipeline 4 : Spark Streaming, Kafka, Scala

Apache Kafka is a scalable, high-performance, low-latency platform for handling real-time data feeds. Kafka, written in Scala and Java, allows reading and writing streams of data like a messaging system. Kafka requires Apache ZooKeeper to run. Kafka v2.5.0 (Scala v2.12 build) and ZooKeeper (v3.4.13) were installed using Docker. >>>

Building a real-time big data pipeline 3 : Spark SQL, Hadoop, Scala

Apache Spark is an open-source cluster computing system that provides high-level APIs in Java, Scala, Python, and R. Spark is also packaged with higher-level libraries for SQL, machine learning, streaming, and graphs. Spark SQL is Spark’s package for working with structured data. >>>

Building a real-time big data pipeline 2 : Spark Core, Hadoop, Scala

Apache Spark is a general-purpose, in-memory cluster computing engine for large-scale data processing. Spark can also work with Hadoop and its modules. Its real-time data processing capability makes Spark a top choice for big data analytics. Spark Core has two parts: 1) the computing engine and 2) the Spark Core APIs. >>>

Building a real-time big data pipeline 1 : Kafka, RESTful, Java

Updated on September 20, 2021. Apache Kafka is used for building real-time data pipelines and streaming apps. Kafka is a message broker that helps transmit messages from one system to another. ZooKeeper is required to run a Kafka cluster. Apache ZooKeeper is primarily used