2020 - Genomics | Data Science

2020

Building a real-time big data pipeline 8: Spark MLlib, Regression, R

Apache Spark MLlib is a distributed framework that provides many utilities useful for machine learning tasks, such as: Classification, Regression, Clustering, Dimentionality reduction and, Linear algebra, statistics and data handling. R is a popular statistical programming language with a number of packages that support data processing and machine learning tasks. To address R’s scalability issue, the

Building a real-time big data pipeline 7 : Spark MLlib, Regression, Java

Building a real-time big data pipeline 6: Spark Core, Hadoop, SBT

Apache Spark is an open-source cluster computing system that provides high-level APIs in Java, Scala, Python and R. Spark also packaged with higher-level libraries for SQL, machine learning (MLlib), streaming, and graphs (GraphX). >>>

Building a real-time big data pipeline 5 : NoSQL, Java

Apache Cassandra is a distributed NoSQL database (DB) which is used for handling Big data and real-time web applications. NoSQL stands for “Not Only SQL” or “Not SQL”. NoSQL database is a non-relational data management system, that does not require a fixed schema. >>>

Building a real-time big data pipeline 4 : Spark Streaming, Kafka, Scala

Apache Kafka is a scalable, high performance and low latency platform for handling of real-time data feeds. Kafka allows reading and writing streams of data like a messaging system; written in Scala and Java.Kafka requires Apache Zookeeper to run. Kafka v2.5.0 (scala v2.12 build) and zookeeper (v3.4.13) were installed using docker. >>>