Building a real-time big data pipeline 9: Spark MLlib, Regression, Python

Apache Spark expresses parallelism through three sets of APIs: DataFrames, Datasets, and RDDs (Resilient Distributed Datasets). Originally, Spark was designed to read and write data from and to the Hadoop Distributed File System (HDFS). A Hadoop cluster is composed of a network of master, worker, and client nodes that orchestrate and execute the various jobs across
read more
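
As a taste of what the post covers, here is a minimal, hedged PySpark sketch of a DataFrame-based MLlib linear regression. The toy data, column names, and app name are illustrative, not taken from the post.

    # Toy MLlib linear regression on the DataFrame API.
    # (The typed Dataset API exists only in Scala/Java; the RDD API
    # sits underneath both.)
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("mllib-regression-sketch").getOrCreate()

    # A schema-aware distributed table with illustrative data
    df = spark.createDataFrame(
        [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)],
        ["x", "y"],
    )

    # MLlib's DataFrame-based API expects all features packed into
    # a single vector column
    assembled = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

    model = LinearRegression(featuresCol="features", labelCol="y").fit(assembled)
    print(model.coefficients, model.intercept)

    spark.stop()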

Annotation of genetic variants

Tools such as ANNOVAR, Variant Effect Predictor (VEP), or SnpEff annotate the genetic variants (SNPs, INDELs, CNVs, etc.) present in a VCF file. These tools integrate the annotations into the INFO column of the original VCF file.
read more
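
Since the annotations land in the INFO column, here is a small, hedged Python sketch of how one might pull them back out. It assumes SnpEff's ANN key (VEP writes CSQ instead, and ANNOVAR uses its own keys), and the file name is hypothetical.

    def info_field(info, key):
        # Return the value of `key` from a semicolon-separated VCF INFO string
        for entry in info.split(";"):
            if entry.startswith(key + "="):
                return entry.split("=", 1)[1]
        return None

    with open("annotated.vcf") as vcf:   # hypothetical annotated output file
        for line in vcf:
            if line.startswith("#"):     # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], fields[1], fields[7]
            ann = info_field(info, "ANN")
            if ann:
                # SnpEff's ANN entries are pipe-delimited; the second field
                # is the predicted effect (e.g. missense_variant)
                effect = ann.split(",")[0].split("|")[1]
                print(chrom, pos, effect)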

Building a real-time big data pipeline 10: Spark Streaming, Kafka, Java

Spark Streaming is an extension of the core Apache Spark platform that enables scalable, high-throughput, fault-tolerant processing of data streams. It is written in Scala but offers Scala, Java, Python, and R APIs. It ingests data from sources such as Kafka, Flume, Kinesis, HDFS, S3, or Twitter. This data can be further processed using
read more
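
The post itself works in Java, but a hedged PySpark equivalent of the Kafka-to-Spark flow, written against Structured Streaming (the DataFrame-based successor to the DStream API), might look as follows. The broker address and topic name are made up, and the spark-sql-kafka connector package must be supplied at submit time.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Subscribe to a Kafka topic; each record arrives with binary
    # key/value columns
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
        .option("subscribe", "events")                        # hypothetical topic
        .load()
    )

    # "Further processing" stands in here for decoding the payload
    # and writing it to the console sink
    query = (
        stream.select(col("value").cast("string").alias("payload"))
        .writeStream.outputMode("append")
        .format("console")
        .start()
    )
    query.awaitTermination()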