Cluster and AWS


RNA-seq and proteomics




A real-time big data pipeline

Spatial gene expression data analysis on Cluster: 10X Genomics, Space Ranger

Running spaceranger in cluster mode, with Sun Grid Engine (SGE) as the job queuing system. Spatial RNA-seq data is analyzed in two steps. Step 1: spaceranger mkfastq demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files. Step 2: spaceranger count takes FASTQ files from spaceranger mkfastq and performs alignment, filtering, barcode counting,

Taxonomic and functional profiling of the microbiome – whole genome shotgun metagenomics

This workflow performs taxonomic and functional profiling of shotgun metagenomics sequencing (MGS) reads using MetaPhlAn2 and HUMAnN2, respectively. For taxonomic profiling of the MGS data (at the phylum, genus or species level), the MetaPhlAn2 pipeline was run on a high-performance multicore cluster computing environment.

Building a real-time big data pipeline 9: Spark MLlib, Regression, Python

Apache Spark expresses parallelism through three sets of APIs: DataFrames, Datasets and RDDs (Resilient Distributed Datasets). Originally, Spark was designed to read and write data from and to the Hadoop Distributed File System (HDFS). A Hadoop cluster is composed of a network of master, worker and client nodes that orchestrate and execute the various jobs across

Building a real-time big data pipeline 10: Spark Streaming, Kafka, Java

Spark Streaming is an extension of the core Apache Spark platform that enables scalable, high-throughput, fault-tolerant processing of data streams. It is written in Scala and offers Scala, Java, R and Python APIs. It takes data from sources such as Kafka, Flume, Kinesis, HDFS, S3 or Twitter. This data can be further processed using



Regenerative medicine


Rare diseases