{"id":1857,"date":"2021-11-08T12:49:22","date_gmt":"2021-11-08T16:49:22","guid":{"rendered":"http:\/\/sys4seq.com\/?page_id=1857"},"modified":"2022-06-08T11:37:07","modified_gmt":"2022-06-08T15:37:07","slug":"big-data","status":"publish","type":"page","link":"https:\/\/sys4seq.com\/index.php\/big-data\/","title":{"rendered":"Big Data"},"content":{"rendered":"<div id=\"cherry-posts-list-1\" class=\"cherry-posts-list template-default  \"><div class=\"cherry-posts-item post-item item-0 odd\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2021\/12\/21\/building-a-real-time-big-data-pipeline-9-spark-mllib-regression-python\/\" title=\"Building a real-time big data pipeline 9: Spark MLlib, Regression, Python\">Building a real-time big data pipeline 9: Spark MLlib, Regression, Python<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2021-12-21T16:36:25-04:00\">December 21, 2021<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark expresses parallelism through three sets of APIs &#8211; DataFrames, Datasets and RDDs (Resilient Distributed Datasets). Originally, Spark was designed to read and write data from and to the Hadoop Distributed File System (HDFS). 
A Hadoop cluster is composed of a network of master, worker and client nodes that orchestrate and execute the various jobs across<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2021\/12\/21\/building-a-real-time-big-data-pipeline-9-spark-mllib-regression-python\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/machine-learning\/\">Machine learning<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/python\/\">Python<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-mllib\/\">Spark MLlib<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-1 even\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2021\/01\/19\/building-a-real-time-big-data-pipeline-10-spark-streaming-kafka-java\/\" title=\"Building a real-time big data pipeline 10: Spark Streaming, Kafka, Java\">Building a real-time big data pipeline 10: Spark Streaming, Kafka, Java<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2021-01-19T16:43:22-04:00\">January 19, 2021<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Spark Streaming is an extension of the core Apache Spark platform that enables scalable, high-throughput, fault-tolerant processing of data streams. It is written in Scala and offers Scala, Java, R and Python APIs. It takes data from sources such as Kafka, Flume, Kinesis, HDFS, S3 or Twitter. 
This data can be further processed using<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2021\/01\/19\/building-a-real-time-big-data-pipeline-10-spark-streaming-kafka-java\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/java\/\">Java<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/kafka\/\">Kafka<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-streaming\/\">Spark Streaming<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-2 odd\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/10\/04\/building-a-real-time-big-data-pipeline-8-spark-mllib-regression-r\/\" title=\"Building a real-time big data pipeline 8: Spark MLlib, Regression, R\">Building a real-time big data pipeline 8: Spark MLlib, Regression, R<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-10-04T17:45:20-04:00\">October 4, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark MLlib\u00a0is a distributed framework that provides many utilities useful for machine learning tasks, such as classification, regression, clustering, dimensionality reduction, linear algebra, statistics and data handling. R is a popular statistical programming language with a number of packages that support data processing and machine learning tasks. 
To address R\u2019s scalability issue, the<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/10\/04\/building-a-real-time-big-data-pipeline-8-spark-mllib-regression-r\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/machine-learning\/\">Machine learning<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/r\/\">R<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-mllib\/\">Spark MLlib<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-3 even\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/24\/building-a-real-time-big-data-pipeline-7-spark-mllib-regression-java\/\" title=\"Building a real-time big data pipeline 7 : Spark MLlib, Regression, Java\">Building a real-time big data pipeline 7 : Spark MLlib, Regression, Java<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-08-24T17:21:49-04:00\">August 24, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark MLlib is a distributed framework that provides many utilities useful for machine learning tasks, such as classification, regression, clustering, dimensionality reduction, linear algebra, statistics and data handling. 
&gt;&gt;&gt; &nbsp;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/24\/building-a-real-time-big-data-pipeline-7-spark-mllib-regression-java\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/java\/\">Java<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/regression\/\">Regression<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-mllib\/\">Spark MLlib<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-4 odd\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/18\/building-a-real-time-big-data-pipeline-6-spark-core-hadoop-sbt\/\" title=\"Building a real-time big data pipeline 6: Spark Core, Hadoop, SBT\">Building a real-time big data pipeline 6: Spark Core, Hadoop, SBT<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-08-18T11:52:32-04:00\">August 18, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark\u00a0is an open-source cluster computing system that provides high-level APIs in Java, Scala, Python and R. Spark is also packaged with higher-level libraries for SQL, machine learning (MLlib), streaming, and graphs (GraphX). 
&gt;&gt;&gt; &nbsp;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/18\/building-a-real-time-big-data-pipeline-6-spark-core-hadoop-sbt\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/hadoop\/\">Hadoop<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/sbt\/\">SBT<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-core\/\">Spark Core<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-5 even\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/08\/building-a-real-time-big-data-pipeline-5-nosql-java\/\" title=\"Building a real-time big data pipeline 5 : NoSQL, Java\">Building a real-time big data pipeline 5 : NoSQL, Java<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-08-08T10:53:38-04:00\">August 8, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Cassandra is a distributed NoSQL database (DB) used for handling big data and real-time web applications. NoSQL stands for \u201cNot Only SQL\u201d or \u201cNot SQL\u201d. A NoSQL database is a non-relational data management system that does not require a fixed schema. 
&gt;&gt;&gt;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/08\/08\/building-a-real-time-big-data-pipeline-5-nosql-java\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/java\/\">Java<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/nosql\/\">NoSQL<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-6 odd\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/07\/04\/building-a-real-time-big-data-pipeline-4-spark-streaming-kafka-scala\/\" title=\"Building a real-time big data pipeline 4 : Spark Streaming, Kafka, Scala\">Building a real-time big data pipeline 4 : Spark Streaming, Kafka, Scala<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-07-04T06:44:08-04:00\">July 4, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Kafka is a scalable, high-performance, low-latency platform for handling real-time data feeds. Kafka, written in Scala and Java, allows reading and writing streams of data like a messaging system. Kafka requires Apache ZooKeeper to run. Kafka v2.5.0 (Scala v2.12 build) and ZooKeeper (v3.4.13) were installed using Docker. 
&gt;&gt;&gt;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/07\/04\/building-a-real-time-big-data-pipeline-4-spark-streaming-kafka-scala\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/kafka\/\">Kafka<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/scala\/\">Scala<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-streaming\/\">Spark Streaming<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-7 even\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/06\/22\/building-a-real-time-big-data-pipeline-3-spark-sql-hadoop-scala\/\" title=\"Building a real-time big data pipeline 3 : Spark SQL, Hadoop, Scala\">Building a real-time big data pipeline 3 : Spark SQL, Hadoop, Scala<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-06-22T06:40:41-04:00\">June 22, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark is an open-source cluster computing system that provides high-level APIs in Java, Scala, Python and R. Spark is also packaged with higher-level libraries for SQL, machine learning, streaming, and graphs. Spark SQL is Spark\u2019s package for working with structured data. 
&gt;&gt;&gt;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/06\/22\/building-a-real-time-big-data-pipeline-3-spark-sql-hadoop-scala\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/hadoop\/\">Hadoop<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/scala\/\">Scala<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-sql\/\">Spark SQL<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-8 odd\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/05\/07\/building-a-real-time-big-data-pipeline-2-spark-core-hadoop-scala\/\" title=\"Building a real-time big data pipeline 2 : Spark Core, Hadoop, Scala\">Building a real-time big data pipeline 2 : Spark Core, Hadoop, Scala<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-05-07T06:35:39-04:00\">May 7, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span>  <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Apache Spark is a general-purpose, in-memory cluster computing engine for large-scale data processing. Spark can also work with Hadoop and its modules. Its real-time data processing capability makes Spark a top choice for big data analytics. Spark Core has two parts: 1) a computing engine and 2) the Spark Core APIs. 
&gt;&gt;&gt;<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/05\/07\/building-a-real-time-big-data-pipeline-2-spark-core-hadoop-scala\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/hadoop\/\">Hadoop<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/scala\/\">Scala<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/spark-core\/\">Spark Core<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><div class=\"cherry-posts-item post-item item-9 even\"><div class=\"inner cherry-clearfix\"><figure class=\"post-thumbnail\"><\/figure>\n<h4 class=\"post-title\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/04\/05\/big_data_pipeline_1\/\" title=\"Building a real-time big data pipeline 1 : Kafka, RESTful, Java\">Building a real-time big data pipeline 1 : Kafka, RESTful, Java<\/a><\/h4>\n<div class=\"post-meta\">\n\tPosted on <time datetime=\"2020-04-05T10:17:53-04:00\">April 5, 2020<\/time> by <span class=\"post-author vcard\"><a href=\"https:\/\/sys4seq.com\/index.php\/author\/adinasarapu\/\" rel=\"author\">Ashok Dinasarapu<\/a><\/span> <span class=\"post-comments-link\"><a href=\"https:\/\/sys4seq.com\/index.php\/2020\/04\/05\/big_data_pipeline_1\/#comments\">1<\/a><\/span> <span class=\"post-tax post-tax-category\"><a href=\"https:\/\/sys4seq.com\/index.php\/category\/bigdata\/\">Bigdata<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/category\/software\/\">Software<\/a><\/span>\n<\/div>\n\n<div class=\"post-content part\">Building a real-time big data pipeline 1 : Kafka, RESTful, Java Updated on September 20, 2021 Apache Kafka is used for building real-time data pipelines and streaming apps. Kafka is a message broker, which helps transmit messages from one system to another. Zookeeper is required to run a Kafka Cluster. 
Apache ZooKeeper is primarily used<\/div>\n<a href=\"https:\/\/sys4seq.com\/index.php\/2020\/04\/05\/big_data_pipeline_1\/\" class=\"btn btn-default\">read more<\/a>\n<footer><span class=\"post-tax post-tax-post_tag\"><a href=\"https:\/\/sys4seq.com\/index.php\/tag\/java\/\">Java<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/kafka\/\">Kafka<\/a> <a href=\"https:\/\/sys4seq.com\/index.php\/tag\/restful\/\">RESTful<\/a><\/span><\/footer><\/div><\/div><!--\/.cherry-posts-item--><\/div><!--\/.cherry-posts-list-->\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"&nbsp;","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"_links":{"self":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/pages\/1857"}],"collection":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/comments?post=1857"}],"version-history":[{"count":9,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/pages\/1857\/revisions"}],"predecessor-version":[{"id":2024,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/pages\/1857\/revisions\/2024"}],"wp:attachment":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/media?parent=1857"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}