{"id":574,"date":"2020-04-05T10:17:53","date_gmt":"2020-04-05T14:17:53","guid":{"rendered":"http:\/\/sys4seq.com\/?p=22"},"modified":"2022-10-18T16:27:32","modified_gmt":"2022-10-18T20:27:32","slug":"big_data_pipeline_1","status":"publish","type":"post","link":"https:\/\/sys4seq.com\/index.php\/2020\/04\/05\/big_data_pipeline_1\/","title":{"rendered":"Building a real-time big data pipeline 1 : Kafka, RESTful, Java"},"content":{"rendered":"<h3>Building a real-time big data pipeline 1 : Kafka, RESTful, Java<\/h3>\n<p>Updated on September 20, 2021<\/p>\n<p>Apache <a href=\"https:\/\/kafka.apache.org\/\" target=\"_blank\" rel=\"noopener\">Kafka<\/a> is used for building real-time data pipelines and streaming apps. Kafka is a message broker, which helps transmit messages from one system to another. <a href=\"https:\/\/zookeeper.apache.org\/\" target=\"_blank\" rel=\"noopener\">Zookeeper<\/a> is required to run a Kafka Cluster. Apache ZooKeeper is primarily used to track the status of nodes in the Kafka Cluster and maintain a list of Kafka topics and messages. Starting with v2.8, Kafka can be run without ZooKeeper. However, this update isn\u2019t ready for use in production.<\/p>\n<p><strong>Kafka for local development of applications:<\/strong>.<\/p>\n<p>There are multiple ways of running Kafka locally for development of apps but the easiest method is by <code><em>docker-compose<\/em><\/code>. 
First, download Docker Desktop from <a href=\"https:\/\/hub.docker.com\/\" target=\"_blank\" rel=\"noopener\">Docker Hub<\/a> and sign in with your Docker ID.<\/p>\n<p><strong>Docker Compose<\/strong> is included as part of Docker Desktop. <em>docker-compose<\/em> facilitates installing Kafka and ZooKeeper with the help of a <code><em>docker-compose.yml<\/em><\/code> file.<\/p>\n<p>Create a file called <em>docker-compose.yml<\/em> in your project directory and paste the following configuration:<br \/>\n<img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/docker_compose-300x259.png\" alt=\"\" width=\"300\" height=\"259\" class=\"alignnone size-medium wp-image-2146\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/docker_compose-300x259.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/docker_compose-768x663.png 768w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/docker_compose.png 922w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>The above compose file defines two services: zookeeper and kafka. The zookeeper service uses a public ZooKeeper image and the kafka service uses a public Kafka image, both pulled from the Docker Hub registry.<\/p>\n<p><strong>1. Start the Kafka service<\/strong><br \/>\nOpen a terminal, go to the directory containing the <em>docker-compose.yml<\/em> file, and execute the following command, which downloads the images and runs the containers.  
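The compose file in the screenshot above can be sketched roughly as follows. The image names, ports, and environment variables here are assumptions based on commonly used public images, not a transcription of the screenshot:

```yaml
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper   # assumed public ZooKeeper image
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka       # assumed public Kafka image
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost      # so clients on the host can connect
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181    # broker registers with ZooKeeper
    depends_on:
      - zookeeper
```

`depends_on` makes Compose start the zookeeper container before the kafka container, since the broker needs ZooKeeper to register itself.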
<\/p>\n<blockquote><p><code><em>$docker compose up -d<\/em> <\/code><\/p><\/blockquote>\n<p>To list all running docker containers, run the following command:<\/p>\n<blockquote><p><code>$docker compose ps<\/code> <\/p><\/blockquote>\n<p>You can shut down docker-compose by executing the following command in another terminal.<\/p>\n<blockquote><p><code>$docker compose down<\/code><\/p><\/blockquote>\n<p>With the Kafka service running, check the ZooKeeper logs to verify that ZooKeeper is working and healthy.<\/p>\n<blockquote><p><code>$docker compose logs zookeeper | grep -i binding<\/code> <\/p><\/blockquote>\n<p>Next, check the Kafka logs to verify that the broker is working and healthy.<\/p>\n<blockquote><p><code>$docker compose logs kafka | grep -i started<\/code><\/p><\/blockquote>\n<p>Two fundamental concepts in Kafka are <em>Topics<\/em> and <em>Partitions<\/em>. Kafka topics are divided into a number of partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic. Each partition is a single log file to which records are written in an append-only fashion.<\/p>\n<p><strong>2. Create a Kafka topic<\/strong><br \/>\nThe Kafka cluster stores streams of records in categories called topics. Each record in a topic consists of a key, a value, and a timestamp. A topic can have zero, one, or many consumers that subscribe to the data written to it.<\/p>\n<p>Use <em>docker compose exec<\/em> to execute a command in a running container. The <em>docker compose exec<\/em> command allocates a TTY by default, so you can use it to get an interactive prompt. 
Go to the directory where the <em>docker-compose.yml<\/em> file is present, and run:<\/p>\n<blockquote><p><code>$docker compose exec kafka bash<\/code> <\/p><\/blockquote>\n<p>(for zookeeper, <em>$docker compose exec zookeeper bash<\/em>)<\/p>\n<p>Change the directory to \/opt\/kafka\/bin, where you will find scripts such as <em>kafka-topics.sh<\/em>.<\/p>\n<blockquote><p><code>cd \/opt\/kafka\/bin<\/code><\/p><\/blockquote>\n<p>Kafka topics are divided into a number of partitions. Each partition in a topic is an ordered, immutable sequence of records that is continually appended to.<\/p>\n<p>Create a new topic, or list or delete existing topics:<\/p>\n<blockquote><p><img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/create_kafka_topic-300x160.png\" alt=\"\" width=\"300\" height=\"160\" class=\"alignnone size-medium wp-image-2151\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/create_kafka_topic-300x160.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/create_kafka_topic.png 642w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p><\/blockquote>\n<blockquote><p><code>bash-4.4# .\/kafka-topics.sh --list --bootstrap-server localhost:9092<\/code> <\/p><\/blockquote>\n<p>If necessary, delete a topic using the following command.<\/p>\n<blockquote><p><code>bash-4.4# .\/kafka-topics.sh --delete --topic mytopic --bootstrap-server localhost:9092<\/code><\/p><\/blockquote>\n<p><strong>3. How to produce and consume messages from a Kafka topic using the command line?<\/strong><br \/>\nKafka producers are those client applications that publish (write) events to Kafka, and Kafka consumers are those that subscribe to (read and process) these events. A producer can use a partition key to direct messages to a specific partition. 
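To illustrate key-based routing, a simplified partitioner might map a record key to a partition as sketched below. Note this is an illustration only: Kafka\u2019s default partitioner actually uses murmur2 hashing, not Java\u2019s <em>hashCode<\/em>, but the idea is the same \u2013 equal keys always land on the same partition.

```java
public class SimplePartitioner {

    // Simplified illustration of key-based partition assignment.
    // Kafka's real default partitioner hashes the serialized key with
    // murmur2; here we use hashCode() to keep the sketch self-contained.
    public static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // Keyless records: real producers fall back to round-robin or
            // sticky assignment; we just pick partition 0 for simplicity.
            return 0;
        }
        // floorMod keeps the result in [0, numPartitions) even for
        // negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Equal keys always map to the same partition.
        System.out.println(
            partitionFor("user-42", 3) == partitionFor("user-42", 3)); // prints true
    }
}
```

Because the mapping is deterministic, all records for a given key preserve their relative order within one partition.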
If a producer doesn\u2019t specify a partition key when producing a record, Kafka will use a round-robin partition assignment.<\/p>\n<p>A Kafka broker (a.k.a. Kafka server\/node) is a server node in the cluster, mainly responsible for hosting partitions of Kafka topics, transferring messages from Kafka producers to Kafka consumers, and providing data replication and partitioning within a Kafka cluster.<\/p>\n<p>The following is a producer command line that reads data from standard input and writes it to a Kafka topic.<br \/>\n<img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/kafka_producer-300x174.png\" alt=\"\" width=\"300\" height=\"174\" class=\"alignnone size-medium wp-image-2155\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/kafka_producer-300x174.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/kafka_producer.png 658w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>The following is a consumer command line that reads data from a Kafka topic and writes it to standard output.<br \/>\n<img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/Kafka_consumer-300x171.png\" alt=\"\" width=\"300\" height=\"171\" class=\"alignnone size-medium wp-image-2156\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/Kafka_consumer-300x171.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/Kafka_consumer.png 652w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><strong>4. How to consume messages from a Kafka topic using a Java web application?<\/strong><\/p>\n<p>Another way of reading data from a Kafka topic is by using a <em>Java Spring Boot<\/em> application.<\/p>\n<p>The following demonstrates how to receive messages from a Kafka topic. First, I create a Spring Kafka consumer that listens for messages sent to a Kafka topic from the command line. 
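The console producer and consumer invocations shown in the two screenshots above follow this general shape; run them from \/opt\/kafka\/bin inside the kafka container, against the topic created earlier (a live broker on localhost:9092 is assumed):

```shell
# Read lines from standard input and publish each line to the topic
./kafka-console-producer.sh --topic mytopic --bootstrap-server localhost:9092

# Read records from the topic, starting at the earliest offset,
# and print them to standard output
./kafka-console-consumer.sh --topic mytopic --from-beginning \
  --bootstrap-server localhost:9092
```

Run the two commands in separate terminals: lines typed into the producer appear in the consumer.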
Then, I create a Spring Kafka producer, which is able to send messages to a Kafka topic.<\/p>\n<p>Download <em><a href=\"https:\/\/spring.io\/tools\" rel=\"noopener\" target=\"_blank\">Spring Tool Suite 4<\/a><\/em> and install it.<br \/>\nIn the Eclipse IDE\u2019s Package Explorer, click \u201c<em>Create new Spring Starter Project<\/em>\u201d and enter:<br \/>\nName: SpringBootKafka<br \/>\nType: Maven project<br \/>\nSpring Boot Version: 2.5.4<br \/>\nJava Version: 8<br \/>\nSearch \u201ckafka\u201d in the New Spring Starter Project dependencies and select \u201cSpring for Apache Kafka\u201d.<br \/>\nClick Finish.<\/p>\n<p>The Spring Initializr creates the following simple application class for you.<\/p>\n<p><em>SpringBootKafkaApplication.java<\/em><\/p>\n<p><img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/spring_initlizr-300x89.png\" alt=\"\" width=\"300\" height=\"89\" class=\"alignnone size-medium wp-image-2165\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/spring_initlizr-300x89.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/spring_initlizr-1024x305.png 1024w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/spring_initlizr-768x229.png 768w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/spring_initlizr.png 1102w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><br \/>\nThe <em>@SpringBootApplication<\/em> annotation is equivalent to using <em>@Configuration<\/em>, <em>@EnableAutoConfiguration<\/em>, and <em>@ComponentScan<\/em>. As a result, when we run this Spring Boot application, it will automatically scan the components in the current package and its sub-packages. It will thus register them in Spring\u2019s application context and allow us to inject beans using <em>@Autowired<\/em>.<\/p>\n<p><strong>Configure Kafka through the <em>application.yml<\/em> configuration file<\/strong><br \/>\nIn Spring Boot, properties are kept in the <em>application.properties<\/em> file on the classpath. 
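The consumer and producer described above can be sketched as follows. The class name, topic name, and group id are placeholders of my choosing, not values from this post; the spring-kafka dependency added by the starter project is required:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class KafkaMessageService {

    private final KafkaTemplate<String, String> kafkaTemplate;

    // Spring injects the auto-configured KafkaTemplate bean.
    public KafkaMessageService(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Consumer: invoked for every record published to the topic,
    // including messages typed into the console producer.
    @KafkaListener(topics = "mytopic", groupId = "my-group")
    public void listen(String message) {
        System.out.println("Received: " + message);
    }

    // Producer: sends a message to the topic via KafkaTemplate.
    public void send(String message) {
        kafkaTemplate.send("mytopic", message);
    }
}
```

Because <em>@Service<\/em> is picked up by component scanning, the listener starts automatically when the Spring Boot application runs; the broker address it connects to comes from the application.yml configuration described next.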
The <em>application.properties<\/em> file is located in the <em>src\/main\/resources<\/em> directory. Rename the <em>application.properties<\/em> file to <em>application.yml<\/em>, then add the following content.<br \/>\n<img loading=\"lazy\" src=\"http:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml-300x88.png\" alt=\"\" width=\"300\" height=\"88\" class=\"alignnone size-medium wp-image-2170\" srcset=\"https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml-300x88.png 300w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml-1024x302.png 1024w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml-768x226.png 768w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml-1536x453.png 1536w, https:\/\/sys4seq.com\/wp-content\/uploads\/2020\/04\/application_yml.png 1872w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><a href=\"https:\/\/adinasarapu.github.io\/posts\/2020\/01\/blog-post-kafka\/\" target=\"_blank\" rel=\"noopener\">&gt;&gt;&gt;<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"Building a real-time big data pipeline 1 : Kafka, RESTful, Java Updated on September 20, 2021 Apache Kafka is used for building real-time data pipelines and streaming apps. Kafka is a message broker, which helps transmit messages from one system to another. Zookeeper is required to run a Kafka Cluster. 
Apache ZooKeeper is primarily used","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[44,43],"tags":[46,45,47],"_links":{"self":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/posts\/574"}],"collection":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/comments?post=574"}],"version-history":[{"count":45,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/posts\/574\/revisions"}],"predecessor-version":[{"id":2171,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/posts\/574\/revisions\/2171"}],"wp:attachment":[{"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/media?parent=574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/categories?post=574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sys4seq.com\/index.php\/wp-json\/wp\/v2\/tags?post=574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}