First I was considering what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and the event processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka.
I also found, on the same page and in the Kafka documentation, that there is something called Camus.
I'm interested in which would be the better (and easier, better-documented) solution for this. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler, high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks
The HDFS connector allows you to export data from Kafka topics to HDFS files in a variety of formats and integrates with Hive to make data immediately available for querying with HiveQL.
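As a rough sketch of what that looks like in practice, a sink connector is driven by a properties file roughly like the one below (the connector class is Confluent's HDFS connector; the topic name, HDFS URL, and Hive metastore URI are placeholders, not values from this thread):

# hypothetical hdfs-sink.properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# topic(s) to export to HDFS
topics=testkafka
# placeholder namenode URL
hdfs.url=hdfs://localhost:8020
# number of records to write before committing a file
flush.size=100
# optionally register the resulting files as Hive partitions
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD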
Apache Kafka is a distributed streaming system that is emerging as the preferred solution for integrating real-time data from multiple stream-producing sources and making that data available to multiple stream-consuming systems concurrently – including Hadoop targets such as HDFS or HBase.
Frameworks like Kafka and Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage. In the same way, Kafka, as an independent entity, can work with Spark. Kafka stores its messages on the local file system.
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink, so it's just a matter of changing a properties file. An example is given below.
Steps:
Create a Kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
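You can optionally verify that the topic exists (assuming the same local ZooKeeper as above):

kafka-topics --describe --zookeeper localhost:2181 --topic testkafka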
Write to the above-created topic using the Kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
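Each line typed into the console producer becomes one Kafka message; for example (arbitrary sample events, not from the original post):

> {"event": "login", "user": "alice"}
> {"event": "logout", "user": "bob"}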
Configure a Flume agent with the following properties:
# Name the source, channel, and sink for agent "flume1"
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1

# Kafka source: consumes the testkafka topic via ZooKeeper
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1

# HDFS sink: writes plain-text files, partitioned by topic and date
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
# Roll files every 100 events rather than by size
flume1.sinks.hdfs-sink-1.hdfs.rollCount = 100
flume1.sinks.hdfs-sink-1.hdfs.rollSize = 0

# Memory channel buffering events between source and sink
flume1.channels.hdfs-channel-1.type = memory
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the Flume agent:
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
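You can verify the result with the standard HDFS shell (the date directory below is only an example; it will reflect the local timestamp at write time):

hdfs dfs -ls /tmp/kafka/testkafka/16-06-17
hdfs dfs -cat /tmp/kafka/testkafka/16-06-17/test-events.*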