 

Kafka to Elasticsearch, HDFS with Logstash or Kafka Streams/Connect

I use Kafka as a message queue and for processing. My question is about performance and best practice. I will run my own performance tests, but maybe someone already has results or experience to share.

The data arrives raw in a Kafka (0.10) topic, and I want to transfer it in structured form to Elasticsearch and HDFS.

Now I see 2 possibilities:

  • Logstash (Kafka input plugin, grok filter (parsing), ES/webhdfs output plugin)
  • Kafka Streams (parsing), Kafka Connect (ES sink, HDFS sink)
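
The first option could be sketched as a single Logstash pipeline; the topic, index, NameNode host, and grok pattern below are hypothetical placeholders, not taken from the question:

```
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["raw-logs"]                            # hypothetical topic name
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }  # assumes Apache-style log lines
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  webhdfs {
    host => "namenode.example.com"                    # hypothetical NameNode
    port => 50070
    path => "/logs/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "hdfs"
  }
}
```

One pipeline covers both sinks, which is the main operational appeal of this option.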

Without having run any tests, I would say that the second option is better, cleaner, and more reliable. Is that correct?

asked Nov 08 '22 by imehl

1 Answer

Logstash is often considered the "best practice" for getting data into Elasticsearch. WebHDFS won't have the raw performance of the Java API that the Kafka Connect HDFS sink uses, however.

Grok-style parsing could also be done in a Kafka Streams process, so the parsing step can live in either place.
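
As a sketch of the Streams-side parsing, the grok work boils down to a regex-to-map function that a topology could apply via `mapValues`. The log format, field names, and topic names here are assumptions for illustration, not from the question:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
    // Hypothetical access-log format: "<ip> <method> <path> <status>"
    private static final Pattern LINE =
        Pattern.compile("(?<ip>\\S+) (?<method>\\S+) (?<path>\\S+) (?<status>\\d{3})");

    // Grok-style parse: raw line in, named fields out (null if the line doesn't match)
    public static Map<String, String> parse(String raw) {
        Matcher m = LINE.matcher(raw);
        if (!m.matches()) return null;
        Map<String, String> fields = new HashMap<>();
        fields.put("ip", m.group("ip"));
        fields.put("method", m.group("method"));
        fields.put("path", m.group("path"));
        fields.put("status", m.group("status"));
        return fields;
    }

    public static void main(String[] args) {
        // In a real Streams topology this function would run inside something like:
        //   builder.stream("raw-logs").mapValues(LogParser::parse).to("parsed-logs");
        System.out.println(parse("10.0.0.1 GET /index.html 200"));
    }
}
```

The parsed topic can then feed the Connect ES and HDFS sinks without either sink needing to know the raw format.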

If you are on an Elastic subscription, then they would like to sell Logstash. Confluent would like to sell Kafka Streams + Kafka Connect.

Avro seems to be the best medium for data transfer, and the Schema Registry is a popular way to do that. IIUC, Logstash doesn't work well with a Schema Registry or Avro, and prefers JSON.
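
For the Connect side, an Elasticsearch sink reading Avro through the Schema Registry might be configured like this; the connector name, topic, and URLs are hypothetical:

```
name=es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=parsed-logs
connection.url=http://localhost:9200
key.ignore=true
# Avro via Schema Registry (Confluent converters)
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
```

With Avro and the registry in place, the same converter settings carry over to the HDFS sink, which is part of why this stack fits together cleanly.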


In the Hadoop landscape, I would offer the intermediate options of Apache NiFi or StreamSets.

In the end, it really depends on your priorities, and how well you (and your team) can support these tools.

answered Nov 15 '22 by OneCricketeer