I use Kafka as a message queue and for processing. My question is about performance and best practice. I will run my own performance tests, but maybe someone already has results or experience to share.
The raw data sits in a Kafka (0.10) topic, and I want to get it into Elasticsearch and HDFS in structured form.
Now I see two possibilities:

1. Logstash: Kafka input plugin, grok filter for parsing, Elasticsearch and WebHDFS output plugins
2. Kafka Streams for parsing, plus Kafka Connect sinks for Elasticsearch and HDFS

Without having run any tests, I would say the second option is better/cleaner and more reliable. Is that correct?
Logstash "best practice" for getting data into Elasticsearch. WebHDFS won't have the raw performance of the Java API that is part of the Kafka Connect plugin, however.
Grok could be done in a Kafka Streams process, so your parsing could be done in either location.
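As a rough sketch of that parsing step: the example below uses the current Streams API (which differs slightly from the 0.10 API you're on), and the topic names and line format are assumptions on my part, not anything from your question:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawToStructured {

    // Hypothetical raw line format: "<timestamp> <level> <message>"
    private static final Pattern LINE = Pattern.compile("^(\\S+) (\\S+) (.*)$");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "raw-to-structured");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-logs"); // assumed input topic

        raw.mapValues(RawToStructured::toJson)
           .filter((key, value) -> value != null) // drop lines that did not match
           .to("structured-logs");                // assumed output topic, read by the sinks

        new KafkaStreams(builder.build(), props).start();
    }

    // Grok-style parsing: turn one raw line into a small JSON document
    private static String toJson(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return String.format("{\"timestamp\":\"%s\",\"level\":\"%s\",\"message\":\"%s\"}",
                m.group(1), m.group(2), m.group(3).replace("\"", "\\\""));
    }
}
```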
If you are on an Elastic subscription, they would like to sell you Logstash; Confluent would like to sell you Kafka Streams + Kafka Connect.
Avro seems to be the best medium for data transfer, and the Schema Registry is a popular way to manage the schemas for it. As far as I understand, Logstash doesn't work well with a Schema Registry or Avro, and prefers JSON.
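For illustration, producing the structured records as Avro through the Schema Registry looks roughly like this; the registry URL, topic name, and schema are placeholders, and Confluent's KafkaAvroSerializer handles the schema registration/lookup:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up schemas in the Schema Registry
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

        // A minimal schema for the structured records; the fields are illustrative only
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
                + "{\"name\":\"timestamp\",\"type\":\"string\"},"
                + "{\"name\":\"level\",\"type\":\"string\"},"
                + "{\"name\":\"message\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("timestamp", "2017-01-01T00:00:00Z");
        record.put("level", "INFO");
        record.put("message", "structured from raw");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("structured-logs", record));
        }
    }
}
```

The Connect sinks can then consume those Avro records directly, which is where this setup fits together more cleanly than the Logstash/JSON route.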
In the Hadoop landscape, I would also offer Apache NiFi or StreamSets as middle-ground options.
In the end, it really depends on your priorities, and how well you (and your team) can support these tools.