How to make Spark Streaming (Spark 1.0.0) read the latest data from Kafka (Kafka Broker 0.8.1)

My Spark Streaming application fetches data from Kafka and processes it.

If the application fails, a huge amount of data accumulates in Kafka, and on the next start-up of the Spark Streaming application it crashes because too much data is consumed at once. Since my application does not care about past data, it is totally fine to consume only the current (latest) data.

I found "auto.reset.offest" option and it behaves little different in Spark. It deletes the offsets stored in zookeeper, if it is configured. Despite however, its unexpected behavior, it is supposed to fetch data from the latest after deletion.

But I found that it does not. I verified that all the offsets are cleaned up before any data is consumed, so given the default behavior (start from the largest offset when no committed offset exists) it should fetch only the latest data. Yet it still crashes due to too much data.

When I clean up the offsets, consume from the latest using kafka-console-consumer, and then run my application, it works as expected.

So it looks "auto.reset.offset" does not work and kafka consumer in spark streaming fetches data from the "smallest" offset as default.

Do you have any idea how to consume Kafka data from the latest offset in Spark Streaming?

I am using spark-1.0.0 and kafka_2.10-0.8.1.

Thanks in advance.

asked Aug 26 '14 by style95
1 Answer

I think you misspelled the property name. The correct key is auto.offset.reset, not auto.reset.offest.

More info here : http://kafka.apache.org/documentation.html#configuration
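As a rough sketch (the ZooKeeper address, group id, and topic name below are placeholders, not from the question), the corrected property can be passed through the kafkaParams map of Spark Streaming's KafkaUtils.createStream. With the Kafka 0.8 high-level consumer, setting auto.offset.reset to "largest" makes a consumer group that has no committed offsets start from the newest messages instead of the oldest:

    import kafka.serializer.StringDecoder

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object LatestOffsetStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-latest-offset")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Kafka 0.8 high-level consumer properties.
        // "largest" starts a group with no committed offsets at the newest
        // message instead of the oldest ("smallest").
        val kafkaParams = Map(
          "zookeeper.connect" -> "zkhost:2181",       // placeholder ZooKeeper quorum
          "group.id"          -> "my-consumer-group", // placeholder consumer group
          "auto.offset.reset" -> "largest"
        )

        // Placeholder topic, consumed with one receiver thread.
        val topics = Map("my-topic" -> 1)

        val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)

        // Placeholder processing: just count the messages in each batch.
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that this only affects where a group starts when no offset is stored; if the group already has committed offsets in ZooKeeper, they take precedence.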

Hope this helps.

answered Nov 15 '22 by ajnavarro