 

Using Airflow to run Spark streaming jobs?

We have Spark batch jobs and Spark streaming jobs in our Hadoop cluster.

We would like to schedule and manage them both on the same platform.

We came across Airflow, which fits our need for a "platform to author, schedule, and monitor workflows".

I just want to be able to stop and start Spark streaming jobs. Airflow's graphs and profiling are less of an issue.

My question is: besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark streaming jobs?

I came across this question: Can Airflow be used to run a never-ending task?

It says it's possible, but not why you shouldn't.

asked Feb 20 '19 by Gilad

People also ask

Can Airflow be used for Streaming jobs?

Airflow allows you to manage your data pipelines by authoring workflows as task-based Directed Acyclic Graphs (DAGs). Does Apache Airflow stream data? No, it is not a streaming solution. Tasks do not transfer data from one to the other (though they can exchange metadata!).
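To make the metadata point concrete, here is a minimal sketch using Airflow 2.x's TaskFlow API (the DAG id, values, and path are made up for illustration): the first task's small return value is passed to the second task as an XCom, while any bulk data would stay in external storage.

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2019, 2, 20), schedule_interval="@daily", catchup=False)
def metadata_exchange():
    @task
    def produce():
        # The return value is stored as XCom metadata, not bulk data.
        return {"rows_written": 42, "path": "/out/latest"}

    @task
    def consume(meta: dict):
        print(f"upstream wrote {meta['rows_written']} rows to {meta['path']}")

    consume(produce())

metadata_exchange()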

Can you use Airflow with Databricks?

You define a workflow in a Python file, and Airflow manages the scheduling and execution. The Airflow Databricks integration lets you take advantage of the optimized Spark engine offered by Databricks along with Airflow's scheduling features.
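As a rough illustration, a DAG using the Databricks provider's DatabricksSubmitRunOperator might look like the sketch below; the connection id, cluster spec, and file path are assumptions, not details from the question.

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("databricks_spark_job", start_date=datetime(2019, 2, 20),
         schedule_interval="@daily", catchup=False) as dag:
    # Submits a one-time run to Databricks via the runs/submit API.
    DatabricksSubmitRunOperator(
        task_id="run_spark_job",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {"python_file": "dbfs:/jobs/my_job.py"},
        },
    )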

How do I create a Spark connection in Airflow?

On the Spark downloads page you can download the tgz file and unzip it on the machine that hosts Airflow. Set SPARK_HOME in your .bashrc and add it to the system PATH. Finally, add the pyspark package to the environment where Airflow runs.


2 Answers

@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, then a second DAG as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
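A minimal sketch of that two-DAG setup, assuming Airflow 2.x import paths; the DAG ids, schedules, application name my_stream, and the spark-submit command are all illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("start_stream", start_date=datetime(2019, 2, 20),
         schedule_interval=None, catchup=False) as start_dag:
    # Submit the streaming app; returns right after submission in cluster
    # mode with spark.yarn.submit.waitAppCompletion=false (see the other answer).
    BashOperator(
        task_id="submit_stream",
        bash_command="spark-submit --master yarn --deploy-mode cluster "
                     "--conf spark.yarn.submit.waitAppCompletion=false "
                     "--name my_stream /jobs/stream_job.py",
    )

with DAG("check_stream", start_date=datetime(2019, 2, 20),
         schedule_interval="*/10 * * * *", catchup=False) as check_dag:
    # Fails when the app is not in YARN's RUNNING list.
    health_check = BashOperator(
        task_id="health_check",
        bash_command="yarn application -list -appStates RUNNING | grep my_stream",
    )
    # Runs only if the check failed, and restarts the stream.
    restart = TriggerDagRunOperator(
        task_id="restart_stream",
        trigger_dag_id="start_stream",
        trigger_rule=TriggerRule.ONE_FAILED,
    )
    health_check >> restart

Note that a failed check still marks the check_stream run as failed, which conveniently doubles as an alert.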

Alternatively, you can run the stream with a run-once trigger [1].

# Load the streaming DataFrame (my_schema is defined elsewhere)
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)
# Perform transformations, then write with a run-once trigger:
# process everything new since the last run, then terminate.
sdf.writeStream.trigger(once=True) \
    .option("checkpointLocation", "/out/checkpoint") \
    .start(path="/out/path", format="parquet")

This gives you all the benefits of Spark streaming, with the flexibility of batch processing.

You can simply point the stream at your data, and the job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, then terminate. You can set your Airflow DAG's schedule to suit whatever lag you'd like to process data at (every minute, hour, etc.).

I wouldn't recommend this for low-latency requirements, but it's very well suited to running every minute.
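For the scheduling side, a sketch of the Airflow DAG might look like this, using the Spark provider's SparkSubmitOperator; the connection id, application path, and one-minute schedule are assumptions:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("stream_as_batch", start_date=datetime(2019, 2, 20),
         schedule_interval="* * * * *",  # every minute
         catchup=False, max_active_runs=1) as dag:
    # Runs the trigger-once job above; it processes new files and exits.
    SparkSubmitOperator(
        task_id="run_stream_batch",
        application="/jobs/stream_batch.py",
        conn_id="spark_default",
    )

Setting max_active_runs=1 keeps runs from piling up if one batch takes longer than the schedule interval.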

[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

answered Nov 01 '22 by Ryan


Using Airflow's branching functionality, we can have one DAG that does both the scheduling and the monitoring of our streaming job. The DAG does a status check of the application, and if the application is not running, the DAG submits the streaming job. Otherwise the DAG run can finish, or you can add a sensor which will check the streaming job's status after some time, with alerts and whatever else you need. A minimal sketch of such a DAG follows the two caveats below.

There are two main problems:

  1. Submit the streaming application without waiting until it finishes; otherwise our operator will run until it reaches execution_timeout.

This can be solved by submitting the streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false.

  2. Check the status of our streaming application.

We can check the streaming application's status using YARN. For example, we can use the command yarn application -list -appStates RUNNING. If our application appears in the list of running applications, we should not trigger our streaming job again. The only requirement is to make the streaming job's name unique.
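Putting it together, here is a minimal sketch of such a branching DAG, assuming Airflow 2.3+ import paths; the application name, schedule, and spark-submit arguments are illustrative:

from datetime import datetime
from subprocess import PIPE, run
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

APP_NAME = "my_stream"  # assumed unique YARN application name

def choose_branch():
    # Ask YARN for running applications and branch on the result.
    apps = run(["yarn", "application", "-list", "-appStates", "RUNNING"],
               stdout=PIPE, check=True).stdout.decode()
    return "already_running" if APP_NAME in apps else "submit_stream"

with DAG("manage_stream", start_date=datetime(2019, 2, 20),
         schedule_interval="*/10 * * * *", catchup=False) as dag:
    check_status = BranchPythonOperator(
        task_id="check_status", python_callable=choose_branch)
    submit_stream = BashOperator(
        task_id="submit_stream",
        # Cluster mode + waitAppCompletion=false returns right after submit.
        bash_command="spark-submit --master yarn --deploy-mode cluster "
                     "--conf spark.yarn.submit.waitAppCompletion=false "
                     f"--name {APP_NAME} /jobs/stream_job.py",
    )
    already_running = EmptyOperator(task_id="already_running")
    check_status >> [submit_stream, already_running]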

answered Nov 01 '22 by Aleksejs R