
Amazon EMR and Spark streaming

Setup: Amazon EMR, Apache Spark 2.3, Apache Kafka, ~10 million records per day.

Apache Spark processes events in 5-minute batches. Once per day the worker nodes die and AWS automatically reprovisions them. According to the log messages, the nodes run out of disk space, even though they have about 1 TB of storage.

Has anyone run into storage space issues in cases where there should be more than enough?

I suspect that log aggregation is failing to copy the logs to the S3 bucket properly; as far as I can see, the Spark process should do that automatically.

What information should I provide to help resolve this issue?

Thank you in advance!

asked Oct 18 '18 18:10 by oivoodoo




1 Answer

I had a similar issue with a Structured Streaming app on EMR: disk usage grew rapidly until the application stalled or crashed.

In my case the fix was to disable the Spark event log by setting:

spark.eventLog.enabled to false

http://queirozf.com/entries/spark-streaming-commong-pitfalls-and-tips-for-long-running-streaming-applications#aws-emr-only-event-logs-under-hdfs-var-log-spark-apps-when-using-a-history-server
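One way to apply this setting is to pass it to spark-submit with --conf. Below is a minimal Python sketch that only builds the command line; the helper name build_spark_submit and the application path streaming_job.py are hypothetical, and your cluster may set the property elsewhere (for example in spark-defaults.conf).

```python
import shlex


def build_spark_submit(app_path, extra_conf=None):
    """Return a spark-submit argument list with the Spark event log disabled.

    app_path and extra_conf are illustrative inputs, not part of the original answer.
    """
    conf = dict(extra_conf or {})
    # Force the event log off, even if extra_conf tried to enable it.
    conf["spark.eventLog.enabled"] = "false"

    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # "streaming_job.py" is a hypothetical application path.
    print(shlex.join(build_spark_submit("streaming_job.py")))
```

Keep in mind that with the event log disabled, the Spark History Server has nothing to display for the application, so this trades post-mortem visibility for disk space.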

answered Sep 24 '22 02:09 by bp2010