
Which would be a quicker (and better) tool for querying data stored in the Parquet format: Spark SQL, Athena, or Elasticsearch?

I am currently building an ETL pipeline that outputs tables of data (on the order of 100+ GB) to a downstream interactive dashboard, which allows filtering the data dynamically (based on pre-defined & indexed filters).

I have zeroed in on using PySpark / Spark for the initial ETL phase. Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.

For the interactive querying part, I was wondering which tool might work best with my structured & transactional data (stored in Parquet format):

  1. Spark SQL (in memory dynamic querying)
  2. AWS Athena (Serverless SQL querying, based on Presto)
  3. Elasticsearch (search engine)
  4. Redis (Key Value DB)

Feel free to suggest alternative tools, if you know of a better option.

vsdaking asked Dec 28 '17


2 Answers

Based on the information you've provided, I am going to make several assumptions:

  1. You are on AWS (hence Elasticsearch and Athena being options). Therefore, I will steer you to AWS documentation.
  2. As you have pre-defined and indexed filters, you have well ordered, structured data.

Going through the options listed:

  1. Spark SQL - If you are already considering Spark and you are already on AWS, then you can leverage AWS Elastic MapReduce (EMR).
  2. AWS Athena (serverless SQL querying, based on Presto) - Athena is a powerful tool. It lets you query data stored on S3, which is quite cost-effective. However, building workflows in Athena can require a bit of work, as you'll spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often worked best as the final stage in a big data pipeline. However, with support for CTAS statements, you can now output data in multiple formats such as Parquet, with multiple compression algorithms.
  3. Elasticsearch (search engine) - Elasticsearch is not really a general-purpose query tool, so it is likely not part of the core of this pipeline.
  4. Redis (key-value DB) - Redis is an in-memory key-value data store. It is generally used to provide small bits of information to be rapidly consumed by applications, in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands-on experience with Redis, I recommend Try Redis.

I would also look into Amazon Redshift.

For further reading, read Big Data Analytics Options on AWS.

As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.

Zerodf answered Oct 07 '22


Athena is not limited to .csv. In fact, using compressed binary formats like Parquet is a best practice for use with Athena, because it substantially reduces query times and cost. I have used AWS Firehose, Lambda functions, and Glue crawlers to convert text data to a compressed binary format for querying via Athena. When I have had issues with processing large data volumes, the problem was forgetting to raise the default Athena limits set for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics; he did encounter scaling problems with Athena, but that was in its early days.
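As a sketch of that conversion pattern, an Athena CTAS statement can rewrite query output to S3 as compressed Parquet. The database, table, bucket, and column names below are all invented:

```python
# Hypothetical Athena CTAS statement (database, table, bucket and columns invented).
# Athena writes the result to S3 as Snappy-compressed Parquet instead of CSV.
ctas_query = """
CREATE TABLE analytics.orders_summary
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/summaries/orders/'
) AS
SELECT region, COUNT(*) AS n_orders, AVG(order_total) AS avg_total
FROM analytics.orders
GROUP BY region
"""

# You would submit this with boto3, e.g.:
#   boto3.client("athena").start_query_execution(
#       QueryString=ctas_query,
#       ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#   )
```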

I also work with Elasticsearch and Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on Elasticsearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format. A terabyte of JSON text data reduces down to approximately 30 GB or less in Parquet format. Our developers are more productive when they use Elasticsearch/Kibana to analyze problems in their log files, because Elasticsearch and Kibana are so easy to use. The Curator Lambda function that controls log retention times, part of AWS Centralized Logging, is also very convenient.

Terry answered Oct 07 '22