
Which would be a quicker (and better) tool for querying data stored in the Parquet format: Spark SQL, Athena, or Elasticsearch?

I am currently building an ETL pipeline that outputs tables of data (on the order of 100+ GB) to a downstream interactive dashboard, which allows filtering the data dynamically (based on pre-defined & indexed filters).

I have zeroed in on using PySpark / Spark for the initial ETL phase. Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.

For the interactive querying part, I was wondering which tool might work best with my structured & transactional data (stored in Parquet format):

  1. Spark SQL (in memory dynamic querying)
  2. AWS Athena (Serverless SQL querying, based on Presto)
  3. Elasticsearch (search engine)
  4. Redis (Key Value DB)

Feel free to suggest alternative tools, if you know of a better option.

vsdaking asked Dec 28 '17


2 Answers

Based on the information you've provided, I am going to make several assumptions:

  1. You are on AWS (hence Elasticsearch and Athena being options). Therefore, I will steer you to AWS documentation.
  2. As you have pre-defined and indexed filters, you have well ordered, structured data.

Going through the options listed:

  1. Spark SQL - If you are already considering Spark and you are already on AWS, then you can leverage AWS Elastic MapReduce (EMR).
  2. AWS Athena (serverless SQL querying, based on Presto) - Athena is a powerful tool. It lets you query data stored on S3, which is quite cost-effective. However, building workflows in Athena can require a bit of work, as you'll spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often worked best as the final stage in a big data pipeline. However, with support for CTAS statements, you can now output data in multiple formats such as Parquet, with multiple compression algorithms.
  3. Elasticsearch (search engine) - Elasticsearch is not really a general-purpose query tool, so it is likely not part of the core of this pipeline.
  4. Redis (key-value DB) - Redis is an in-memory key-value data store. It is generally used to provide small bits of information to be rapidly consumed by applications, in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands-on experience with Redis, I recommend Try Redis.

I would also look into Amazon Redshift.

For further reading, read Big Data Analytics Options on AWS.

As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.

Zerodf answered Oct 07 '22


Athena is not limited to .csv. In fact, using compressed binary formats like Parquet is a best practice for use with Athena, because it substantially reduces query times and cost. I have used AWS Firehose, Lambda functions, and Glue crawlers to convert text data to a compressed binary format for querying via Athena. When I have had issues with processing large data volumes, the problem was forgetting to raise the default Athena limits set for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics; he did encounter scaling problems with Athena, but that was in its early days.
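As a sketch of that conversion pattern, an Athena CTAS statement can rewrite query output to S3 as compressed Parquet. The database, table, bucket, and column names below are all invented:

```python
# Hypothetical Athena CTAS statement (database, table, bucket and columns invented).
# Athena writes the result to S3 as Snappy-compressed Parquet instead of CSV.
ctas_query = """
CREATE TABLE analytics.orders_summary
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/summaries/orders/'
) AS
SELECT region, COUNT(*) AS n_orders, AVG(order_total) AS avg_total
FROM analytics.orders
GROUP BY region
"""

# You would submit this with boto3, e.g.:
#   boto3.client("athena").start_query_execution(
#       QueryString=ctas_query,
#       ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#   )
```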

I also work with Elasticsearch and Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on Elasticsearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format. A terabyte of JSON text data reduces down to approximately 30 GB or less in Parquet format. Our developers are more productive when they use Elasticsearch/Kibana to analyze problems in their log files, because Elasticsearch and Kibana are so easy to use. The Curator Lambda function that controls log retention times, part of AWS Centralized Logging, is also very convenient.

Terry answered Oct 07 '22