I am currently building an ETL pipeline that outputs tables of data (on the order of 100+ GB) to a downstream interactive dashboard, which allows filtering the data dynamically (based on pre-defined & indexed filters).
I have zeroed in on PySpark / Spark for the initial ETL phase. Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.
For the interactive querying part, I was wondering which tool might work best with my structured & transactional data (stored in Parquet format) -
Feel free to suggest alternative tools, if you know of a better option.
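To make the write side concrete, here is a minimal PySpark sketch of that ETL/summarisation step, assuming hypothetical S3 paths and column names (`region`, `order_date`, `order_value`); the summary is partitioned by the pre-defined filter columns so the downstream query engine can prune partitions when filtering:

```python
# Minimal sketch: summarise raw data and write partitioned Parquet so a
# downstream engine (e.g. Athena) can prune on the pre-defined filter columns.
# All paths and column names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-summarise").getOrCreate()

raw = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical input

summary = (
    raw.groupBy("region", "order_date")                 # the pre-defined, indexed filters
       .agg(F.count("*").alias("order_count"),
            F.avg("order_value").alias("avg_order_value"))
)

(summary.write
        .mode("overwrite")
        .partitionBy("region", "order_date")            # enables partition pruning on filtered queries
        .parquet("s3://my-bucket/curated/order_summary/"))
```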
Querying the Parquet file from AWS Athena: now that the data and the metadata are created, we can use AWS Athena to query the Parquet file.
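As a rough illustration of that query step from Python, here is a boto3 sketch; the database (`analytics_db`), table (`order_summary`), and results bucket are placeholder names, and the table is assumed to already be registered in the Glue/Athena catalog:

```python
# Sketch: run an Athena query over a Parquet-backed table with boto3.
# Database, table, and S3 output location are assumed placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT region, SUM(order_count) AS orders "
                "FROM order_summary GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```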
Parquet offers higher query execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.
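If you want to sanity-check the disk-space claim on your own data, one rough way is to write the same DataFrame in both formats and compare the bytes on disk (the paths below are placeholders):

```python
# Rough size comparison: write the same DataFrame as JSON and as
# (snappy-compressed) Parquet, then compare total bytes on disk.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-size-check").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/orders/")  # any sample DataFrame works

df.write.mode("overwrite").json("/tmp/orders_json")
df.write.mode("overwrite").parquet("/tmp/orders_parquet")

def dir_size(path):
    """Total size in bytes of all files under `path`."""
    return sum(os.path.getsize(os.path.join(root, name))
               for root, _, files in os.walk(path) for name in files)

print("JSON bytes   :", dir_size("/tmp/orders_json"))
print("Parquet bytes:", dir_size("/tmp/orders_parquet"))
```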
Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Amazon RDS for Aurora, on the other hand, is described as a "MySQL and PostgreSQL compatible relational database with several times better performance".
Amazon Athena is Amazon Web Services' fastest-growing service, driven by increasing adoption of AWS data lakes and the simple, seamless model Athena offers for querying huge datasets stored on Amazon S3 using regular SQL.
Based on the information you've provided, I am going to make several assumptions:
Going through the options listed:
I would also look into Amazon Redshift.
For further reading, see Big Data Analytics Options on AWS.
As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.
Athena is not limited to .csv. In fact, using compressed binary formats like Parquet is a best practice with Athena, because they substantially reduce query times and cost. I have used AWS Firehose, Lambda functions, and Glue crawlers to convert text data to a compressed binary format for querying via Athena (a minimal crawler sketch follows below). When I have had issues processing large data volumes, the problem was that I had forgotten to raise the default Athena service limits for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics; he did encounter scaling problems with Athena, but that was in its early days.
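For the Glue-crawler step, a minimal boto3 sketch looks roughly like this; the crawler name, IAM role ARN, database name, and S3 path are all placeholders to replace with your own:

```python
# Sketch: register a Parquet prefix with a Glue crawler so Athena can query it.
# The crawler name, role ARN, database, and S3 path are assumed placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="order-summary-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/curated/order_summary/"}]},
)

glue.start_crawler(Name="order-summary-crawler")
```

After the crawler runs, the resulting table can be queried from the Athena console or programmatically, as in the earlier boto3 example.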
I also work with ElasticSearch and Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on ElasticSearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format: a terabyte of JSON text data reduces to approximately 30 GB or less in Parquet format. Our developers are more productive when they use ElasticSearch/Kibana to analyze problems in their log files, because ElasticSearch and Kibana are so easy to use. The curator Lambda function that controls log retention times, part of the AWS Centralized Logging solution, is also very convenient.