
Why Parquet over an RDBMS like Postgres?

I'm building a data architecture for my company: a simple ETL pipeline over internal and external data, with the aim of producing static dashboards and other tools for trend analysis.

I'm thinking through each step of the ETL process one by one, and now I'm questioning the Load part.

I plan to use Spark (LocalExecutor in dev and a managed service on Azure in production), so I started thinking about writing Parquet files to a blob storage service. I know all the advantages of Parquet over CSV and other storage formats, and I really love this piece of technology. Most of the articles I read about Spark finish with a df.write.parquet(...).
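Concretely, the Load step I have in mind is something like this (a rough sketch; the storage account, container, and paths are placeholders):

    from pyspark.sql import SparkSession

    # Local master in dev; in production this would run on the Azure Spark service.
    spark = SparkSession.builder.master("local[*]").appName("etl-load").getOrCreate()

    # Whatever the Transform step produced.
    df = spark.read.json("staging/events/")

    # abfss:// targets Azure Data Lake Storage Gen2; the container and
    # account names here are made up.
    df.write.mode("append").parquet(
        "abfss://datalake@myaccount.dfs.core.windows.net/events/"
    )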

But I can't figure out why I shouldn't just spin up a Postgres instance and save everything there. I understand that we are not producing 100 GB of data per day, but I want to build something future-proof in a fast-growing company that is going to produce exponentially more data, both from the business and from the logs and metrics we are starting to record more and more of.

Any pros/cons from more experienced devs?

EDIT: What also makes me question this is this tweet: https://twitter.com/markmadsen/status/1044360179213651968

asked Sep 18 '19 by Ragnar


People also ask

What is the advantage of a Parquet file?

Parquet is optimized for working with complex data in bulk and offers several efficient compression schemes and encoding types. This approach is especially good for queries that need to read only certain columns from a large table: Parquet reads just the needed columns, greatly minimizing I/O (see the sketch after these questions).

Why is PostgreSQL better than other databases?

PostgreSQL has a lot of capability. Built using an object-relational model, it supports complex structures and a breadth of built-in and user-defined data types. It provides extensive data capacity and is trusted for its data integrity.

Why does Parquet perform better than other file formats?

Parquet has higher execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.

Why is Parquet faster?

In short, being column-oriented, Parquet brings all the efficient storage characteristics (e.g., blocks, row groups, column chunks) to the table. Apache Parquet is built from the ground up using Google's record shredding and assembly algorithm, and Parquet files were designed with complex nested data structures in mind.
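To make the last two answers concrete, here is a minimal PySpark sketch (column names and paths are illustrative). Only the selected columns, one of them a nested field, are actually read from disk:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Nested records: Parquet stores address.city as its own column chunk.
    rows = [Row(user_id=1, address=Row(city="Paris", zip="75001")),
            Row(user_id=2, address=Row(city="Lyon", zip="69001"))]
    spark.createDataFrame(rows).write.mode("overwrite").parquet("tmp/users/")

    # Only the two selected columns are decoded; every other column's
    # pages are skipped entirely.
    spark.read.parquet("tmp/users/").select("user_id", "address.city").show()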




3 Answers

The main trade-off is one of cost and transactional semantics.

Using a DBMS means you can load data transactionally, but you also pay for both storage and compute on an ongoing basis. Storing the same amount of data is going to be more expensive in a managed DBMS than in a blob store.
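For example, a daily reload into Postgres can be made all-or-nothing; a rough sketch using psycopg2, with an invented table and invented connection details:

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect("dbname=warehouse user=etl host=db.example.com")

    try:
        with conn:  # one transaction: commit on success, rollback on any error
            with conn.cursor() as cur:
                cur.execute("DELETE FROM daily_metrics WHERE day = %s",
                            ("2019-09-18",))
                cur.executemany(
                    "INSERT INTO daily_metrics (day, metric, value)"
                    " VALUES (%s, %s, %s)",
                    [("2019-09-18", "signups", 42),
                     ("2019-09-18", "orders", 17)],
                )
        # Readers never see a half-reloaded day.
    finally:
        conn.close()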

It is also harder to scale out processing on a DBMS (it appears the largest Postgres server Azure offers has 64 vCPUs). By storing data in an RDBMS you are likely to run up against I/O or compute bottlenecks more quickly than you would with Spark + blob storage. However, for many datasets this might not be an issue, and as the tweet points out, if you can accomplish everything inside the DB with SQL then it is a much simpler architecture.

If you store Parquet files on a blob store, updating existing data is difficult without regenerating a large segment of your data (and, while I don't know the details of Azure, this generally can't be done transactionally). On the other hand, compute costs stay separate from storage costs.
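The usual workaround is to rewrite whole partitions rather than individual rows; roughly like this sketch (the partition column and paths are invented), keeping in mind that a failure mid-write can still leave partial files behind:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Only replace the partitions present in the incoming data,
    # not the whole dataset.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    updated = spark.read.parquet("staging/corrections/")

    updated.write.mode("overwrite") \
        .partitionBy("event_date") \
        .parquet("lake/events/")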

answered by Micah Kornfield


Storing data in Hadoop using raw file formats is terribly inefficient. Parquet is a columnar file format well suited for querying large amounts of data quickly. As you said above, writing data to Parquet from Spark is pretty easy. Writing data with a distributed processing engine (Spark) to a distributed file system (Parquet + HDFS) also makes the entire flow seamless. This architecture is well suited for OLAP-type data.

Postgres, on the other hand, is a relational database. While it is good for storing and analyzing transactional data, it cannot be scaled horizontally as easily as HDFS can. Hence, when writing or querying large amounts of data from Spark to/on Postgres, the database can become a bottleneck. But if the data you are processing is OLTP-type, then you can consider this architecture.

Hope this helps

answered by Pushkin


One of the issues I have with a dedicated Postgres server is that it's a fixed resource that's on 24/7. If it's idle for 22 hours per day and under heavy load for 2 hours per day (in particular if those hours aren't contiguous and are unpredictable), then any fixed server size is too small during those 2 hours and too big during the other 22.

If you store your data as parquet on Azure Data Lake Gen 2 and then use Serverless Synapse for SQL queries then you don't pay for anything on a 24/7 basis. When under heavy load, everything scales automatically.

The other benefit is that Parquet files are compressed, whereas Postgres doesn't compress table data on disk (apart from TOAST compression of individual oversized values).
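For what it's worth, the codec is even selectable per write in Spark; a sketch with made-up paths (Snappy is the default if you set nothing):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.read.parquet("lake/events/")  # illustrative source

    # gzip trades CPU for a smaller footprint than the default Snappy.
    df.write.option("compression", "gzip").mode("overwrite") \
        .parquet("lake/events_gzip/")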

The downside is "latency" (probably not the right term, but it's how I think of it). If you want to query a small amount of data then, in my experience, the file + serverless approach is slower than a well-indexed clustered or partitioned Postgres table. Additionally, coming from the server model, it's really hard to forecast your bill under the serverless model; there are definitely usage patterns where serverless is going to be more expensive than a dedicated server, in particular if you run a lot of queries that have to read all or most of the data.

It's easier and faster to save a Parquet file than to do a lot of inserts. This is a double-edged sword, because the DB guarantees ACID properties whereas saving Parquet files doesn't.
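Roughly the comparison I mean, as a sketch (the JDBC URL and credentials are placeholders, and the Postgres JDBC driver must be on the Spark classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.read.parquet("staging/events/")  # illustrative source

    # Bulk file write: essentially a parallel copy to storage.
    df.write.mode("append").parquet("lake/events/")

    # Row inserts over the wire: batched, but bounded by the single
    # Postgres server (which, unlike the files, enforces ACID).
    df.write.mode("append").jdbc(
        "jdbc:postgresql://db.example.com:5432/warehouse",
        "events",
        properties={"user": "etl", "password": "...",
                    "driver": "org.postgresql.Driver"},
    )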

Parquet storage optimization is its own task, whereas Postgres has autovacuum. If the data you're consuming is published daily but you want it in a node/attribute/feature partition scheme, then you need to do that manually (perhaps with Spark pools), as in the sketch below.
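That manual step might look like this (a sketch; the partition columns follow the scheme above and the paths are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # The daily drop arrives as many small files; rewrite it into the
    # layout the downstream queries actually want.
    daily = spark.read.parquet("landing/2019-09-18/")

    daily.repartition("node", "attribute") \
        .write.mode("append") \
        .partitionBy("node", "attribute", "feature") \
        .parquet("warehouse/features/")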

answered by Dean MacGregor