
Why Parquet over an RDBMS like Postgres?

I'm building a data architecture for my company: a simple ETL pipeline over internal and external data, with the aim of producing static dashboards and other tools for trend analysis.

I'm thinking through each step of the ETL process one by one, and now I'm questioning the Load part.

I plan to use Spark (LocalExecutor in dev and a managed service on Azure in production), so I started thinking about writing Parquet files to a blob storage service. I know all the advantages of Parquet over CSV and other storage formats, and I really love this piece of technology. Most of the articles I read about Spark finish with a df.write.parquet(...).
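Concretely, the Load step I have in mind is something like this (a rough sketch; the storage account, container, and paths are placeholders):

    from pyspark.sql import SparkSession

    # Local master in dev; in production this would run on the Azure Spark service.
    spark = SparkSession.builder.master("local[*]").appName("etl-load").getOrCreate()

    # Whatever the Transform step produced.
    df = spark.read.json("staging/events/")

    # abfss:// targets Azure Data Lake Storage Gen2; the container and
    # account names here are made up.
    df.write.mode("append").parquet(
        "abfss://datalake@myaccount.dfs.core.windows.net/events/"
    )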

But I can't figure out why I shouldn't just spin up a Postgres instance and save everything there. I understand that we are not producing 100 GB of data per day, but I want to build something future-proof in a fast-growing company that is going to produce exponentially more data, both from the business and from the logs and metrics we are starting to record more and more of.

Any pros/cons from more experienced devs?

EDIT: What also makes me question this is this tweet: https://twitter.com/markmadsen/status/1044360179213651968

asked Sep 18 '19 by Ragnar


People also ask

What is the advantage of a Parquet file?

Parquet is optimized for working with complex data in bulk and offers several efficient compression schemes and encoding types. This approach is especially good for queries that need to read only certain columns from a large table: Parquet reads just the needed columns, greatly minimizing I/O (see the sketch after these questions).

Why is PostgreSQL better than other databases?

PostgreSQL has a lot of capability. Built using an object-relational model, it supports complex structures and a breadth of built-in and user-defined data types. It provides extensive data capacity and is trusted for its data integrity.

Why does Parquet perform better than other file formats?

Parquet has higher execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.

Why is Parquet faster?

In short, being column-oriented, Parquet brings all the efficient storage characteristics (e.g., blocks, row groups, column chunks) to the table. Apache Parquet is built from the ground up using Google's record shredding and assembly algorithm, and Parquet files were designed with complex nested data structures in mind.
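To make the last two answers concrete, here is a minimal PySpark sketch (column names and paths are illustrative). Only the selected columns, one of them a nested field, are actually read from disk:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Nested records: Parquet stores address.city as its own column chunk.
    rows = [Row(user_id=1, address=Row(city="Paris", zip="75001")),
            Row(user_id=2, address=Row(city="Lyon", zip="69001"))]
    spark.createDataFrame(rows).write.mode("overwrite").parquet("tmp/users/")

    # Only the two selected columns are decoded; every other column's
    # pages are skipped entirely.
    spark.read.parquet("tmp/users/").select("user_id", "address.city").show()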




3 Answers

The main trade-off is one of cost and transactional semantics.

Using a DBMS means you can load data transactionally, but you also pay for both storage and compute on an ongoing basis. Storing the same amount of data is going to be more expensive in a managed DBMS than in a blob store.
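For example, a daily reload into Postgres can be made all-or-nothing; a rough sketch using psycopg2, with an invented table and invented connection details:

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect("dbname=warehouse user=etl host=db.example.com")

    try:
        with conn:  # one transaction: commit on success, rollback on any error
            with conn.cursor() as cur:
                cur.execute("DELETE FROM daily_metrics WHERE day = %s",
                            ("2019-09-18",))
                cur.executemany(
                    "INSERT INTO daily_metrics (day, metric, value)"
                    " VALUES (%s, %s, %s)",
                    [("2019-09-18", "signups", 42),
                     ("2019-09-18", "orders", 17)],
                )
        # Readers never see a half-reloaded day.
    finally:
        conn.close()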

It is also harder to scale out processing on a DBMS (it appears the largest Postgres server Azure offers has 64 vCPUs). By storing data in an RDBMS you are likely to run up against I/O or compute bottlenecks more quickly than you would with Spark + blob storage. However, for many datasets this might not be an issue, and as the tweet points out, if you can accomplish everything inside the DB with SQL then it is a much simpler architecture.

If you store Parquet files on a blob store, updating existing data is difficult without regenerating a large segment of your data (and, while I don't know the details of Azure, this generally can't be done transactionally). On the other hand, compute costs stay separate from storage costs.
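The usual workaround is to rewrite whole partitions rather than individual rows; roughly like this sketch (the partition column and paths are invented), keeping in mind that a failure mid-write can still leave partial files behind:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Only replace the partitions present in the incoming data,
    # not the whole dataset.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    updated = spark.read.parquet("staging/corrections/")

    updated.write.mode("overwrite") \
        .partitionBy("event_date") \
        .parquet("lake/events/")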

answered by Micah Kornfield


Storing data in Hadoop using raw file formats is terribly inefficient. Parquet is a columnar file format well suited for querying large amounts of data quickly. As you said above, writing data to Parquet from Spark is pretty easy. Writing data with a distributed processing engine (Spark) to a distributed file system (Parquet + HDFS) also makes the entire flow seamless. This architecture is well suited for OLAP-type data.

Postgres, on the other hand, is a relational database. While it is good for storing and analyzing transactional data, it cannot be scaled horizontally as easily as HDFS can. Hence, when writing or querying large amounts of data from Spark to/on Postgres, the database can become a bottleneck. But if the data you are processing is OLTP-type, then you can consider this architecture.

Hope this helps

answered by Pushkin


One of the issues I have with a dedicated Postgres server is that it's a fixed resource that's on 24/7. If it's idle for 22 hours per day and under heavy load for 2 hours per day (in particular if those hours aren't contiguous and are unpredictable), then any fixed server size is too small during those 2 hours and too big during the other 22.

If you store your data as parquet on Azure Data Lake Gen 2 and then use Serverless Synapse for SQL queries then you don't pay for anything on a 24/7 basis. When under heavy load, everything scales automatically.

The other benefit is that Parquet files are compressed, whereas Postgres doesn't compress table data on disk (apart from TOAST compression of individual oversized values).
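For what it's worth, the codec is even selectable per write in Spark; a sketch with made-up paths (Snappy is the default if you set nothing):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.read.parquet("lake/events/")  # illustrative source

    # gzip trades CPU for a smaller footprint than the default Snappy.
    df.write.option("compression", "gzip").mode("overwrite") \
        .parquet("lake/events_gzip/")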

The downside is "latency" (probably not the right term, but it's how I think of it). If you want to query a small amount of data then, in my experience, the file + serverless approach is slower than a well-indexed clustered or partitioned Postgres table. Additionally, coming from the server model, it's really hard to forecast your bill under the serverless model; there are definitely usage patterns where serverless is going to be more expensive than a dedicated server, in particular if you run a lot of queries that have to read all or most of the data.

It's easier and faster to save a Parquet file than to do a lot of inserts. This is a double-edged sword, because the DB guarantees ACID properties whereas saving Parquet files doesn't.
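Roughly the comparison I mean, as a sketch (the JDBC URL and credentials are placeholders, and the Postgres JDBC driver must be on the Spark classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.read.parquet("staging/events/")  # illustrative source

    # Bulk file write: essentially a parallel copy to storage.
    df.write.mode("append").parquet("lake/events/")

    # Row inserts over the wire: batched, but bounded by the single
    # Postgres server (which, unlike the files, enforces ACID).
    df.write.mode("append").jdbc(
        "jdbc:postgresql://db.example.com:5432/warehouse",
        "events",
        properties={"user": "etl", "password": "...",
                    "driver": "org.postgresql.Driver"},
    )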

Parquet storage optimization is its own task, whereas Postgres has autovacuum. If the data you're consuming is published daily but you want it in a node/attribute/feature partition scheme, then you need to do that manually (perhaps with Spark pools), as in the sketch below.
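That manual step might look like this (a sketch; the partition columns follow the scheme above and the paths are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # The daily drop arrives as many small files; rewrite it into the
    # layout the downstream queries actually want.
    daily = spark.read.parquet("landing/2019-09-18/")

    daily.repartition("node", "attribute") \
        .write.mode("append") \
        .partitionBy("node", "attribute", "feature") \
        .parquet("warehouse/features/")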

answered by Dean MacGregor