 

Is there a way to directly insert data from a Parquet file into a PostgreSQL database?

I'm trying to restore some historic backup files that were saved in Parquet format, and I want to read them once and write the data into a PostgreSQL database.

I know the backup files were saved using Spark, but there is a strict restriction: I can't install Spark on the DB machine, nor read the Parquet files with Spark on a remote machine and write them to the database with spark_df.write.jdbc. Everything needs to happen on the DB machine, without Spark or Hadoop, using only Postgres and Bash scripting.

My file structure is something like:

foo/
    foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    ..
    ..

I expect to read the data and schema from each Parquet folder (like foo), create a table using that schema, and write the data into that table, using only Bash and the Postgres CLI.

asked Nov 10 '19 by Javad Bahoosh


People also ask

Does Postgres support Parquet files?

PostgreSQL does not read Parquet files natively, but they can be exposed as foreign tables through a foreign data wrapper. Modern PostgreSQL (14+) can parallelize access to foreign tables, so even collections of Parquet files can be scanned effectively.
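For example, with the third-party parquet_fdw extension built and installed on the server (it is not part of core PostgreSQL; the server name, column list, and file path below are illustrative assumptions), one backup part could be exposed as a foreign table from Bash roughly like this:

# Sketch only: assumes parquet_fdw is installed and that the column list
# matches the actual Parquet schema of the backup files.
psql "postgresql://username:password@localhost:5432/dbname" <<'SQL'
CREATE EXTENSION IF NOT EXISTS parquet_fdw;
CREATE SERVER IF NOT EXISTS parquet_srv FOREIGN DATA WRAPPER parquet_fdw;
CREATE FOREIGN TABLE foo_part0 (
    id   bigint,
    name text
)
SERVER parquet_srv
OPTIONS (filename '/data/foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet');
SQL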

What is the best way to transfer the data in a PostgreSQL database?

The idea behind the pg_dump method is to generate a file of SQL commands that, when fed back to the server, will recreate the database in the same state it was in at the time of the dump. PostgreSQL provides the utility program pg_dump for this purpose.
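A minimal sketch of that round trip (database names, user, and file name are placeholders):

# Dump the database to a file of SQL commands, then replay it into another database.
pg_dump -U username -d dbname -f dbname_backup.sql
psql -U username -d dbname_restored -f dbname_backup.sql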

What big data platform commonly works with Parquet files?

Apache Parquet works best with interactive and serverless technologies such as AWS Athena, Amazon Redshift Spectrum, Google BigQuery, and Google Dataproc. Parquet data can be compressed with one of several available codecs, so different data files can be compressed differently.


1 Answer

You can use Spark (on a machine where it is available) to convert the Parquet files to CSV format, then move the CSV files to the DB machine and import them with any tool.

# Step 1 (on a machine that has Spark): convert the Parquet files to CSV.
spark.read.parquet("...").write.csv("...")

# Step 2 (on the DB machine): load the CSV with pandas and write it to Postgres.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('mypath.csv')
df.columns = [c.lower() for c in df.columns]  # Postgres doesn't like capitals or spaces

engine = create_engine('postgresql://username:password@localhost:5432/dbname')
df.to_sql("my_table_name", engine)
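If the final import has to happen with only Bash and the Postgres CLI, as the question requires, the CSV parts can also be bulk-loaded with \copy instead of pandas/sqlalchemy. A minimal sketch, assuming the target table my_table_name already exists with a matching column list and the CSV files carry a header row (paths and credentials are placeholders):

# Load every CSV part produced by Spark into the existing table.
for f in /backups/foo_csv/part-*.csv; do
    psql "postgresql://username:password@localhost:5432/dbname" \
         -c "\copy my_table_name FROM '$f' WITH (FORMAT csv, HEADER true)"
done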
answered Oct 12 '22 by Moein Hosseini