Perhaps this is well documented, but I am very confused about how to do this (there are many Apache tools).
When I create an SQL table, I create the table using the following commands:
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
.....
columnN datatype,
PRIMARY KEY( one or more columns )
);
How does one convert this existing table into Parquet? Is the file written to disk? If the original data is several GB, how long does one have to wait?
Could I format the original raw data into Parquet format instead?
Apache Spark can be used to do this:
1. Load your table from MySQL via JDBC.
2. Save it as a Parquet file.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE", properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
df.write.parquet("YOUR_HDFS_FILE")
The odbc2parquet command line tool might also be helpful in some situations.
odbc2parquet \
-vvv \ # Log output, good to know it is still doing something during large downloads
query \ # Subcommand for accessing data and storing it
--connection-string ${ODBC_CONNECTION_STRING} \
--batch-size 100000 \ # Batch size in rows
--batches-per-file 100 \ # Omit to store the entire query in a single file
out.par \ # Path to output parquet file
"SELECT * FROM YourTable"