
Convert CSV to Parquet file using Python

I am trying to convert a .csv file to a .parquet file.
The CSV file (Temp.csv) has the following format:

1,Jon,Doe,Denver 

I am using the following Python code to convert it to Parquet:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    schema = StructType([
            StructField("col1", IntegerType(), True),
            StructField("col2", StringType(), True),
            StructField("col3", StringType(), True),
            StructField("col4", StringType(), True)])
    dirname = os.path.dirname(os.path.abspath(__file__))
    csvfilename = os.path.join(dirname, 'Temp.csv')
    rdd = sc.textFile(csvfilename).map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)
    parquetfilename = os.path.join(dirname, 'output.parquet')
    df.write.mode('overwrite').parquet(parquetfilename)

The result is only a folder named output.parquet, not the Parquet file I'm looking for, followed by the following error on the console.

[Screenshot: CSV to Parquet error]

I have also tried running the following code, which runs into a similar issue.

from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
dirname = os.path.dirname(os.path.abspath(__file__))
csvfilename = os.path.join(dirname, 'Temp.csv')
df = spark.read.csv(csvfilename)

# Displays the content of the DataFrame to stdout
df.show()
parquetfilename = os.path.join(dirname, 'output.parquet')
df.write.mode('overwrite').parquet(parquetfilename)

What is the best way to do this? I am using Windows and Python 2.7.

inquisitiveProgrammer asked May 30 '18


1 Answer

Using the packages pyarrow and pandas, you can convert CSVs to Parquet without using a JVM in the background:

import pandas as pd

df = pd.read_csv('example.csv')
df.to_parquet('output.parquet')

One limitation you will run into is that pyarrow is only available for Python 3.5+ on Windows. Either use Linux/OS X to run the code with Python 2, or upgrade your Windows setup to Python 3.6.
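Since the Temp.csv in the question has no header row, you will probably also want to pass explicit column names when reading it. A minimal sketch, assuming the four columns from the question's schema (col1 through col4) and pyarrow as the engine:

import pandas as pd

# Temp.csv has no header row, so supply the column names explicitly
# (col1..col4 mirror the schema used in the question).
df = pd.read_csv('Temp.csv', header=None,
                 names=['col1', 'col2', 'col3', 'col4'])

# Write a single Parquet file, using pyarrow as the engine.
df.to_parquet('output.parquet', engine='pyarrow')

Unlike the Spark approach, this produces a single output.parquet file rather than a directory of part files.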

Uwe L. Korn answered Sep 21 '22