
How to copy and convert parquet files to csv

I have access to an HDFS file system and can see Parquet files with

hadoop fs -ls /user/foo

How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.

asked Sep 09 '16 21:09 by graffe

2 Answers

If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:

insert overwrite local directory 'dirname'
  row format delimited fields terminated by ','
  select * from tablename;

Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
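Once the export finishes, the directory contains plain comma-delimited text files (Hive typically names them `000000_0`, `000001_0`, and so on) that you can read with ordinary tools. As a quick sanity check, here is a sketch using only the standard library; the export directory is simulated with a temporary directory and a hand-written part file, since the real path depends on your query:

```python
import csv
import glob
import os
import tempfile

# Hypothetical stand-in for the directory the Hive query wrote to.
export_dir = tempfile.mkdtemp()

# Simulate one part file of the export (Hive names them 000000_0, 000001_0, ...).
with open(os.path.join(export_dir, "000000_0"), "w") as f:
    f.write("1,alice\n2,bob\n")

# Read every part file in order and collect the records.
rows = []
for part in sorted(glob.glob(os.path.join(export_dir, "*"))):
    with open(part, newline="") as f:
        rows.extend(csv.reader(f))

print(rows)
```

Reading the parts in sorted order preserves the order Hive wrote them in; concatenate them (`cat dirname/* > out.csv`) if you want a single file.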

answered Nov 14 '22 14:11 by Zoltan

A more dynamic approach, for when you don't know the exact name of your parquet file, is to glob for the files and convert each one:

import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = spark.read.parquet(filename)
    # mode="append" so later files don't fail because the destination exists
    df.write.csv("[destination]", mode="append")
    print("csv generated")
answered Nov 14 '22 14:11 by Yusuf Hassan