
How to copy and convert parquet files to csv

I have access to an HDFS file system and can see Parquet files with

hadoop fs -ls /user/foo

How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.

asked Sep 09 '16 21:09 by graffe

2 Answers

If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:

insert overwrite local directory 'dirname'
  row format delimited fields terminated by ','
  select * from tablename;

Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
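Once the export finishes, the directory contains plain comma-delimited text files (Hive typically names them `000000_0`, `000001_0`, and so on) that you can read with ordinary tools. As a quick sanity check, here is a sketch using only the standard library; the export directory is simulated with a temporary directory and a hand-written part file, since the real path depends on your query:

```python
import csv
import glob
import os
import tempfile

# Hypothetical stand-in for the directory the Hive query wrote to.
export_dir = tempfile.mkdtemp()

# Simulate one part file of the export (Hive names them 000000_0, 000001_0, ...).
with open(os.path.join(export_dir, "000000_0"), "w") as f:
    f.write("1,alice\n2,bob\n")

# Read every part file in order and collect the records.
rows = []
for part in sorted(glob.glob(os.path.join(export_dir, "*"))):
    with open(part, newline="") as f:
        rows.extend(csv.reader(f))

print(rows)
```

Reading the parts in sorted order preserves the order Hive wrote them in; concatenate them (`cat dirname/* > out.csv`) if you want a single file.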

answered Nov 14 '22 14:11 by Zoltan

A more dynamic approach, for when you don't know the exact name of your parquet file, is to glob for the files and convert each one:

import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = spark.read.parquet(filename)
    # mode="append" so later files don't fail because the destination exists
    df.write.csv("[destination]", mode="append")
    print("csv generated")
answered Nov 14 '22 14:11 by Yusuf Hassan