I have access to an HDFS file system and can see the parquet files with
hadoop fs -ls /user/foo
How can I copy those parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.
If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:
insert overwrite local directory 'dirname' row format delimited fields terminated by ',' select * from tablename;
Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
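For instance, assuming a Hive table named my_parquet_table is already defined over the files and /tmp/csv_out is used as the local output directory (both names are just placeholders), the statement would look like:

-- hypothetical names: my_parquet_table for the Hive table, /tmp/csv_out for the local output directory
insert overwrite local directory '/tmp/csv_out'
row format delimited fields terminated by ','
select * from my_parquet_table;

Hive writes one or more delimited files (e.g. 000000_0) into that directory; if you need a single CSV you can concatenate them with something like cat /tmp/csv_out/* > result.csv.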
Since you might not know the exact name of your parquet file(s), a more dynamic snippet would be:
# assumes a PySpark shell or notebook where sqlContext is already defined
import glob

for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)
    # append so that writing several files to the same destination does not fail
    df.write.mode("append").csv("[destination]")
    print("csv generated")