
Writing a csv with column names and reading a csv file generated from a SparkSQL dataframe in PySpark

I have started the shell with the Databricks csv package:

#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0 

Then I read a csv file, did some groupby ops, and dumped the result to a csv.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv')
# it has columns, and df.columns works fine
type(df)  # <class 'pyspark.sql.dataframe.DataFrame'>

# now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
# it creates a directory my.csv with 2 partitions

# to create a single file I followed the below line of code
# df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv")
# this creates one partition in a directory of the csv name
# but in both cases there is no column information (how to add column names to that csv file???)

# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any columns in that.. the 1st row becomes the column names
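(For reference, the elided groupby step might look something like the following; the column names 'category' and 'value' are purely hypothetical, since the actual schema isn't shown in the question.)

# hypothetical groupby: sum 'value' per 'category'
grouped = df.groupBy('category').agg({'value': 'sum'})
grouped.columns  # column names survive the aggregation, e.g. ['category', 'sum(value)']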

Please don't answer with suggestions like adding a schema to the dataframe after reading, or specifying the column names while reading.

Question 1: when writing the csv dump, is there any way to include the column names?

Question 2: is there a way to create a single csv file (not a directory again) which can be opened by MS Office or Notepad++?

Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to deal with to_csv into a single file in a clustered environment, that would be a great help.

asked Jul 27 '16 by Satya




1 Answer

Try

df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
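Spelled out end to end, it might look like this (a minimal sketch assuming Spark 1.6 with the spark-csv package on the classpath; the paths are hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# read with header='true' so the first row becomes the column names
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('input.csv')

# coalesce to one partition and write the header out with the data
df.coalesce(1).write.format('com.databricks.spark.csv').save('output.csv', header='true')

# reading it back now picks up the column names from the header row
df_new = sqlContext.read.format('com.databricks.spark.csv').option('header', 'true').load('output.csv')
print(df_new.columns)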

Note that this may not be an issue on your current setup, but on extremely large datasets you can run into memory problems on the single executor that ends up holding all the data. This will also take longer (in a cluster scenario), as everything has to be pushed back to a single location.
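Also, save() still produces a directory with a single part file inside, rather than a standalone file. If you need one file you can open directly (Question 2), one common workaround is to move the part file out afterwards. A minimal sketch using the standard library, assuming the output landed on the local filesystem (on HDFS you would reach for hadoop fs -getmerge or FileUtil.copyMerge instead):

import glob
import shutil

# after df.coalesce(1).write...save('path+my.csv', header='true'),
# the output directory holds exactly one part file
part_file = glob.glob('path+my.csv/part-*')[0]  # hypothetical path
shutil.move(part_file, 'my.csv')  # a real single file, openable in MS Office / Notepad++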

answered Sep 29 '22 by Mike Metzger