I have started the shell with the databricks csv package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a csv file, did a groupBy operation, and dumped the result to a csv.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv')
# it has columns and df.columns works fine
type(df)  # <class 'pyspark.sql.dataframe.DataFrame'>

# now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
# it creates a directory my.csv with 2 partitions

# To create a single file I followed the line of code below
# df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv")
# this creates one partition in the directory of the csv name,
# but in both cases there is no column information (how do I add column names to that csv file???)

# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any columns in it... the 1st data row becomes the column names
Please don't answer by suggesting that I add a schema to the dataframe after reading, or that I specify the column names while reading.
Question 1: when writing the csv dump, is there any way I can add the column names to it?
Question 2: is there a way to create a single csv file (not a directory again) which can be opened by MS Office or Notepad++?
While reading the CSV file, you can rename the column headers by using the names parameter, which takes the list of new column names. To avoid the old header being inferred as a data row, you can also provide the header parameter, which overrides the old header names with the new ones.
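As a rough PySpark sketch of that idea (the column names col_a and col_b and the toDF rename are my own illustration, not part of the original post; as far as I know spark-csv does not expose a names option, so here the rename happens after the read):

# read with the existing header so the first row is not pulled in as data
df = sqlContext.read.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('path.csv')

# rename the columns; toDF takes the new names positionally
df = df.toDF('col_a', 'col_b')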
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
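If you also need a single plain file rather than a directory (Question 2), one possible follow-up is to move the lone part file out of the output directory afterwards. This is only a sketch and assumes the output directory is on the local filesystem (e.g. local mode); the path is the placeholder from above and the final name my_single_file.csv is my own invention:

import glob
import shutil

# coalesce(1) leaves exactly one part-* file inside the output directory
part_file = glob.glob('path+my.csv/part-*')[0]

# rename/move it to an ordinary .csv that Notepad++ or MS Office can open
shutil.move(part_file, 'my_single_file.csv')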