 

Spark 2.0 read csv number of partitions (PySpark)

I'm trying to port some code from Spark 1.6 to Spark 2.0 using the new features in Spark 2.0. In particular, I want to use the Spark 2.0 csv reader. BTW, I'm using pyspark.

With the "old" textFile function, I'm able to set the minimum number of partitions. Ex:

file = sc.textFile('/home/xpto/text.csv', minPartitions=10)
header = file.first()  # extract header
data = file.filter(lambda x: x != header)  # csv without header
...

Now, with Spark 2.0 I can read the csv directly:

df = spark.read.csv('/home/xpto/text.csv', header=True)
...

But I haven't found a way to set minPartitions.

I need this to test the performance of my code.

Thx, Fred

asked Jun 30 '16 by Frederico Oliveira


People also ask

How can you tell how many partitions a PySpark DataFrame has?

In PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on an RDD; to use it with a DataFrame, first convert the DataFrame to its underlying RDD.

How do I decide how many partitions to use in Spark?

A common rule of thumb is to make the number of partitions in an RDD equal to the number of cores in the cluster, so that all partitions are processed in parallel and the resources are utilized in an optimal way.

How would you get the number of partitions of a DataFrame DF?

Convert the DataFrame to an RDD and check its partitions: in Scala, df.rdd.partitions.size; in PySpark, df.rdd.getNumPartitions(). For a DataFrame produced by a shuffle you would typically see 200 partitions, the default value of spark.sql.shuffle.partitions.
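
As a minimal PySpark sketch of the above (the path '/home/xpto/text.csv' is just the example file from the question, and the count you get depends on your data and cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('partition-count').getOrCreate()

# Read the csv with the DataFrameReader (no minPartitions option here)
df = spark.read.csv('/home/xpto/text.csv', header=True)

# DataFrames don't expose the partition count directly;
# convert to the underlying RDD and ask it
print(df.rdd.getNumPartitions())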


2 Answers

The short answer is no: when using a DataFrameReader there is no option equivalent to the minPartitions parameter.

coalesce may be used in this case to reduce the partition count, and repartition may be used to increase it. When coalescing, downstream performance may be better if you force a shuffle, especially with skewed data. Note that the shuffle flag exists only on the RDD API, e.g. rdd.coalesce(100, shuffle=True); DataFrame.coalesce takes no such parameter, so to force a shuffle on a DataFrame use repartition instead. Forcing a shuffle moves all of the data, which carries cost implications similar to repartition.

Note that these operations generally do not preserve the original read order of the file (except coalesce without a shuffle), so if a portion of your code depends on the dataset's order, you should avoid a shuffle prior to that point.
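
As an illustrative sketch of the pattern described above (path and partition counts are example values, not requirements):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('/home/xpto/text.csv', header=True)

# Reduce the partition count without a shuffle
fewer = df.coalesce(10)

# Increase (or rebalance) the partition count; always shuffles
more = df.repartition(300)

# On the RDD API, coalesce can be asked to shuffle
rdd_balanced = df.rdd.coalesce(100, shuffle=True)

print(fewer.rdd.getNumPartitions(),
      more.rdd.getNumPartitions(),
      rdd_balanced.getNumPartitions())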

answered Sep 24 '22 by Vijay Krishna


I figured it out. DataFrames (and RDDs) have a method called coalesce that lets you set the number of partitions.

Ex:

>>> df = spark.read.csv('/home/xpto/text.csv', header=True).coalesce(2)
>>> df.rdd.getNumPartitions()
2

In my case, Spark split my file into 153 partitions. I'm able to reduce the number of partitions to 10, but when I try to set it to 300, it ignores that and keeps the 153 (coalesce can only decrease the number of partitions; to increase it you need repartition, which performs a full shuffle).

REF: https://spark.apache.org/docs/2.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame.coalesce
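
A quick sketch of the difference, assuming the same example file (the partition counts shown are just from my run and will vary with your data and cluster):

>>> df = spark.read.csv('/home/xpto/text.csv', header=True)
>>> df.rdd.getNumPartitions()   # whatever Spark chose at read time
153
>>> df.coalesce(300).rdd.getNumPartitions()   # coalesce can't go above the current count
153
>>> df.repartition(300).rdd.getNumPartitions()   # repartition shuffles and can increase it
300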

answered Sep 23 '22 by Frederico Oliveira