pyspark partitioning data using partitionby

I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100) it will partition my data by key into 100 parts, i.e. data associated with similar keys will be grouped together.

  1. Is my understanding correct?
  2. Is it advisable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
  3. What if my data is not in key, value format? Can I still use this function?
  4. Let's say my data is serial_number_of_student,student_name. In this case, can I partition my data by student_name instead of the serial_number?
asked Mar 13 '16 by user2543622


People also ask

How do I partition data in PySpark?

Partition in memory: you can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations. Partition on disk: while writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
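
As a rough sketch of both variants (the toy data, the country column, and the /tmp output path are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "US"), (2, "DE"), (3, "US")], ["id", "country"]
    )

    # Partition in memory: redistribute rows into 8 partitions, hashing on "country".
    df_mem = df.repartition(8, "country")

    # Partition on disk: one sub-directory per distinct "country" value.
    df_mem.write.partitionBy("country").parquet("/tmp/events_by_country")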

Is repartition faster than coalesce?

repartition redistributes the data evenly, but at the cost of a shuffle. coalesce works much faster when you reduce the number of partitions because it sticks input partitions together, but it doesn't guarantee uniform data distribution. coalesce behaves like a repartition when you increase the number of partitions.
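
A small sketch contrasting the two, assuming a toy DataFrame that starts with 200 partitions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).repartition(200)   # toy DataFrame with 200 partitions

    shrunk_fast = df.coalesce(10)      # merges existing partitions, avoids a full shuffle
    shrunk_even = df.repartition(10)   # full shuffle, but roughly uniform partition sizes

    print(shrunk_fast.rdd.getNumPartitions())   # 10
    print(shrunk_even.rdd.getNumPartitions())   # 10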

How many partitions should I use PySpark?

The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound, each task should take at least 100 ms to execute.
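
One hedged way to apply that heuristic, assuming spark.sparkContext.defaultParallelism roughly reflects the cores available to the application:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # defaultParallelism is typically the total number of cores in the cluster.
    cores = spark.sparkContext.defaultParallelism

    # Apply the ~4x-cores rule of thumb to DataFrame shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", cores * 4)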


1 Answer

  1. Not exactly. Spark, including PySpark, uses hash partitioning by default. Beyond identical keys landing in the same partition, there is no practical similarity between the keys assigned to a single partition (see the sketch at the end of this answer).
  2. There is no simple answer here. It all depends on the amount of data and the available resources. Too many or too few partitions will degrade performance.

    Some resources claim the number of partitions should be around twice the number of available cores. On the other hand, a single partition typically shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).

    Finally, you have to correct for potential data skew. If some keys are overrepresented in your dataset, it can result in suboptimal resource usage and potential failures.

  3. No, or at least not directly. You can use the keyBy method to convert an RDD to the required format, as in the sketch below. Moreover, any Python object can be treated as a key-value pair as long as it implements the required methods to make it behave like an Iterable of length two. See How to determine if object is a valid key-value pair in PySpark

  4. It depends on the types. As long as the key is hashable*, then yes. Typically that means it has to be an immutable structure, and all the values it contains have to be immutable as well. For example, a list is not a valid key, but a tuple of integers is.

* To quote the Python glossary:

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.
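
To make points 1, 3 and 4 concrete, here is a minimal sketch built around the student example from the question. The field values and the choice of four partitions are made up for illustration:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # 3. Plain records are not key-value pairs; keyBy builds (key, value) tuples.
    students = sc.parallelize([(1, "alice"), (2, "bob"), (3, "alice")])
    by_name = students.keyBy(lambda rec: rec[1])          # key = student_name

    # 1. partitionBy uses hash partitioning by default: equal keys always land
    #    in the same partition, but otherwise keys in one partition share nothing.
    partitioned = by_name.partitionBy(4)

    # Peek at which keys ended up in which partition.
    print(partitioned.glom().map(lambda part: [k for k, _ in part]).collect())

    # 4. Keys must be hashable: a tuple of integers is fine, a list would fail.
    ok = sc.parallelize([((1, 2), "row-a"), ((3, 4), "row-b")]).partitionBy(2)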

answered Sep 26 '22 by zero323