pyspark partitioning data using partitionby

I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100) it will partition my data by key into 100 parts, i.e. data associated with similar keys will be grouped together.

  1. Is my understanding correct?
  2. Is it advisable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
  3. What if my data is not in key, value format? Can I still use this function?
  4. Let's say my data is serial_number_of_student,student_name. In this case, can I partition my data by student_name instead of the serial_number?
asked Mar 13 '16 by user2543622


People also ask

How do I partition data in PySpark?

Partition in memory: you can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations. Partition on disk: while writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
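
As a rough sketch of both variants (the toy data, the country column, and the /tmp output path are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "US"), (2, "DE"), (3, "US")], ["id", "country"]
    )

    # Partition in memory: redistribute rows into 8 partitions, hashing on "country".
    df_mem = df.repartition(8, "country")

    # Partition on disk: one sub-directory per distinct "country" value.
    df_mem.write.partitionBy("country").parquet("/tmp/events_by_country")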

Is repartition faster than coalesce?

repartition redistributes the data evenly, but at the cost of a shuffle. coalesce works much faster when you reduce the number of partitions because it sticks input partitions together, but it doesn't guarantee uniform data distribution. coalesce behaves like a repartition when you increase the number of partitions.
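
A small sketch contrasting the two, assuming a toy DataFrame that starts with 200 partitions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).repartition(200)   # toy DataFrame with 200 partitions

    shrunk_fast = df.coalesce(10)      # merges existing partitions, avoids a full shuffle
    shrunk_even = df.repartition(10)   # full shuffle, but roughly uniform partition sizes

    print(shrunk_fast.rdd.getNumPartitions())   # 10
    print(shrunk_even.rdd.getNumPartitions())   # 10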

How many partitions should I use PySpark?

The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound, each task should take at least 100 ms to execute.
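
One hedged way to apply that heuristic, assuming spark.sparkContext.defaultParallelism roughly reflects the cores available to the application:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # defaultParallelism is typically the total number of cores in the cluster.
    cores = spark.sparkContext.defaultParallelism

    # Apply the ~4x-cores rule of thumb to DataFrame shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", cores * 4)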


1 Answer

  1. Not exactly. Spark, including PySpark, uses hash partitioning by default. Beyond identical keys landing in the same partition, there is no practical similarity between the keys assigned to a single partition (see the sketch at the end of this answer).
  2. There is no simple answer here. It all depends on the amount of data and the available resources. Too many or too few partitions will degrade performance.

    Some resources claim the number of partitions should be around twice the number of available cores. On the other hand, a single partition typically shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).

    Finally, you have to correct for potential data skew. If some keys are overrepresented in your dataset, it can result in suboptimal resource usage and potential failures.

  3. No, or at least not directly. You can use the keyBy method to convert an RDD to the required format, as in the sketch below. Moreover, any Python object can be treated as a key-value pair as long as it implements the required methods to make it behave like an Iterable of length two. See How to determine if object is a valid key-value pair in PySpark

  4. It depends on the types. As long as the key is hashable*, then yes. Typically that means it has to be an immutable structure, and all the values it contains have to be immutable as well. For example, a list is not a valid key, but a tuple of integers is.

* To quote the Python glossary:

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.
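
To make points 1, 3 and 4 concrete, here is a minimal sketch built around the student example from the question. The field values and the choice of four partitions are made up for illustration:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # 3. Plain records are not key-value pairs; keyBy builds (key, value) tuples.
    students = sc.parallelize([(1, "alice"), (2, "bob"), (3, "alice")])
    by_name = students.keyBy(lambda rec: rec[1])          # key = student_name

    # 1. partitionBy uses hash partitioning by default: equal keys always land
    #    in the same partition, but otherwise keys in one partition share nothing.
    partitioned = by_name.partitionBy(4)

    # Peek at which keys ended up in which partition.
    print(partitioned.glom().map(lambda part: [k for k, _ in part]).collect())

    # 4. Keys must be hashable: a tuple of integers is fine, a list would fail.
    ok = sc.parallelize([((1, 2), "row-a"), ((3, 4), "row-b")]).partitionBy(2)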

answered Sep 26 '22 by zero323