I have a DataFrame (that gets converted to an RDD) and would like to repartition it so that each key (the first column) has its own partition. This is what I did:
# Repartition into len(keys) partitions and map each row to a partition given its key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))
However, when I try to map it back to a DataFrame or save it, I get this error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
for obj in iterator:
File "spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1703, in add_shuffle_key
for k, v in iterator:
ValueError: too many values to unpack
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
A bit more testing revealed that even this causes the same error: my_rdd = df.rdd.partitionBy(x) # x can be 5, 100, etc.
Has anyone encountered this before? If so, how did you solve it?
partitionBy requires a PairwiseRDD, which in Python means an RDD of tuples (or lists) of length 2, where the first element is the key and the second is the value. partitionFunc takes the key and maps it to a partition number. When you use it on an RDD[Row], it tries to unpack each row into a key and a value and fails:
from pyspark.sql import Row
row = Row(1, 2, 3)
k, v = row
## Traceback (most recent call last):
## ...
## ValueError: too many values to unpack (expected 2)
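By contrast, partitionBy behaves as intended on an RDD of (key, value) pairs, where partitionFunc receives only the key. A minimal sketch (assuming sc is an existing SparkContext, e.g. from the pyspark shell, and using toy integer keys purely for illustration):
# A toy pair RDD: the first element of each tuple is the key, the second is the value.
pairs = sc.parallelize([(0, "a"), (1, "b"), (2, "c")])
# partitionFunc is applied to the key only, so key k lands in partition k here.
partitioned = pairs.partitionBy(3, partitionFunc=lambda k: k)
print(partitioned.glom().collect())
## [[(0, 'a')], [(1, 'b')], [(2, 'c')]]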
Even if you provide correct data, for example like this:
my_rdd = df.rdd.map(lambda row: (int(row[0]), row)).partitionBy(len(keys))
it wouldn't really make sense. Partitioning is not particularly meaningful in the case of DataFrames. See my answer to How to define partitioning of DataFrame? for more details.
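As a rough sketch of the DataFrame-level alternatives that answer covers (assuming the key column is literally named "key" and the output path is a placeholder; repartitioning by column requires Spark 1.6+, whereas this question runs 1.5.1):
# Spark 1.6+: hash-partition the DataFrame by the "key" column, so rows with
# the same key land in the same partition (not necessarily one partition per key).
df_by_key = df.repartition(10, "key")
# Spark 1.4+: write one output directory per distinct value of "key".
df.write.partitionBy("key").parquet("/tmp/partitioned_output")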