I am analyzing on-time performance records of US domestic flights from 2015. I need to group the flights by tail number and store a date-sorted list of the flights for each tail number in a database, to be retrieved by my application. I am not sure which of the two options below is the better way to achieve this.
# Load the parquet file
on_time_dataframe = sqlContext.read.parquet('../data/on_time_performance.parquet')
# Filter down to the fields we need to identify and link to a flight
flights = on_time_dataframe.rdd.map(lambda x:
    (x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum)
)
I can achieve this in a reduce sort...
# Group flights by tail number, sorted by date, then flight number, then origin/dest
flights_per_airplane = flights\
    .map(lambda nameTuple: (nameTuple[5], [nameTuple]))\
    .reduceByKey(lambda a, b: sorted(a + b, key=lambda x: (x[1], x[2], x[3], x[4])))
Or I can achieve it in a subsequent map job...
# Do the same in a subsequent map step. Is this more efficient, or does PySpark
# know how to optimize the above?
flights_per_airplane = flights\
    .map(lambda nameTuple: (nameTuple[5], [nameTuple]))\
    .reduceByKey(lambda a, b: a + b)\
    .map(lambda kv: (kv[0], sorted(kv[1], key=lambda x: (x[1], x[2], x[3], x[4]))))
Doing the sort inside the reduce seems really inefficient, but in fact both versions are very slow. sorted() looks like the recommended way to do this in the PySpark docs, so I'm wondering whether PySpark optimizes this internally. Which option is more efficient, or the better choice for some other reason?
My code is also in a gist here: https://gist.github.com/rjurney/af27f70c76dc6c6ae05c465271331ade
If you're curious about the data, it is from the Bureau of Transportation Statistics, here: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Some background first. PySpark's reduceByKey() transformation merges the values for each key using an associative and commutative reduce function. It operates on a pair RDD (key/value pairs) and is a wide transformation: values are partially combined on the map side, then the partial results are shuffled across partitions by key.
On the DataFrame side, sort() and orderBy() are aliases: both sort by one or more columns, ascending by default, and both perform a global sort that requires a shuffle. If per-partition order is enough and no global ordering guarantee is needed, sortWithinPartitions() sorts each partition individually and avoids that shuffle.
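As a minimal, hedged illustration of reduceByKey on a pair RDD (assuming an active SparkContext bound to the name sc, as in a standard PySpark shell; the tail numbers are made up):
# Each value is combined per partition first, then the partial results are
# shuffled by key and merged, yielding one (key, merged_value) pair per key.
pairs = sc.parallelize([("N123AA", 1), ("N456BB", 1), ("N123AA", 1)])
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.collect()  # e.g. [('N456BB', 1), ('N123AA', 2)]; output order is not guaranteed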
Unfortunately both approaches are wrong before you even start sorting, and there is no effective and simple way of doing this in Spark. Still, the first one is significantly worse than the second.
Why are both approaches wrong? Because each of them is just another groupByKey, and that is simply an expensive operation. There are some things you can try to improve matters (in particular, avoiding the map-side reduction), but at the end of the day you have to pay the price of a full shuffle, and if you don't see any failures it is probably not worth the fuss.
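To make that concrete, here is a hedged sketch that reuses the question's flights RDD: the list-building reduceByKey produces the same (tail number, list of flights) grouping as a plain groupByKey, so both pay for shuffling every record.
# Both pipelines group every flight under its tail number; the list-concatenation
# reduce is a groupByKey in disguise, so the shuffle cost is the same.
grouped_via_reduce = flights.map(lambda t: (t[5], [t])).reduceByKey(lambda a, b: a + b)
grouped_via_group = flights.keyBy(lambda t: t[5]).groupByKey().mapValues(list)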
Still, the second approach is much better algorithmically*. If you want to keep a sorted structure all the way through, as in the first attempt, you should use dedicated tools (aggregateByKey with bisect.insort would be a good choice), but there is really nothing to gain here.
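For completeness, here is a hedged sketch of that aggregateByKey / bisect.insort idea, again reusing the question's flights RDD. The decorated (sort_key, flight) tuples and the helper names insert_flight and merge_sorted are my own additions, chosen so that the natural tuple ordering matches the desired (FlightDate, FlightNum, Origin, Dest) order:
import bisect
import heapq

def insert_flight(acc, flight):
    # Keep each per-plane list sorted as records arrive; the flight[1:5] slice
    # (FlightDate, FlightNum, Origin, Dest) drives the ordering.
    bisect.insort(acc, (flight[1:5], flight))
    return acc

def merge_sorted(acc1, acc2):
    # Merge two already-sorted partial lists coming from different partitions.
    return list(heapq.merge(acc1, acc2))

flights_per_airplane = (flights
    .keyBy(lambda x: x[5])
    .aggregateByKey([], insert_flight, merge_sorted)
    .mapValues(lambda decorated: [flight for _, flight in decorated]))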
If the grouped output is a hard requirement, the best thing you can do is keyBy, groupByKey and sort. It won't improve performance over the second solution, but it will arguably improve readability:
(flights
    .keyBy(lambda x: x[5])
    .groupByKey()
    .mapValues(lambda vs: sorted(vs, key=lambda x: x[1:5])))
* Even if you assume the best-case scenario for Timsort, the first approach is N times O(N), i.e. O(N²) overall, while the second one is O(N log N) in the worst case.
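If it helps, here is a small hedged usage sketch of that grouped pipeline, just assigning it and pulling one record to check the output shape (the variable names are mine):
# Materialize one (TailNum, sorted list of flight tuples) pair to sanity-check
# the shape of the grouped output before writing it to the database.
flights_per_airplane = (flights
    .keyBy(lambda x: x[5])
    .groupByKey()
    .mapValues(lambda vs: sorted(vs, key=lambda x: x[1:5])))
tail_number, plane_flights = flights_per_airplane.first()
print(tail_number, plane_flights[:3])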