I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.
# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
# Create the triplet so I index stuff
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))
# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])
# Here is where I am stuck
group_list = grouped.map(lambda x: (list(x[1]))) #?
Returns something like:
[[(u'u1', u's1', 20), (u'u1', u's2', 5)], [(u'u2', u's3', 5), (u'u2', u's2', 10)]]
I now want to find the max 'occurrences' for each user. The final result after taking the max would be an RDD that looks like this:
[[(u'u1', u's1', 20)], [(u'u2', u's2', 10)]]
Where only the max row would remain for each of the users in the file. In other words, I want to reduce the RDD so it contains only a single triplet per user: the one with that user's max occurrences.
In PySpark, the maximum row per group can also be selected on a DataFrame, using Window.partitionBy() and running row_number() over the window partition; a DataFrame example is sketched below.
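Here is a minimal sketch of that approach. It assumes the Spark 2.x DataFrame API and hypothetical column names user, item and occurrences; the sample rows simply mirror the data from the question.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("max-per-group").getOrCreate()

# Hypothetical DataFrame mirroring the ('user', 'item', 'occurrences') triplets
df = spark.createDataFrame(
    [(u'u1', u's1', 20.0), (u'u1', u's2', 5.0),
     (u'u2', u's3', 5.0), (u'u2', u's2', 10.0)],
    ['user', 'item', 'occurrences'])

# Number the rows within each user by descending occurrences,
# then keep only the first (i.e. largest) row per user
w = Window.partitionBy('user').orderBy(F.col('occurrences').desc())
max_per_user = (df
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .drop('rn'))

max_per_user.show()
## +----+----+-----------+
## |user|item|occurrences|
## +----+----+-----------+
## |  u1|  s1|       20.0|
## |  u2|  s2|       10.0|
## +----+----+-----------+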
There is no need for groupBy here. A simple reduceByKey would do just fine and will usually be more efficient, since it combines values for each key map-side before shuffling instead of moving every record across the network:
data_file = sc.parallelize([
    (u'u1', u's1', 20), (u'u1', u's2', 5),
    (u'u2', u's3', 5), (u'u2', u's2', 10)])

max_by_group = (data_file
    .map(lambda x: (x[0], x))  # Convert to a pairwise RDD keyed by user
    # Take the maximum of the two arguments by their last element (the occurrence count),
    # equivalent to:
    # lambda x, y: x if x[-1] > y[-1] else y
    .reduceByKey(lambda x1, x2: max(x1, x2, key=lambda x: x[-1]))
    .values())  # Drop keys

max_by_group.collect()
## [('u2', 's2', 10), ('u1', 's1', 20)]
I think I found the solution:
from pyspark import SparkContext, SparkConf

def reduce_by_max(rdd):
    """
    Helper function to find the max value in a list of values i.e. triplets.
    """
    max_val = rdd[0][2]
    the_index = 0
    for idx, val in enumerate(rdd):
        if val[2] > max_val:
            max_val = val[2]
            the_index = idx
    return rdd[the_index]

conf = SparkConf() \
    .setAppName("Collaborative Filter") \
    .set("spark.executor.memory", "5g")
sc = SparkContext(conf=conf)

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
# Create the triplet so I can index stuff
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))
# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])
# Get the values as a list
group_list = grouped.map(lambda x: (list(x[1])))
# Get the max value for each user.
max_list = group_list.map(reduce_by_max).collect()
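As a quick sanity check, here is a sketch that exercises the same helper on an in-memory RDD instead of a real input file, reusing the sample triplets from the reduceByKey answer:
sample = sc.parallelize([
    (u'u1', u's1', 20.0), (u'u1', u's2', 5.0),
    (u'u2', u's3', 5.0), (u'u2', u's2', 10.0)])

max_list = (sample
    .groupBy(lambda r: r[0])      # Group triplets by user
    .map(lambda x: list(x[1]))    # Materialize each group as a list
    .map(reduce_by_max)           # Keep the triplet with the largest occurrence
    .collect())
## [(u'u1', u's1', 20.0), (u'u2', u's2', 10.0)]  (order may vary)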