The original dataset is:
# (numbersofrating,title,avg_rating)
newRDD =[(3,'monster',4),(4,'minions 3D',5),....]
I want to select the top N avg_ratings in newRDD. I use the following code, but it has an error.
selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......))
TypeError: map() takes no keyword arguments
The expected output should be:
# (numbersofrating,title,avg_rating)
selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....]
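For reference, here is a minimal working sketch of the sortBy approach the question attempts, assuming an existing SparkContext named sc. The key function goes to sortBy, not map(); map() has no key argument, which is what raises the TypeError above.

# A sketch, assuming an existing SparkContext `sc`; not the asker's exact setup.
newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
# sortBy takes the key function directly; sort descending by avg_rating
selectnewRDD = newRDD.sortBy(lambda x: x[2], ascending=False)
selectnewRDD.take(2)  # [(4, 'minions 3D', 5), (3, 'monster', 4)]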
take(num: int) → List[T] Take the first num elements of the RDD. It works by first scanning one partition, and uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
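For example (a small sketch, assuming a SparkContext sc):

sc.parallelize([1, 2, 3, 4, 5]).take(3)  # [1, 2, 3] - first 3 elements, no sorting applied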
Use the tail() action to get the last N rows from a DataFrame; it returns a list of Row objects in PySpark and an Array[Row] in Spark with Scala.
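For example (a sketch, assuming an existing SparkSession named spark):

df = spark.createDataFrame([(3, 'monster', 4), (4, 'minions 3D', 5)],
                           ['numbersofrating', 'title', 'avg_rating'])
df.tail(1)  # [Row(numbersofrating=4, title='minions 3D', avg_rating=5)] - last row in current order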
fold() is similar to reduce() except fold() takes a 'zero value' as an initial value for each partition. reduce() is similar to aggregate() with one difference: reduce()'s return type must be the same as the RDD's element type, whereas aggregate() can return any type.
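A quick illustration of the difference (a sketch, assuming a SparkContext sc):

rdd = sc.parallelize([1, 2, 3, 4])
rdd.reduce(lambda a, b: a + b)   # 10; no initial value, result type == element type
rdd.fold(0, lambda a, b: a + b)  # 10; 0 is the 'zero value' used in each partition
# aggregate can return a different type, e.g. a (sum, count) tuple:
rdd.aggregate((0, 0),
              lambda acc, x: (acc[0] + x, acc[1] + 1),
              lambda a, b: (a[0] + b[0], a[1] + b[1]))  # (10, 4)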
You can use either top or takeOrdered with the key argument:
newRDD.top(2, key=lambda x: x[2])
or
newRDD.takeOrdered(2, key=lambda x: -x[2])
Note that top takes elements in descending order and takeOrdered in ascending order, so the key function is different in the two cases.
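To make the symmetry concrete (same sample data as above, assuming a SparkContext sc):

newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
newRDD.top(2, key=lambda x: x[2])           # [(4, 'minions 3D', 5), (3, 'monster', 4)]
newRDD.takeOrdered(2, key=lambda x: -x[2])  # [(4, 'minions 3D', 5), (3, 'monster', 4)]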
Have you tried using top? Given that you want the top avg ratings (and the rating is the third item in each tuple), you'll need to select it as the key using a lambda function.
>>> # items = (number_of_ratings, title, avg_rating)
>>> newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
>>> top_n = 10
>>> newRDD.top(top_n, key=lambda items: items[2])
[(4, 'minions 3D', 5), (3, 'monster', 4)]