get value out of dataframe

In Scala I can do get(#) or getAs[Type](#) to get values out of a dataframe. How should I do it in pyspark?

I have a DataFrame with two columns: item (string) and salesNum (integer). I do a groupBy and mean to get the mean of those numbers like this:

saleDF.groupBy("salesNum").mean().collect()

and it works. Now I have the mean in a DataFrame with one value.

How can I get that value out of the dataframe to get the mean as a float number?

Asked Jun 27 '16 by M.Rez


3 Answers

collect() returns your results as a Python list. To get the value out of the list you just need to take the first element like this:

saleDF.groupBy("salesNum").mean().collect()[0]
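Conceptually, the collected result is just a list of Row objects, and a Row behaves like a tuple with named fields (the real pyspark Row additionally supports string indexing such as row["avg(salesNum)"]). A minimal pure-Python sketch of the extraction, using collections.namedtuple as a stand-in for pyspark.sql.Row and made-up field names and values:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is itself a tuple subclass.
# The field names and values here are hypothetical, for illustration only.
Row = namedtuple("Row", ["salesNum", "avg_salesNum"])

# What collect() hands back: a plain Python list of Row objects
collected = [Row(salesNum=5, avg_salesNum=3.5)]

first_row = collected[0]             # index the list -> a Row
mean_value = first_row[1]            # index the Row by position -> the float
same_value = first_row.avg_salesNum  # or access the field by name

print(mean_value)  # 3.5
```

So two indexing steps, list then Row, take you from the DataFrame result to a plain float.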
Answered Oct 12 '22 by David


To be precise, collect() returns a list whose elements are of type pyspark.sql.types.Row.

In your case to extract the real value you should do:

saleDF.groupBy("salesNum").mean().collect()[0]["avg(yourColumnName)"]

where yourColumnName is the name of the column you are taking the mean of (pyspark, when applying mean, renames the resulting column in this way by default).

As an example, I ran the following code. Look at the types and outputs of each step.

>>> columns = ['id', 'dogs', 'cats', 'nation']
>>> vals = [
...      (2, 0, 1, 'italy'),
...      (1, 2, 0, 'italy'),
...      (3, 4, 0, 'france')
... ]
>>> df = sqlContext.createDataFrame(vals, columns)
>>> df.groupBy("nation").mean("dogs").collect()
[Row(nation=u'france', avg(dogs)=4.0), Row(nation=u'italy', avg(dogs)=1.0)]
>>> df.groupBy("nation").mean("dogs").collect()[0]
Row(nation=u'france', avg(dogs)=4.0)
>>> df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"]
4.0
>>> type(df.groupBy("nation").mean("dogs").collect())
<type 'list'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0])
<class 'pyspark.sql.types.Row'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"])
<type 'float'>
Answered Oct 12 '22 by Francesco Boi


We can also use first() here:

saleDF.groupBy("salesNum").mean().first()[0]
Answered Oct 12 '22 by Jithu