In Scala I can do get(#) or getAs[Type](#) to get values out of a DataFrame. How should I do it in PySpark?
I have a two-column DataFrame: item (string) and salesNum (integer). I do a groupBy and mean to get the mean of those numbers like this:
saleDF.groupBy("salesNum").mean().collect()
and it works. Now I have the mean in a dataframe with one value.
How can I get that value out of the dataframe to get the mean as a float number?
collect() returns your results as a Python list. To get the value out of the list you just need to take the first element like this:
saleDF.groupBy("salesNum").mean().collect()[0]
To be precise, collect returns a list whose elements are of type pyspark.sql.types.Row.
In your case to extract the real value you should do:
saleDF.groupBy("salesNum").mean().collect()[0]["avg(yourColumnName)"]
where yourColumnName is the name of the column you are taking the mean of (PySpark, when applying mean, renames the resulting column in this way by default).
As an example, I ran the following code. Look at the types and outputs of each step.
>>> columns = ['id', 'dogs', 'cats', 'nation']
>>> vals = [
... (2, 0, 1, 'italy'),
... (1, 2, 0, 'italy'),
... (3, 4, 0, 'france')
... ]
>>> df = sqlContext.createDataFrame(vals, columns)
>>> df.groupBy("nation").mean("dogs").collect()
[Row(nation=u'france', avg(dogs)=4.0), Row(nation=u'italy', avg(dogs)=1.0)]
>>> df.groupBy("nation").mean("dogs").collect()[0]
Row(nation=u'france', avg(dogs)=4.0)
>>> df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"]
4.0
>>> type(df.groupBy("nation").mean("dogs").collect())
<type 'list'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0])
<class 'pyspark.sql.types.Row'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"])
<type 'float'>
We can also use first() here:
saleDF.groupBy("salesNum").mean().first()[0]