Store aggregate value of a PySpark dataframe column into a variable

I am working with PySpark DataFrames. "test1" is my DataFrame and event_date is a TimestampType column. When I take a distinct count of event_date the result is an integer variable, but when I take the max of the same column the result is a DataFrame. I would like to understand which operations produce a DataFrame and which produce a plain variable, and also how to store the max of event_date in a variable.

Code that results in an integer type:

loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)

Code that results in a DataFrame type:

last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)

Edited to add a reproducible example:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType
from pyspark.sql.functions import max

schema = StructType([StructField("event_date", TimestampType(), True)])

df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),),
                                 (datetime(2015, 8, 10, 3, 44, 15),)], schema)

Code that returns a DataFrame:

last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)

Code that returns a variable:

loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt) 
asked May 02 '16 by Sid

People also ask

How do you sum the values of a column in PySpark?

By selecting sum() over the column you get the total value, and collect() brings that value back to the driver. Here df is the input PySpark DataFrame and column_name is the column to sum.
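A minimal sketch of that pattern (assuming a live sqlContext like the one in the question, and a made-up amount column):

from pyspark.sql.functions import sum as sql_sum  # aliased so Python's built-in sum() isn't shadowed

df = sqlContext.createDataFrame([(10,), (20,), (5,)], ['amount'])
# select() builds a one-row DataFrame; collect()[0][0] pulls out the value
total = df.select(sql_sum('amount')).collect()[0][0]
print(total)  # 35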

How do you use the AGG function in PySpark DataFrame?

The agg() function takes the column name and the aggregate function to apply. For example, sum() totals the grouped data for a column, and max() returns the largest value in the column.
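For example, a sketch with a hypothetical dept/amount DataFrame (same sqlContext as above):

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([('books', 10), ('books', 20), ('toys', 5)],
                                ['dept', 'amount'])
# agg() applies one or more aggregate functions to each group
df.groupBy('dept').agg(F.sum('amount').alias('total'),
                       F.max('amount').alias('largest')).show()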

How do you assign a value to a DataFrame column in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
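For instance (a sketch; the column names are made up):

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([('a', 1), ('b', 2)], ['key', 'value'])
# withColumn() never mutates df; it returns a new DataFrame
updated = df.withColumn('value', F.col('value') * 10)
updated.show()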

How do I copy values from one column to another in PySpark?

The PySpark SQL function regexp_replace() replaces occurrences of a pattern in a column's values with another string; combined with withColumn(), this lets you derive one column's values from another, as sketched below.
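A sketch of both ideas, copying a column with withColumn() and rewriting substrings with regexp_replace():

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([('2015-08-10',)], ['event_date'])
df = df.withColumn('event_date_copy', F.col('event_date'))  # plain column copy
df = df.withColumn('event_date',
                   F.regexp_replace('event_date', '-', '/'))  # substring replace
df.show()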


2 Answers

You cannot directly access the values in a DataFrame; collecting one gives you Row objects. A Row can, however, be converted into a Python dictionary. Walk through the following example, where I calculate the average word count:

wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()

Here are the word count results:

+--------+-----+
|    word|count|
+--------+-----+
|     cat|    2|
|     rat|    2|
|elephant|    1|
+--------+-----+

Now I calculate the average of the count column and apply the collect() operation to the result. Remember that collect() returns a list; here the list contains only one element.

averageCount = wordCountsDF.groupBy().avg('count').collect()

Result looks something like this.

[Row(avg(count)=1.6666666666666667)]

The average is still wrapped in a Row, not held in a plain Python variable. One way to extract it is to convert the Row into a dictionary:

results = {}
for i in averageCount:
    results.update(i.asDict())
print(results)

Our final results look like these:

{'avg(count)': 1.6666666666666667}

Finally, you can access the average value using:

print(results['avg(count)'])

1.6666666666666667
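Note that Row objects also support direct indexing, so the dictionary step can be skipped if you prefer:

print(averageCount[0][0])             # by position
print(averageCount[0]['avg(count)'])  # by column name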
answered by sujit

df.select([max('event_date')]) returns a DataFrame because select() is a transformation: it always produces a new DataFrame, even when the expression you pass it is an aggregate that can only ever yield a single row. The values are not brought back to the driver until you call an action such as collect().

df.select('event_date').distinct().count() returns an integer because count() is an action: it runs the query and tells you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.

If you want to get the max event_date and store it in a variable, collect the one-row result and pull the value out of the Row (this assumes max is pyspark.sql.functions.max, not Python's built-in): max_date = df.select([max('event_date')]).collect()[0][0] — collect() returns a list of Row objects, and here the list holds a single Row whose first field is the max timestamp.
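An equivalent sketch using agg(); the dictionary form avoids shadowing Python's built-in max():

max_date = df.agg({'event_date': 'max'}).collect()[0][0]
print(max_date)  # datetime.datetime(2015, 8, 10, 3, 44, 15)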

answered by Katya Willard