Python / Pyspark - Correct method chaining order rules

Tags:

python

apache-spark

apache-spark-sql

pyspark

method-chaining

Coming from a SQL development background and currently learning PySpark / Python, I am a bit confused about querying data and chaining methods in Python.

For instance, the query below (taken from 'Learning Spark, 2nd Edition'):

from pyspark.sql.functions import col

(fire_ts_df
 .select("CallType")
 .where(col("CallType").isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

will execute just fine.

What I don't understand, though, is why it fails if I move the call to count() higher in the chain:

(fire_ts_df
 .select("CallType")
 .count()
 .where(col("CallType").isNotNull())
 .groupBy("CallType")
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

This doesn't work. I don't want to memorize the order; I want to understand it. I feel it has something to do with proper method chaining in Python / PySpark, but I am not sure how to justify it. In other words, when multiple methods are invoked and chained with a dot (.), what is the right order, and is there a specific rule to follow?

Thanks a lot in advance


asked Sep 21 '20 18:09

SerZZZ

1 Answer

The important thing to note here is that the order of chained methods is not arbitrary. The operations represented by these method calls are not interchangeable transformations that can be applied to the data in any order.

Each method call could be written as a separate statement, where each statement produces a result that becomes the input to the next operation, and so on until the final result.

(fire_ts_df
 .select("CallType")                  # Selects column CallType into a 1-column DataFrame
 .where(col("CallType").isNotNull())  # Filters rows of the 1-column DataFrame from select()
 .groupBy("CallType")                 # Groups the filtered DataFrame into a pyspark.sql.group.GroupedData object
 .count()                             # Creates a new DataFrame from the GroupedData, with a "count" column
 .orderBy("count", ascending=False)   # Sorts the aggregated DataFrame, returning a new DataFrame
 .show(n=10, truncate=False))         # Prints the first 10 rows of the last DataFrame

To use your example: calling count() on a pyspark.sql.group.GroupedData object creates a new DataFrame holding the aggregation results. But count() called on a DataFrame returns just the number of rows, as a plain Python int. The following call, .where(col("CallType").isNotNull()), is therefore made on an int, which doesn't make sense: an int has no where() method, so Python raises an AttributeError.
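The failure can be reproduced without Spark at all. In this minimal sketch, row_count is a hypothetical stand-in for whatever number fire_ts_df.select("CallType").count() would return; the value 1000 is made up for illustration.

```python
# Hypothetical stand-in for the int returned by DataFrame.count().
row_count = 1000

try:
    # Same situation as chaining .where(...) right after .count():
    # the receiver is an int, not a DataFrame.
    row_count.where("CallType IS NOT NULL")
except AttributeError as exc:
    error_message = str(exc)

print(error_message)  # 'int' object has no attribute 'where'
```

The error is about the type of the receiver, not about Spark: once the chain has produced an int, only int methods are available.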

As noted above, it can help to rewrite the chain as separate statements:

call_type_df = fire_ts_df.select("CallType")                               # DataFrame
non_null_call_type = call_type_df.where(col("CallType").isNotNull())       # DataFrame
groupings = non_null_call_type.groupBy("CallType")                         # GroupedData
counts_by_call_type_df = groupings.count()                                 # DataFrame
ordered_counts = counts_by_call_type_df.orderBy("count", ascending=False)  # DataFrame

ordered_counts.show(n=10, truncate=False)                                  # prints, returns None

As you can see, the ordering is meaningful: each operation has to be consistent with the output of the one before it.

Chained calls make up what is referred to as a fluent API, which reduces verbosity. But that doesn't change the fact that each chained method must exist on the type returned by the preceding call, and that each operation is meant to be applied to the value produced by the one before it.
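To make the "type of the preceding output" rule concrete, here is a toy fluent API in plain Python. MiniFrame and MiniGrouped are hypothetical classes invented for this illustration; they only mimic the shape of DataFrame and GroupedData and are not how PySpark is implemented.

```python
from collections import Counter


class MiniFrame:
    """Toy stand-in for a DataFrame: a list of dict rows (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def select(self, column):
        return MiniFrame({column: row[column]} for row in self.rows)

    def where(self, predicate):
        return MiniFrame(row for row in self.rows if predicate(row))

    def group_by(self, column):
        return MiniGrouped(self.rows, column)  # a different type

    def order_by(self, column, ascending=True):
        ordered = sorted(self.rows, key=lambda r: r[column], reverse=not ascending)
        return MiniFrame(ordered)

    def count(self):
        return len(self.rows)  # a plain int: nothing frame-like can follow


class MiniGrouped:
    """Toy stand-in for GroupedData (hypothetical)."""

    def __init__(self, rows, column):
        self.rows = rows
        self.column = column

    def count(self):
        # Same method name as MiniFrame.count(), but it returns a MiniFrame.
        counts = Counter(row[self.column] for row in self.rows)
        return MiniFrame({self.column: key, "count": n} for key, n in counts.items())


calls = MiniFrame([
    {"CallType": "Medical"},
    {"CallType": None},
    {"CallType": "Fire"},
    {"CallType": "Medical"},
])

result = (
    calls.select("CallType")
    .where(lambda row: row["CallType"] is not None)
    .group_by("CallType")
    .count()                             # MiniGrouped.count() -> MiniFrame
    .order_by("count", ascending=False)
)
early_count = calls.select("CallType").count()  # MiniFrame.count() -> int

print(result.rows)
print(early_count)
```

Here count() means two different things depending on the object it is called on, which is why moving it earlier in the chain changes what the next method is applied to.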

answered Oct 23 '22 20:10

ernest_k
