I have a df whose 'products' column contains lists, like below:
+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|
+----------+---------+--------------------+
How do I add a new column product_cnt holding the length of each products list? And how do I filter df to get only the rows whose products list has a given length? Thanks.
To get the number of columns present in a PySpark DataFrame, use DataFrame.columns together with the len() function. DataFrame.columns returns all column names of the DataFrame as a list, and calling len() on that list gives the count of columns in the DataFrame.
Spark SQL provides a length() function that takes a DataFrame column as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used with filter() to filter DataFrame rows by the length of a column. If the input column is binary, it returns the number of bytes.
Spark/PySpark provides the size() SQL function to get the size of array and map type columns in a DataFrame, i.e. the number of elements in an ArrayType or MapType column.
Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
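As a minimal sketch of those helpers side by side (the DataFrame, column names, and values here are purely illustrative, not the questioner's data):

from pyspark.sql import SparkSession
from pyspark.sql.functions import length, size

spark = SparkSession.builder.getOrCreate()

# Toy data: one string column and one array column (illustrative only)
demo = spark.createDataFrame(
    [("abc", [1, 2, 3]), ("de", [4])],
    ["text_col", "items"],
)

print(len(demo.columns))   # number of columns -> 2
print(demo.count())        # number of rows -> 2

demo.select(length("text_col").alias("chars")).show()   # characters per string
demo.select(size("items").alias("n_items")).show()      # elements per array
demo.filter(length("text_col") > 2).show()               # filter rows by string length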
PySpark has a built-in function that does exactly what you want, called size: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size. To add it as a column, you can simply call it in your select statement:
from pyspark.sql.functions import size
countdf = df.select('*', size('products').alias('product_cnt'))
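Equivalently, you could attach the column with withColumn instead of select (a sketch of the same idea; countdf_alt is just an illustrative name):

countdf_alt = df.withColumn('product_cnt', size('products'))   # same result as the select above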
Filtering works exactly as @titiro89 described. Furthermore, you can use the size function directly in the filter. This lets you skip adding the extra column (if you wish to do so), in the following way:
filterdf = df.filter(size('products')==given_products_length)
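The same pattern works with any comparison, e.g. to keep only rows whose list has at least a given number of products (a sketch, reusing the size import and the given_products_length value from above):

filterdf_min = df.filter(size('products') >= given_products_length)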
First question:
How do I add a new column product_cnt holding the length of each products list?
>>> a = [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a, ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+
You can find a similar example here:
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> slen = udf(lambda s: len(s), IntegerType())
>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
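As a side note, the same column can also be produced without a Python udf by using the built-in size function shown in the other answer, which avoids the Python serialization overhead (a sketch; df2_alt is just an illustrative name):

>>> from pyspark.sql.functions import size
>>> df2_alt = df.withColumn("product_cnt", size(df.products))   # same columns as df2 above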
Second question:
And how do I filter df to get only the rows whose products list has a given length?
You can use the filter function (see the docs):
>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt == givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+
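If you only need the filtered result, the two steps can also be chained in one expression (a sketch, reusing slen and givenLength from above; the output is the same as df3.show()):

>>> from pyspark.sql.functions import col
>>> df.withColumn("product_cnt", slen(df.products)) \
...     .filter(col("product_cnt") == givenLength) \
...     .show()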