 

How to get the length of lists in a column of a Spark DataFrame?

Tags:

pyspark

I have a df whose 'products' column contains lists, like below:

+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|

How can I add a new column product_cnt holding the length of each products list? And how can I filter df to keep only rows whose products list has a given length? Thanks.

asked Jun 14 '17 by yanachen

People also ask

How do I find the length of a list in a column PySpark?

To get the number of columns present in a PySpark DataFrame, use DataFrame.columns with the len() function. DataFrame.columns returns all column names of the DataFrame as a list; the len() function then gives you the length of that list, which is the count of columns present in the PySpark DataFrame.
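For instance, a minimal sketch (assuming an active SparkSession named spark and a throwaway three-column DataFrame):

>>> df = spark.createDataFrame([(1, "a", True)], ["id", "name", "flag"])
>>> df.columns
['id', 'name', 'flag']
>>> len(df.columns)
3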

How do I get the length of a column in Spark DataFrame?

Spark SQL provides a length() function that takes a DataFrame column as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used with filter() to select DataFrame rows by the length of a column. If the input column is Binary, it returns the number of bytes.
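For example (a sketch assuming a SparkSession named spark and a toy string column):

>>> from pyspark.sql.functions import length
>>> df = spark.createDataFrame([("ab",), ("abcd",)], ["name"])
>>> df.filter(length(df.name) > 2).show()
+----+
|name|
+----+
|abcd|
+----+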

How do you find the size of a list in PySpark?

Spark/PySpark provides the size() SQL function to get the size of array and map type columns in a DataFrame (the number of elements in ArrayType or MapType columns).
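For example (a sketch with a throwaway array column and map column):

>>> from pyspark.sql.functions import size
>>> df = spark.createDataFrame([([1, 2, 3], {"a": 1})], ["arr", "m"])
>>> df.select(size("arr"), size("m")).show()
+---------+-------+
|size(arr)|size(m)|
+---------+-------+
|        3|      1|
+---------+-------+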

How do I find the length of a Spark data frame?

Similar to pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
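For example (a sketch assuming a SparkSession named spark):

>>> df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ["x", "y"])
>>> (df.count(), len(df.columns))  # (rows, columns), analogous to pandas' df.shape
(3, 2)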


2 Answers

PySpark has a built-in function, size, that does exactly what you want: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size. To add it as a column, you can simply call it in your select statement.

from pyspark.sql.functions import size

countdf = df.select('*', size('products').alias('product_cnt'))

Filtering works exactly as @titiro89 described. Furthermore, you can use the size function directly in the filter. This lets you skip adding the extra column, if you wish, as follows.

filterdf = df.filter(size('products') == given_products_length)
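If you prefer SQL expression strings, the same filter can also be written without importing size (a sketch using a hypothetical length of 2 in place of given_products_length):

filterdf = df.filter('size(products) = 2')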
answered Sep 21 '22 by DavidWayne


First question:

How can I add a new column product_cnt holding the length of each products list?

>>> a = [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a, ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+

You can find a similar example in the PySpark documentation for udf

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf

>>> slen = udf(lambda s: len(s), IntegerType())

>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
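As a side note, since slen is a plain Python UDF, every row's list is shipped to a Python worker just to call len on it; the built-in size function used in the other answer produces the same column without that overhead:

>>> from pyspark.sql.functions import size
>>> df.withColumn("product_cnt", size(df.products)).show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+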

Second question:

And how can I filter df to keep only rows whose products list has a given length?

You can use the filter function; see the PySpark docs for DataFrame.filter

>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt == givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+
answered Sep 19 '22 by titiro89