I have a df whose 'products' column contains lists, like below:
+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|
+----------+---------+--------------------+
How do I add a new column product_cnt holding the length of each products list? And how do I filter df to get only the rows whose products list has a given length? Thanks.
To get the number of columns present in a PySpark DataFrame, use DataFrame.columns together with the len() function. DataFrame.columns returns all column names of the DataFrame as a list, and calling len() on that list gives the count of columns in the DataFrame.
Spark SQL provides a length() function that takes a DataFrame column as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used with filter() to filter DataFrame rows by the length of a column. If the input column is binary, it returns the number of bytes.
Spark/PySpark provides the size() SQL function to get the size of array and map type columns in a DataFrame, i.e. the number of elements in an ArrayType or MapType column.
Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
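As a minimal sketch of those helpers side by side (the DataFrame, column names, and values here are purely illustrative, not the questioner's data):

from pyspark.sql import SparkSession
from pyspark.sql.functions import length, size

spark = SparkSession.builder.getOrCreate()

# Toy data: one string column and one array column (illustrative only)
demo = spark.createDataFrame(
    [("abc", [1, 2, 3]), ("de", [4])],
    ["text_col", "items"],
)

print(len(demo.columns))   # number of columns -> 2
print(demo.count())        # number of rows -> 2

demo.select(length("text_col").alias("chars")).show()   # characters per string
demo.select(size("items").alias("n_items")).show()      # elements per array
demo.filter(length("text_col") > 2).show()               # filter rows by string length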
PySpark has a built-in function that does exactly what you want, called size: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size. To add it as a column, you can simply call it in your select statement:
from pyspark.sql.functions import size
countdf = df.select('*', size('products').alias('product_cnt'))
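Equivalently, you could attach the column with withColumn instead of select (a sketch of the same idea; countdf_alt is just an illustrative name):

countdf_alt = df.withColumn('product_cnt', size('products'))   # same result as the select above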
Filtering works exactly as @titiro89 described. Furthermore, you can use the size function directly in the filter. This lets you skip adding the extra column (if you wish to do so), in the following way:
filterdf = df.filter(size('products')==given_products_length)
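The same pattern works with any comparison, e.g. to keep only rows whose list has at least a given number of products (a sketch, reusing the size import and the given_products_length value from above):

filterdf_min = df.filter(size('products') >= given_products_length)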
First question:
How do I add a new column product_cnt holding the length of each products list?
>>> a = [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a, ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+
You can find a similar example here:
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> slen = udf(lambda s: len(s), IntegerType())
>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
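As a side note, the same column can also be produced without a Python udf by using the built-in size function shown in the other answer, which avoids the Python serialization overhead (a sketch; df2_alt is just an illustrative name):

>>> from pyspark.sql.functions import size
>>> df2_alt = df.withColumn("product_cnt", size(df.products))   # same columns as df2 above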
Second question:
And how do I filter df to get only the rows whose products list has a given length?
You can use the filter function (see the docs):
>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt == givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+
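If you only need the filtered result, the two steps can also be chained in one expression (a sketch, reusing slen and givenLength from above; the output is the same as df3.show()):

>>> from pyspark.sql.functions import col
>>> df.withColumn("product_cnt", slen(df.products)) \
...     .filter(col("product_cnt") == givenLength) \
...     .show()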