
Filtering DataFrame using the length of a column

I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.

More specifically, I have a DataFrame with only one column, which is of ArrayType(StringType()). I want to filter the DataFrame using the length as the filter condition; I show a snippet below.

df = sqlContext.read.parquet("letters.parquet")
df.show()

# The output will be
# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show()
# This raises "TypeError: object of type 'Column' has no len()",
# so the previous statement is obviously incorrect.

I read the Column documentation, but didn't find any property useful for this matter. I'd appreciate any help!

asked Nov 13 '15 by Alberto Bonsanto



2 Answers

In Spark >= 1.5 you can use the size function:

from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"], ),
    (["L", "V", "I", "S"], ),
    (["I", "A", "N", "A"], ),
    (["I", "L", "S", "A"], ),
    (["E", "N", "N", "Y"], ),
    (["E", "I", "M", "A"], ),
    (["O", "A", "N", "A"], ),
    (["S", "U", "S"], )],
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
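If you want to inspect the lengths rather than filter right away, size can also be materialized as an ordinary column first. A minimal sketch (the column name token_count is my own choice, not from the original answer):

from pyspark.sql.functions import col, size

# Add the array length as a regular column, then filter on it
df_with_len = df.withColumn("token_count", size(col("tokens")))
df_with_len.where(col("token_count") <= 3).show()

## +---------+-----------+
## |   tokens|token_count|
## +---------+-----------+
## |[S, U, S]|          3|
## +---------+-----------+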

In Spark < 1.5 a UDF should do the trick:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
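One caveat worth adding (my note, not part of the original answer): a Python UDF receives None for null array values, and len(None) raises a TypeError in the worker, so if the column is nullable you may want to guard against that:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Return null instead of failing when the array itself is null
size_ = udf(lambda xs: len(xs) if xs is not None else None, IntegerType())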

If you use HiveContext, then the size UDF with raw SQL should work with any version:

df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+

For string columns you can either use a udf like the one defined above or the length function:

from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+
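As a side note, on Spark 2.x and later the usual entry point is SparkSession rather than sqlContext; a minimal sketch of the same string filter under that assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("fooo",), ("bar",)], ("k",))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+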
answered by zero323


Here is an example for String in Scala:

val stringData = Seq(("Maheswara"), ("Mokshith"))
val df = sc.parallelize(stringData).toDF

df.where((length($"value")) <= 8).show

// +--------+
// |   value|
// +--------+
// |Mokshith|
// +--------+

df.withColumn("length", length($"value")).show

// +---------+------+
// |    value|length|
// +---------+------+
// |Maheswara|     9|
// | Mokshith|     8|
// +---------+------+
answered by mputha