I want to filter a DataFrame using a condition on the length of a column. This question might be very easy, but I didn't find any related question on SO.
More specifically, I have a DataFrame with only one column, which is of ArrayType(StringType()). I want to filter the DataFrame using the length as the filter; I show a snippet below.
df = sqlContext.read.parquet("letters.parquet")
df.show()

# The output will be
# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show()
# This raises TypeError: object of type 'Column' has no len(),
# so the previous statement is obviously incorrect.

I read Column's documentation, but didn't find any property useful for this matter. I appreciate any help!
In Spark >= 1.5 you can use the size function:
from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"], ),
    (["L", "V", "I", "S"], ),
    (["I", "A", "N", "A"], ),
    (["I", "L", "S", "A"], ),
    (["E", "N", "N", "Y"], ),
    (["E", "I", "M", "A"], ),
    (["O", "A", "N", "A"], ),
    (["S", "U", "S"], )],
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

In Spark < 1.5 a UDF should do the trick:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

If you use HiveContext, then the size UDF with raw SQL should work with any version:
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+

For string columns you can either use a UDF like the one defined above or the length function:
from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+
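As a side note, on Spark 2.x and later the entry point and the temp-table call have newer spellings: SparkSession supersedes SQLContext/HiveContext, and createOrReplaceTempView supersedes registerTempTable. A minimal sketch of the same array-size filter under that assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size

# Spark 2.x+ entry point, replacing SQLContext/HiveContext
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["L", "S", "Y", "S"], ), (["S", "U", "S"], )],
    ["tokens"])

# DataFrame API: size() counts the elements of the array column
df.where(size(col("tokens")) <= 3).show()

# Raw SQL: createOrReplaceTempView is the Spark 2.x+ replacement for registerTempTable
df.createOrReplaceTempView("df")
spark.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()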
Here is an example for a string column in Scala:
// Assumes spark-shell (or equivalent imports of the SQL functions and implicits)
val stringData = Seq(("Maheswara"), ("Mokshith"))
val df = sc.parallelize(stringData).toDF

df.where((length($"value")) <= 8).show

+--------+
|   value|
+--------+
|Mokshith|
+--------+

df.withColumn("length", length($"value")).show

+---------+------+
|    value|length|
+---------+------+
|Maheswara|     9|
| Mokshith|     8|
+---------+------+
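For comparison, the same pattern in PySpark; a minimal sketch that assumes the sqlContext used in the earlier examples:

from pyspark.sql.functions import col, length

df = sqlContext.createDataFrame([("Maheswara", ), ("Mokshith", )], ("value", ))

# Keep only rows whose string is at most 8 characters long
df.where(length(col("value")) <= 8).show()

# Or expose the computed length as its own column for inspection
df.withColumn("length", length(col("value"))).show()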