I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.

More specifically, I have a DataFrame with only one column, which is of ArrayType(StringType()), and I want to filter the DataFrame using the length as the filter criterion. A snippet is shown below.
df = sqlContext.read.parquet("letters.parquet")
df.show()
# The output will be
# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show()
# This raises TypeError: object of type 'Column' has no len(),
# so the previous statement is obviously incorrect.
I read the Column documentation, but didn't find any property useful for this matter. I appreciate any help!
In Spark >= 1.5 you can use the size function:
from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"], ),
    (["L", "V", "I", "S"], ),
    (["I", "A", "N", "A"], ),
    (["I", "L", "S", "A"], ),
    (["E", "N", "N", "Y"], ),
    (["E", "I", "M", "A"], ),
    (["O", "A", "N", "A"], ),
    (["S", "U", "S"], )
], ("tokens", ))

df.where(size(col("tokens")) <= 3).show()
## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
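If you are on Spark 2.x or later, the same size-based filter works unchanged; the only difference is that SparkSession replaces sqlContext as the entry point. A minimal sketch under that assumption (the app name is just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size

# SparkSession is the Spark 2.x+ entry point (illustrative app name)
spark = SparkSession.builder.appName("array-length-filter").getOrCreate()

df = spark.createDataFrame(
    [(["L", "S", "Y", "S"], ), (["S", "U", "S"], )],
    ("tokens", ))

# size() returns the number of elements in the array column
df.where(size(col("tokens")) <= 3).show()
## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+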
In Spark < 1.5 a UDF should do the trick:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()
## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
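One caveat with the UDF version: if tokens can be null, len(xs) will raise a TypeError on the executor. A null-safe variant is sketched below; the None handling is my own assumption about the data, not part of the original answer.

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, col

# Return None for null arrays instead of failing on the worker;
# rows with a null size are then dropped by the <= 3 comparison
size_safe = udf(lambda xs: len(xs) if xs is not None else None, IntegerType())

df.where(size_safe(col("tokens")) <= 3).show()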
If you use HiveContext then the size UDF with raw SQL should work with any version:
df.registerTempTable("df")

sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+
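On Spark 2.x+, registerTempTable is deprecated in favour of createOrReplaceTempView, and the SQL is issued through the SparkSession rather than sqlContext. A sketch, assuming a spark session is already in scope:

# Spark 2.x+ equivalent of the raw-SQL approach above
df.createOrReplaceTempView("df")
spark.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()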
For string columns you can use either the udf defined above or the length function:
from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))

df.where(length(col("k")) <= 3).show()
## +---+
## |  k|
## +---+
## |bar|
## +---+
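If you also want to inspect the computed length rather than only filter on it, withColumn works the same way for strings. A small sketch; the column name k_len is my own choice, not part of the original answer:

from pyspark.sql.functions import length, col

# Materialize the string length as its own column
df.withColumn("k_len", length(col("k"))).show()
## +----+-----+
## |   k|k_len|
## +----+-----+
## |fooo|    4|
## | bar|    3|
## +----+-----+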
Here is an example for a string column in Scala:
val stringData = Seq(("Maheswara"), ("Mokshith"))
val df = sc.parallelize(stringData).toDF

df.where((length($"value")) <= 8).show
+--------+
|   value|
+--------+
|Mokshith|
+--------+

df.withColumn("length", length($"value")).show
+---------+------+
|    value|length|
+---------+------+
|Maheswara|     9|
| Mokshith|     8|
+---------+------+
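For completeness, the same pattern translates directly to PySpark; a sketch using the same two illustrative names and assuming a Spark 2.x+ spark session:

from pyspark.sql.functions import length, col

df = spark.createDataFrame([("Maheswara", ), ("Mokshith", )], ("value", ))

# Filter on the string length, then materialize it as a column
df.where(length(col("value")) <= 8).show()
df.withColumn("length", length(col("value"))).show()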