
PySpark: TypeError: condition should be string or Column

I am trying to filter an RDD as shown below:

spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)

But I got the following error:

TypeErrorTraceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
      1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
      3 spark_df.take(5)

/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
    904             jdf = self._jdf.filter(condition._jc)
    905         else:
--> 906             raise TypeError("condition should be string or Column")
    907         return DataFrame(jdf, self.sql_ctx)
    908 

TypeError: condition should be string or Column

Any idea what I missed? Thank you!

asked Oct 05 '16 by Edamame

1 Answer

DataFrame.filter, which is an alias for DataFrame.where, expects a condition expressed either as a Column:

from pyspark.sql.functions import col

spark_df.filter(col("target").like("good%"))

or equivalent SQL string:

spark_df.filter("target LIKE 'good%'")

I believe you're trying to use RDD.filter here, which is a completely different method:

spark_df.rdd.filter(lambda r: r['target'].startswith('good'))

and does not benefit from SQL optimizations.
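
If you want to keep the original startswith logic while staying in the DataFrame API (and keeping its optimizations), Column.startswith should work; a small sketch, assuming col is imported from pyspark.sql.functions as above:

# Stays in the DataFrame API, so the filter can still be optimized.
spark_df.filter(col("target").startswith("good")).take(5)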

answered Sep 19 '22 by zero323