PySpark: TypeError: condition should be string or Column

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark

I am trying to filter an RDD as shown below:

spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)

But I got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
      1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
      3 spark_df.take(5)

/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
    904             jdf = self._jdf.filter(condition._jc)
    905         else:
--> 906             raise TypeError("condition should be string or Column")
    907         return DataFrame(jdf, self.sql_ctx)
    908 

TypeError: condition should be string or Column

Any idea what I missed? Thank you!

asked Oct 05 '16 17:10

Edamame

1 Answer

DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression, given either as a Column:

from pyspark.sql.functions import col

spark_df.filter(col("target").like("good%"))

or as an equivalent SQL string:

spark_df.filter("target LIKE 'good%'")

I believe you're trying to use RDD.filter here, which is a completely different method:

spark_df.rdd.filter(lambda r: r['target'].startswith('good'))

and does not benefit from SQL optimizations.
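
To see why the lambda is rejected, the argument check visible in the traceback from the question can be mimicked in plain Python. This is only a rough sketch: the Column class and the describe_condition function below are hypothetical stand-ins, not the real PySpark code, and they assume the check is a simple type test on the condition argument.

```python
# Simplified stand-ins, NOT the real PySpark classes: they only mimic the
# argument check that the traceback shows inside DataFrame.filter.
class Column:
    """Hypothetical placeholder for pyspark.sql.Column."""

    def __init__(self, expr):
        self.expr = expr


def describe_condition(condition):
    """Report how a DataFrame-style filter would treat `condition`."""
    if isinstance(condition, str):
        return "SQL string: " + condition
    if isinstance(condition, Column):
        return "Column: " + condition.expr
    # A lambda (or any other object) ends up here, as in the question.
    raise TypeError("condition should be string or Column")


print(describe_condition("target LIKE 'good%'"))
print(describe_condition(Column("target LIKE 'good%'")))

try:
    describe_condition(lambda r: str(r["target"]).startswith("good"))
except TypeError as exc:
    print("TypeError:", exc)

# A plain Python predicate is the kind of argument RDD.filter takes.
# Here the same predicate is applied to a list of dicts for illustration.
rows = [{"target": "good_1"}, {"target": "bad_1"}]
kept = [r for r in rows if str(r["target"]).startswith("good")]
print(kept)
```

The first two calls succeed because a string and a Column are accepted forms, while the lambda falls through to the TypeError. The same predicate works once it is applied row by row, which is what spark_df.rdd.filter does.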

answered Sep 19 '22 08:09

zero323
