PySpark: TypeError: 'Column' object is not callable

I'm loading data from HDFS and want to filter it by specific variables, but the Column.isin call does not work. It throws this error:

TypeError: 'Column' object is not callable

from pyspark.sql.functions import udf, col
variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
df = sqlContext.read.option("mergeSchema", "true").parquet("parameters.parquet")
same_var = col("Variable").isin(variables)
df2 = df.filter(same_var)

The schema looks like this:

df.printSchema()
root
 |-- Time: timestamp (nullable = true)
 |-- Value: float (nullable = true)
 |-- Variable: string (nullable = true)

Any idea what I am doing wrong? PS: It's Spark 1.4 with Jupyter Notebook.

asked Sep 07 '16 10:09

Matthias

1 Answer

The problem is that isin was added to Spark in version 1.5.0 and is therefore not yet available in your version of Spark, as noted in the documentation of isin.
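As for why the message says "not callable" instead of reporting a missing attribute: as far as I can tell, PySpark's Column resolves unknown attribute names as nested-field lookups, so on 1.4 the expression col("Variable").isin yields another Column, and calling that Column fails. The toy class below mimics this behaviour. FakeColumn and call_missing_method are made-up names for illustration only, not PySpark code.

```python
class FakeColumn(object):
    """Hypothetical stand-in for pyspark.sql.Column; not PySpark code."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Python calls this only when normal attribute lookup fails,
        # i.e. for names the class does not define (such as isin here).
        if item.startswith("__"):
            raise AttributeError(item)
        # Treat the unknown name as a nested field and return a new column.
        return FakeColumn("%s.%s" % (self.name, item))


def call_missing_method(column):
    """Call the undefined isin on the column and return the error text."""
    try:
        column.isin("a", "b")  # column.isin is a FakeColumn, not a method
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(call_missing_method(FakeColumn("Variable")))
```

The attribute access itself succeeds; the TypeError is raised only at the call.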

There is a similar function, in, in the Scala API that was introduced in 1.3.0 (there are some differences in the input, since in only accepts columns). In PySpark this function is called inSet instead. Usage examples from the documentation:

df[df.name.inSet("Bob", "Mike")]
df[df.age.inSet([1, 2, 3])]

Note: inSet is deprecated from version 1.5.0 onward; isin should be used in newer versions.
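If the same notebook code has to run on Spark versions both before and after 1.5.0, the choice between the two methods can be wrapped in a small helper. This is only a sketch: membership_filter is a made-up name, not part of PySpark, and it assumes the version is passed in as a string such as "1.4.1" (for example, the value reported by sc.version).

```python
def membership_filter(column, values, spark_version):
    """Build an 'is one of these values' filter for old or new Spark.

    column        -- a Column, e.g. df["Variable"]
    values        -- iterable of literal values to match
    spark_version -- version string such as "1.4.1" (assumed format)
    """
    # Compare only major.minor; suffixes after the second dot are ignored.
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (1, 5):
        # isin exists from 1.5.0 onward.
        return column.isin(*values)
    # Older versions only offer inSet.
    return column.inSet(*values)


# Hypothetical usage with the question's names (needs a running Spark):
# variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
# df2 = df.filter(membership_filter(df["Variable"], variables, sc.version))
```

The values are unpacked into separate arguments so the helper does not depend on how a given version treats a single list or tuple argument.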

answered Sep 21 '22 05:09

Shaido

Tags: python, apache-spark, pyspark, spark-dataframe
