
PySpark: TypeError: 'Column' object is not callable

I'm loading data from HDFS and want to filter it by specific variables. But somehow the Column.isin call does not work. It throws this error:

TypeError: 'Column' object is not callable

from pyspark.sql.functions import udf, col
variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
df = sqlContext.read.option("mergeSchema", "true").parquet("parameters.parquet")
same_var = col("Variable").isin(variables)  # <- this line raises the TypeError
df2 = df.filter(same_var)

The schema looks like this:

df.printSchema()
root
 |-- Time: timestamp (nullable = true)
 |-- Value: float (nullable = true)
 |-- Variable: string (nullable = true)

Any idea what I'm doing wrong? PS: It's Spark 1.4 with Jupyter Notebook.

asked Sep 07 '16 10:09 by Matthias


People also ask

How do I iterate over a PySpark column?

iterrows() is a pandas method, not a PySpark one: it is typically used after converting a (small) PySpark DataFrame with toPandas(). To stay in PySpark, select the column and iterate over the collected Row objects with a for loop.
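A minimal sketch of both approaches, assuming the DataFrame df with a Variable column from the question:

# Pure PySpark: collect() pulls the selected rows to the driver
for row in df.select("Variable").collect():
    print(row["Variable"])

# For small DataFrames, converting to pandas also works
for index, row in df.toPandas().iterrows():
    print(index, row["Variable"])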

What is Col function in PySpark?

col(col: str) → pyspark.sql.column.Column: returns a Column based on the given column name.
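A short sketch using the question's schema:

from pyspark.sql.functions import col

# col("Variable") builds a Column expression from the name alone,
# without going through a particular DataFrame
df.select(col("Variable")).show()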

What does DataFrame object is not callable mean?

The TypeError 'DataFrame' object is not callable occurs when you try to call a DataFrame as if it were a function. TypeErrors occur when you attempt to perform an operation that is illegal for a given data type. To solve this error, make sure there are no parentheses directly after a DataFrame name in your code.
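A hypothetical snippet illustrating the mistake and the fix:

# value = df("Variable")   # wrong: calls the DataFrame like a function -> TypeError
value = df["Variable"]     # right: square-bracket indexing returns a Column
value = df.Variable        # also right: attribute access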

How do you use isNULL in PySpark?

In PySpark, you can filter rows with NULL values using the filter() or where() methods of DataFrame together with the isNull() method of the Column class. Such a filter returns, as a new DataFrame, all rows that have null values in the checked column.
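A minimal sketch, applied to the nullable Value column from the question's schema:

from pyspark.sql.functions import col

df.filter(df["Value"].isNull()).show()   # rows where Value is NULL
df.where(col("Value").isNull()).show()   # equivalent via where() and col()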


1 Answer

The problem is that isin was added to Spark in version 1.5.0 and is therefore not yet available in your version of Spark, as seen in the documentation of isin.

There is a function called in in the Scala API, introduced in 1.3.0, that provides similar functionality (with some differences in input, since in only accepts columns). In PySpark this function is called inSet instead. Usage examples from the documentation:

df[df.name.inSet("Bob", "Mike")]
df[df.age.inSet([1, 2, 3])]

Note: inSet is deprecated as of version 1.5.0; isin should be used in newer versions.
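Applied to the question's snippet on Spark 1.4, a sketch (the values are unpacked with * to match the varargs form shown in the documentation):

from pyspark.sql.functions import col

variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
same_var = col("Variable").inSet(*variables)   # Spark 1.4
# same_var = col("Variable").isin(*variables)  # Spark >= 1.5
df2 = df.filter(same_var)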

answered Sep 21 '22 05:09 by Shaido