I'm loading data from HDFS, which I want to filter by specific variables. But somehow the Column.isin command does not work. It throws this error:
TypeError: 'Column' object is not callable
from pyspark.sql.functions import udf, col
variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
df = sqlContext.read.option("mergeSchema", "true").parquet("parameters.parquet")
same_var = col("Variable").isin(variables)
df2 = df.filter(same_var)
The schema looks like this:
df.printSchema()
root
|-- Time: timestamp (nullable = true)
|-- Value: float (nullable = true)
|-- Variable: string (nullable = true)
Any idea what I am doing wrong? PS: It's Spark 1.4 with Jupyter Notebook.
The problem is that isin was added to Spark in version 1.5.0 and is therefore not yet available in your version of Spark, as noted in the documentation for isin.
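As for why the message is "not callable" rather than an AttributeError: a likely explanation, assuming PySpark's Column resolves unknown attribute names as nested-field access through `__getattr__`, is that `col("Variable").isin` silently evaluates to another Column, and calling that Column then fails. The sketch below reproduces the effect with a toy class; `ToyColumn` is a hypothetical stand-in written for illustration and is not PySpark's real class.

```python
class ToyColumn(object):
    """Toy stand-in for a column expression; NOT PySpark's real Column."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only invoked when normal attribute lookup fails.
        if item.startswith("__"):
            raise AttributeError(item)
        # Assumption: an unknown name is treated as a nested field,
        # so the lookup succeeds and yields another column object.
        return ToyColumn("%s.%s" % (self.name, item))


# There is no isin method here, yet the attribute lookup does not fail.
attr = ToyColumn("Variable").isin

try:
    attr("852-PI-769")  # calling a column object raises TypeError
except TypeError as exc:
    message = str(exc)
```

Under that assumption, the missing method never surfaces as a missing attribute, which is why the error points at the call instead.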
The Scala API has a similar function, in, which was introduced in 1.3.0 (its input differs somewhat, since in only accepts columns). In PySpark this function is called inSet instead. Usage examples from the documentation:
df[df.name.inSet("Bob", "Mike")]
df[df.age.inSet([1, 2, 3])]
Note: inSet is deprecated from version 1.5.0 onward; isin should be used in newer versions.
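If the same notebook has to run on clusters both older and newer than 1.5.0, one option is to pick the method name from the Spark version string. This is only a minimal sketch: `membership_method` and `is_in` are hypothetical helper names, and it assumes the version string starts with `major.minor` (as in "1.4.1") and that both methods accept the values as separate arguments, as in the documentation examples above.

```python
def membership_method(spark_version):
    """Return the Column method name to use for a membership test."""
    # Compare only major.minor, e.g. "1.4.1" -> (1, 4).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return "isin" if (major, minor) >= (1, 5) else "inSet"


def is_in(column, values, spark_version):
    """Build a membership expression on `column` for `values`."""
    # Look the method up by name so that only the one matching the
    # running Spark version is ever touched.
    method = getattr(column, membership_method(spark_version))
    return method(*values)
```

With the question's variables this could be used as `df.filter(is_in(df.Variable, variables, sc.version))`, assuming `sc` is the SparkContext the notebook provides. A `hasattr` check would probably not be a reliable alternative, since attribute lookup on a Column may succeed even when the method does not exist.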