 

PySpark error: AnalysisException: 'Cannot resolve column name'

I am trying to transform an entire DataFrame into a single vector column, using:

df_vec = vectorAssembler.transform(df.drop('col200'))

This throws the following error:

File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Cannot resolve column name "col200" among (col1, col2..

I looked around the internet and found that this error can be caused by whitespace in the column headers. The problem is that there are around 1600 columns, and checking each one by hand, especially for whitespace, is quite a task. How do I approach this? FYI, the DataFrame has around 800000 rows.

Looking at the output of df.printSchema(), I don't see any whitespace, at least not leading whitespace. I am also fairly sure that none of the column names contain spaces in the middle.

At this point, I am totally blocked! Any help would be greatly appreciated.

Anonymous Person asked Apr 01 '19 10:04




1 Answer

This has happened to me a couple of times. Try this:

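tempList = []
for col in df.columns:
    new_name = col.strip()                # trim leading/trailing whitespace
    new_name = "".join(new_name.split())  # remove any whitespace inside the name
    new_name = new_name.replace('.', '')  # remove dots, which Spark reads as nested-field access
    tempList.append(new_name)
print(tempList)  # optional: inspect the cleaned names

df = df.toDF(*tempList)  # rename all columns in one call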

The code strips leading and trailing whitespace from every column name, removes any whitespace and dots inside the name, and then renames all the columns of your DataFrame in one call.
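With around 1600 columns it may also help to see which names are actually affected before renaming anything. The following is only a sketch in plain Python: the helper names find_suspicious_names and clean_names are made up for illustration, and the sample list stands in for df.columns. It prints the repr() of each name containing whitespace or a dot (so tabs and similar characters become visible), and it raises an error if two names would become identical after cleaning.

```python
import re

# Matches any whitespace character (spaces, tabs, newlines, ...) or a dot.
_BAD_CHARS = re.compile(r"[\s.]")


def find_suspicious_names(columns):
    """Return (index, repr(name)) for every name containing whitespace or a dot."""
    return [(i, repr(name)) for i, name in enumerate(columns) if _BAD_CHARS.search(name)]


def clean_names(columns):
    """Remove whitespace and dots from each name; fail if two names collide afterwards."""
    cleaned = [_BAD_CHARS.sub("", name) for name in columns]
    seen = {}
    for original, new in zip(columns, cleaned):
        if new in seen:
            raise ValueError(
                f"{original!r} and {seen[new]!r} both clean to {new!r}"
            )
        seen[new] = original
    return cleaned


# Hypothetical sample; in practice pass df.columns instead.
sample = ["col1", " col2", "col 3", "col200\t", "a.b"]
print(find_suspicious_names(sample))
print(clean_names(sample))
```

If the cleaned list looks right, it can be applied the same way as above, with df = df.toDF(*clean_names(df.columns)). The collision check is there because two distinct names such as "a b" and "ab" would otherwise end up as duplicate columns.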

Manrique answered Oct 21 '22 01:10