I am trying to assemble an entire DataFrame into a single vector column, using:
df_vec = vectorAssembler.transform(df.drop('col200'))
I get this error:
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Cannot resolve column name "col200" among (col1, col2..
I searched online and found that this error can be caused by white space in the column headers. The problem is that there are around 1600 columns, so checking each one by hand, especially for white space, is quite a task. How should I approach this? FYI, the DataFrame has around 800,000 rows.
When I run df.printSchema(), I don't see any white space, at least no leading white space. I am fairly sure that none of the column names contain spaces in the middle either.
At this point I am completely blocked. Any help would be greatly appreciated.
Method 1: Using withColumnRenamed(). The withColumnRenamed() method changes a column name of a PySpark DataFrame. Parameters: existing (str), the existing column name to rename; new (str), the new column name. Returns: a new DataFrame with the existing column renamed.
DataFrame.selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame. This is a variant of select() that accepts SQL expressions.
In PySpark, withColumn() is a widely used DataFrame transformation that changes the values of a column, converts the datatype of an existing column, or creates a new column.
In PySpark, drop() removes a single column: called with a column name as its argument, it returns a new DataFrame without that column.
That has happened to me a couple of times. Try this:
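tempList = []
for col in df.columns:
    new_name = col.strip()                 # trim leading and trailing whitespace
    new_name = "".join(new_name.split())   # remove whitespace inside the name
    new_name = new_name.replace('.', '')   # remove dots, which Spark reads as nested-field access
    tempList.append(new_name)
print(tempList)  # optional: inspect the cleaned names
df = df.toDF(*tempList)  # rebuild the DataFrame with the cleaned names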
The code trims leading and trailing whitespace, removes any whitespace inside the name, and strips dots from every column name in your DataFrame.
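With around 1600 columns, it can also help to list only the names that look problematic before renaming anything. The following is a minimal sketch, not part of the PySpark API: find_suspect_names and clean_names are hypothetical helper names, and the columns list is a made-up stand-in for df.columns. It flags names containing whitespace, dots, or non-printable characters, and it refuses to clean the names if doing so would make two of them identical.

```python
import re

# Characters the loop above removes: any whitespace and dots.
_SUSPECT = re.compile(r"[\s.]")


def find_suspect_names(columns):
    """Return (index, repr(name)) for names with whitespace, dots,
    or non-printable characters. repr() makes hidden characters visible."""
    suspects = []
    for i, name in enumerate(columns):
        if _SUSPECT.search(name) or not name.isprintable():
            suspects.append((i, repr(name)))
    return suspects


def clean_names(columns):
    """Remove whitespace and dots from every name.
    Raise if two cleaned names would collide."""
    cleaned = [_SUSPECT.sub("", name) for name in columns]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("cleaning would produce duplicate column names")
    return cleaned


# Made-up stand-in for df.columns (assumption for illustration only).
columns = ["col1", " col2", "col 3", "col.4", "col200\t"]

suspects = find_suspect_names(columns)
cleaned = clean_names(columns)

for index, shown in suspects:
    print(index, shown)

# With a real DataFrame, the cleaned names would be applied as:
# df = df.toDF(*clean_names(df.columns))
```

Printing repr(name) rather than the name itself is the point here: a trailing tab or newline is invisible in printSchema() output but shows up as an escape sequence in the repr.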