
Spark 1.6: drop column in DataFrame with escaped column names

I'm trying to drop a column in a DataFrame, but some of my column names contain dots, which I have escaped.

Before I escape, my schema looks like this:

root
 |-- user_id: long (nullable = true)
 |-- hourOfWeek: string (nullable = true)
 |-- observed: string (nullable = true)
 |-- raw.hourOfDay: long (nullable = true)
 |-- raw.minOfDay: long (nullable = true)
 |-- raw.dayOfWeek: long (nullable = true)
 |-- raw.sensor2: long (nullable = true)

If I try to drop a column, I get:

df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)

Note that I'm not even trying to drop the columns with dots in their names. Since I couldn't seem to do much without escaping the column names, I converted the schema to:

root
 |-- user_id: long (nullable = true)
 |-- hourOfWeek: string (nullable = true)
 |-- observed: string (nullable = true)
 |-- `raw.hourOfDay`: long (nullable = true)
 |-- `raw.minOfDay`: long (nullable = true)
 |-- `raw.dayOfWeek`: long (nullable = true)
 |-- `raw.sensor2`: long (nullable = true)

but that doesn't seem to help. I still get the same error.

I also tried escaping all column names and dropping using the escaped name, but that doesn't work either:

root
 |-- `user_id`: long (nullable = true)
 |-- `hourOfWeek`: string (nullable = true)
 |-- `observed`: string (nullable = true)
 |-- `raw.hourOfDay`: long (nullable = true)
 |-- `raw.minOfDay`: long (nullable = true)
 |-- `raw.dayOfWeek`: long (nullable = true)
 |-- `raw.sensor2`: long (nullable = true)

df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)

Is there another way to drop a column that would not fail on this type of data?
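For reference, the failure can be reproduced with a minimal DataFrame. This is a sketch for the Spark 1.6 Scala shell; the toy data and column subset are made up for illustration, and `sqlContext` is the shell's implicit SQLContext:

```scala
// Minimal reproduction sketch (Spark 1.6 spark-shell, sqlContext in scope).
// "raw.hourOfDay" is a top-level column whose name contains a literal dot.
val df = sqlContext
  .createDataFrame(Seq((1L, "mon", 7L)))
  .toDF("user_id", "hourOfWeek", "raw.hourOfDay")

// drop(colName: String) appears to rebuild the DataFrame by re-resolving
// every *remaining* column by name; "raw.hourOfDay" is then parsed as a
// struct field access (field hourOfDay of column raw) and resolution fails:
// df.drop("hourOfWeek")  // AnalysisException: cannot resolve 'raw.hourOfDay'
```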

asked Mar 14 '16 by MrE



2 Answers

Alright, I seem to have found the solution after all:

df.drop(df.col("raw.hourOfDay")) seems to work

answered Nov 15 '22 by MrE


val data = df.drop("Customers")

will work fine for normal columns. For a column whose name contains a dot, pass a Column instead:

val result = df.drop(df.col("old.column"))
answered Nov 15 '22 by sai chaithanya