I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Spark DataFrame provides a drop() method to drop a column/field from a DataFrame/Dataset. drop() method also used to remove multiple columns at a time from a Spark DataFrame/Dataset.
In pyspark the drop() function can be used to remove values/columns from the dataframe. thresh – This takes an integer value and drops rows that have less than that thresh hold non-null values.
We can use drop function to remove or delete columns from a DataFrame.
The Spark DataFrame provides the drop() method to drop the column or the field from the DataFrame or the Dataset. The drop() method is also used to remove the multiple columns from the Spark DataFrame or the Database.
Use drop method and withColumnRenamed methods.
Example:
val initialDf= .... val dfAfterDrop=initialDf.drop("column1").drop("coumn2") val dfAfterColRename= dfAfterDrop.withColumnRenamed("oldColumnName","new ColumnName")
Try this:
val initialDf = ... val dfAfterDropCols = initialDf.drop("column1", "coumn2")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With