 

How to delete columns in pyspark dataframe

People also ask

How do I drop multiple columns in a DataFrame PySpark?

Drop multiple columns in PySpark using the drop() function. Passing several column names (or an unpacked list of names) to drop() removes all of those columns.
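
For example, a minimal sketch (assuming a hypothetical DataFrame df with columns "id", "name" and "age"):

df = df.drop("name", "age")        # pass the column names directly
# or unpack a list of names:
cols_to_drop = ["name", "age"]
df = df.drop(*cols_to_drop)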

How do you delete duplicate columns in PySpark?

Removing duplicate columns after a join in PySpark: if you want to avoid a duplicate column, specify the shared column in the join itself (or drop the duplicate column from one side afterwards). In other words, join the two DataFrames and then drop the duplicated column.
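
For example, a minimal sketch (assuming two hypothetical DataFrames df1 and df2 that share an "id" column):

joined = df1.join(df2, on="id", how="inner")          # joining on the name keeps a single "id" column
# or, when joining on an expression, drop the duplicate column from one side:
joined = df1.join(df2, df1.id == df2.id).drop(df2.id)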

How do you drop NULL columns in PySpark?

In order to remove rows with NULL values in selected columns of a PySpark DataFrame, use df.na.drop(subset=[...]) or the equivalent df.dropna(subset=[...]). Pass the names of the columns you want to check for NULL values; rows with a NULL in any of those columns are deleted.
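
For example, a minimal sketch (assuming a hypothetical DataFrame df with columns "name" and "age"):

df_clean = df.na.drop(subset=["name", "age"])    # drops rows with a NULL in "name" or "age"
# equivalent form:
df_clean = df.dropna(subset=["name", "age"])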


Reading the Spark documentation, I found an easier solution.

Since Spark 1.4 there is a drop(col) function that can be used on a PySpark DataFrame.

You can use it in two ways:

  1. df.drop('age')
  2. df.drop(df.age)

Pyspark Documentation - Drop
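
A minimal runnable sketch of both forms, assuming a toy DataFrame with an "age" column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

df.drop('age').show()      # drop by column name
df.drop(df.age).show()     # drop by Column reference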


Adding to @Patrick's answer, you can use the following to drop multiple columns:

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)

An easy way to do this is to use select() together with the fact that you can get a list of all columns of the DataFrame df with df.columns:

drop_list = ['a column', 'another column', ...]

df.select([column for column in df.columns if column not in drop_list])

You can do this in two ways:

1: You just keep the necessary columns:

drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])  

2: This is the more elegant way.

df = df.drop("col_name")

You should avoid any collect()-based approach, because it sends the complete dataset to the master (driver), which takes a lot of memory and computing effort!


You could either explicitly name the columns you want to keep, like so:

keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]

Or, in a more general approach, you can include all columns except specific ones via a list comprehension. For example, to exclude the id column from b:

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']

Finally, you make a selection on your join result:

d = a.join(b, a.id==b.id, 'outer').select(*keep)