Drop multiple columns in PySpark using the drop() function. Calling drop() with a list of column names (unpacked with *) as arguments removes those columns.
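For example, a minimal sketch (the DataFrame and the column names id and id_copy here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1, "a")], ["id", "id_copy", "name"])
# pass the names directly, or unpack a Python list with *
df = df.drop("id", "id_copy")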
Removing duplicate columns after a join in PySpark: if we want to drop the duplicated column, we have to specify which DataFrame's copy of the column to drop after the join. Here we simply join the two DataFrames and then drop the duplicated column.
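A minimal sketch, assuming two DataFrames df1 and df2 that share an id column (the names are illustrative):
# drop df2's copy of the join key after the join
joined = df1.join(df2, df1.id == df2.id, "inner").drop(df2.id)
# alternatively, joining on the column name keeps only a single id column
joined = df1.join(df2, "id", "inner")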
In order to remove rows with NULL values in selected columns of a PySpark DataFrame, use na.drop() (the Scala API exposes this as drop(cols: Seq[String]) / drop(cols: Array[String])). Pass the names of the columns you want checked for NULL values to decide which rows to delete.
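A minimal sketch of the PySpark side, with hypothetical column names:
# keep only rows where both columns are non-NULL
df = df.na.drop(subset=["population", "type"])
# equivalent: df.dropna(subset=["population", "type"])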
Reading the Spark documentation, I found an easier solution. Since Spark 1.4 there is a drop(col) function that can be used on a PySpark DataFrame. You can use it in two ways:
df.drop('age')
df.drop(df.age)
Pyspark Documentation - Drop
Adding to @Patrick's answer, you can use the following to drop multiple columns:
columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)
An easy way to do this is to use select() and realize you can get a list of all columns for the DataFrame, df, with df.columns:
drop_list = ['a column', 'another column', ...]
df.select([column for column in df.columns if column not in drop_list])
You can do it in two ways:
1: You just keep the necessary columns:
drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])
2: This is the more elegant way.
df = df.drop("col_name")
You should avoid a collect()-based approach, because it sends the complete dataset to the driver, which takes a big computing effort!
You could either explicitly name the columns you want to keep, like so:
keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]
Or, in a more general approach, you'd include all columns except for a specific one via a list comprehension. For example, like this (excluding the id column from b):
keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
Finally you make a selection on your join result:
d = a.join(b, a.id==b.id, 'outer').select(*keep)