 

How to exclude multiple columns in Spark dataframe in Python

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

df.drop(['col1','col2'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261

TypeError: col should be a string or a Column
MYjx asked Feb 27 '16 19:02

1 Answer

Since PySpark 2.1.0, the drop method accepts multiple columns. Compare the signatures:

PySpark 2.0.2:

DataFrame.drop(col) 

PySpark 2.1.0:

DataFrame.drop(*cols) 

Example:

df.drop('col1', 'col2') 

or, equivalently, unpacking a list of column names with the * operator:

df.drop(*['col1', 'col2']) 
Patrick Z answered Sep 19 '22 17:09