I am looking for a way to select columns of my DataFrame in PySpark. For the first row, I know I can use df.first(), but I am not sure how to do the same for columns, given that they do not have column names.
I have 5 columns and want to loop through each one of them.
+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|
You can select single or multiple columns of a DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns.
You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes and df.schema, and you can retrieve the data type of a specific column using df.schema["name"].dataType.
In general, we use "*" to select all the columns from a DataFrame; another way is to pass df.columns to select().
In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.
The select() function picks columns from a PySpark DataFrame. It is a transformation that takes an existing DataFrame and returns a new DataFrame containing only the columns that are needed.
The show() function is used to display the DataFrame contents.
In summary, select() is a DataFrame transformation used to select a single column, multiple columns, all columns from a list, columns by index, and nested struct columns.
Try something like this:
df.select([c for c in df.columns if c in ['_2','_4','_5']]).show()
First two columns and 5 rows:
df.select(df.columns[:2]).take(5)
You can use a list and unpack it inside the select:
cols = ['_2','_4','_5']
df.select(*cols).show()