
Select columns in PySpark dataframe

I am looking for a way to select columns of my DataFrame in PySpark. For the first row, I know I can use df.first(), but I am not sure how to do the same for columns, given that they only have the default names (_1, _2, ...) rather than real column names.

I have 5 columns and want to loop through each one of them.

+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|
Nivi asked Oct 18 '17 14:10

3 Answers

Try something like this:

df.select([c for c in df.columns if c in ['_2','_4','_5']]).show()
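Two properties of the comprehension above are worth spelling out: names that are not in df.columns are skipped silently, and the result follows the DataFrame's column order rather than the order of the list. A minimal sketch of the same idea wrapped in a helper follows; the name pick_columns is hypothetical, and df is assumed to be a PySpark DataFrame (anything exposing .columns and .select would work).

```python
def pick_columns(df, wanted):
    """Return df restricted to the names in `wanted` that exist in df.

    Assumes `df` behaves like a pyspark.sql.DataFrame, i.e. it has a
    `.columns` list and a `.select(*names)` method. Names missing from
    df are skipped silently, and the result keeps df's own column order.
    """
    wanted = set(wanted)  # set lookup instead of scanning a list each time
    keep = [c for c in df.columns if c in wanted]
    return df.select(*keep)
```

With the DataFrame from the question, pick_columns(df, ['_5', '_2', '_9']).show() would be expected to display _2 and _5 in that order, since _9 does not exist.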
MaxU - stop WAR against UA answered Sep 30 '22 13:09


To get the first two columns and the first 5 rows:

df.select(df.columns[:2]).take(5)
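Note that take(5) returns a list of Row objects rather than a DataFrame. Since the question asks about looping over the columns, the same positional slicing of df.columns can drive a loop. This is only a sketch: iter_column_frames is a hypothetical helper name, and df is assumed to be a PySpark DataFrame.

```python
def iter_column_frames(df, start=None, stop=None):
    """Yield (name, single-column DataFrame) for df.columns[start:stop].

    Assumes `df` behaves like a pyspark.sql.DataFrame. Slicing follows
    normal Python rules, so start=1 skips the first column.
    """
    for name in df.columns[start:stop]:
        yield name, df.select(name)
```

For example, `for name, col_df in iter_column_frames(df, 1): col_df.show()` would walk over every column except the first one.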
Michael West answered Sep 30 '22 12:09


You can put the column names in a list and unpack it inside the select:

cols = ['_2','_4','_5']
df.select(*cols).show()
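Because the columns in the question carry the default names _1, _2, and so on, the list can also be built from 1-based positions instead of being typed out. A small sketch, where default_names is a hypothetical helper and the naming pattern is taken from the table in the question:

```python
def default_names(positions):
    """Map 1-based positions to the default names _1, _2, ... from the question."""
    names = []
    for i in positions:
        if i < 1:
            raise ValueError(f"positions are 1-based, got {i}")
        names.append(f"_{i}")
    return names


cols = default_names([2, 4, 5])
print(cols)  # ['_2', '_4', '_5']
```

With the question's DataFrame bound to df, df.select(*cols).show() and df.select(cols).show() should give the same result, since select accepts either unpacked names or a single list.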
Shadowtrooper answered Sep 30 '22 11:09