
How to select and order multiple columns in a Pyspark Dataframe after a join

I want to select multiple columns from an existing DataFrame (created after joins) and order the fields to match my target table structure. How can this be done? The approach I have used is below. It selects the required columns, but not in the required sequence.

Required (Target Table structure) :
hist_columns = ("acct_nbr","account_sk_id", "zip_code","primary_state", "eff_start_date" ,"eff_end_date","eff_flag")

account_sk_df = hist_process_df.join(broadcast(df_sk_lkp) ,'acct_nbr','inner' )
account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

>>> account_sk_df
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]


>>> account_sk_df_ld
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

account_sk_id needs to be in 2nd place. What's the best way to do this?

user3858193 asked Nov 07 '16 14:11



People also ask

How do I select multiple columns from a Spark DataFrame PySpark?

You can select one or more columns of a DataFrame by passing the column names to the select() function. Since a DataFrame is immutable, this creates a new DataFrame containing only the selected columns. The show() function displays the DataFrame contents.

How do you drop multiple columns after join in PySpark?

To drop multiple columns in PySpark, use the drop() function. Pass the column names as separate arguments (a list of names can be unpacked with *), and it returns a new DataFrame without those columns.

How do you rearrange columns in a DataFrame PySpark?

To rearrange or reorder columns in PySpark, use the select() function with the column names in the desired order. For ascending alphabetical order, pass the names through Python's sorted() function; for descending order, use sorted() with reverse=True. Columns can also be rearranged by position.
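As a rough sketch of the three orderings described above, the snippet below works on a plain Python list of column names. The names are hypothetical stand-ins for df.columns, and no Spark session is needed to run it.

```python
# Hypothetical column names, standing in for df.columns on a real DataFrame.
columns = ["zip_code", "acct_nbr", "eff_flag", "primary_state"]

# Ascending alphabetical order.
ascending = sorted(columns)

# Descending alphabetical order.
descending = sorted(columns, reverse=True)

# By position: move the column at index 1 to the front, keep the rest as-is.
by_position = [columns[1]] + columns[:1] + columns[2:]

print(ascending)
print(descending)
print(by_position)
```

With a real DataFrame, any of these lists could then be passed on, for example df.select(*ascending).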


1 Answer

Select the columns by passing the target list directly instead of iterating over the existing columns; the result then follows the order of the list:

account_sk_df_ld = account_sk_df.select(*hist_columns)
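To see why this changes the order, here is a minimal sketch using plain Python lists in place of the DataFrame. The column names are taken from the question; the variable names joined_columns, filtered, targeted, and missing are illustrative assumptions. The comprehension in the question iterates over the DataFrame's columns, so it keeps the DataFrame's order, while iterating over hist_columns keeps the target order.

```python
# Target order, as given in the question.
hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                "eff_start_date", "eff_end_date", "eff_flag")

# Column order of the joined DataFrame, as printed in the question
# (a plain list standing in for account_sk_df.columns).
joined_columns = ["acct_nbr", "primary_state", "zip_code", "eff_start_date",
                  "eff_end_date", "eff_flag", "hash_sk_id", "account_sk_id"]

# The question's approach: iterates the DataFrame's columns,
# so the result keeps the DataFrame's order.
filtered = [c for c in joined_columns if c in hist_columns]

# Driving the iteration from the target list keeps the target order.
targeted = [c for c in hist_columns if c in joined_columns]

# Optional sanity check: target columns that the join did not produce.
missing = [c for c in hist_columns if c not in joined_columns]

print(filtered)
print(targeted)
print(missing)
```

Since every target column is present here, targeted equals hist_columns, which is why passing hist_columns straight to select() is enough.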
Mariusz answered Oct 06 '22 11:10
