I'm trying to read one excel file which looks like below:
I also have one script which will convert this xlsx file into the csv files with its sheet names (If three sheets available then it will create three different csv files).
It's csv file looks like below:
Unnamed: 0,Gender A,Unnamed: 2,Gender B,Unnamed: 4,Gender C,Gender D
date,Male,Female,Male,Female,Male,Female
2017-01-01 00:00:00,2,3,3,2,3,3
2017-01-02 00:00:00,5,7,7,42,3,5
2017-01-03 00:00:00,4,6,6,12,2,7
2017-01-04 00:00:00,6,7,3,6,4,8
2017-01-05 00:00:00,6,8,8,3,5,3
2017-01-06 00:00:00,54,3,3,6,3,5
2017-01-07 00:00:00,3,4,6,3,6,5
2017-01-08 00:00:00,3,6,6,3,6,4
2017-01-09 00:00:00,2,2,8,7,5,2
2017-01-10 00:00:00,4,3,2,4,5,5
2017-01-11 00:00:00,12,10,10,3,1,6
2017-01-12 00:00:00,9,7,7,3,4,1
So, my first question is which is better choice to process these files - xlsx or csv?
Next, I just want to read first two rows as a column header. So that I can understand In which Gender how many males and females are available.
Expected Output:
0 date Gender A_Male Gender A_Female Gender B_Male Gender B_Female Gender C_Male Gender D_Female
1 2017-01-01 00:00:00 2 3 3 2 3 3
2 2017-01-02 00:00:00 5 7 7 42 3 5
3 2017-01-03 00:00:00 4 6 6 12 2 7
4 2017-01-04 00:00:00 6 7 3 6 4 8
5 2017-01-05 00:00:00 6 8 8 3 5 3
6 2017-01-06 00:00:00 54 3 3 6 3 5
7 2017-01-07 00:00:00 3 4 6 3 6 5
8 2017-01-08 00:00:00 3 6 6 3 6 4
9 2017-01-09 00:00:00 2 2 8 7 5 2
10 2017-01-10 00:00:00 4 3 2 4 5 5
11 2017-01-11 00:00:00 12 10 10 3 1 6
12 2017-01-12 00:00:00 9 7 7 3 4 1
We can use the concat function in pandas to append either columns or rows from one DataFrame to another. Let's grab two subsets of our data to see how this works. When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one.
Use concat() to Add a Row at Top of DataFrame Use pd. concat([new_row,df. loc[:]]). reset_index(drop=True) to add the row to the first position of the DataFrame as Index starts from zero.
Use DataFrame.append() method to concatenate DataFrames on rows. For E.x, df. append(df1) appends df1 to the df DataFrame.
Let's try:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1])
df.columns = df.columns.map('_'.join)
df.rename_axis('Date').reset_index()
Output:
Date Gender A_Male Gender A_Female Gender B_Male Gender B_Female \
0 2017-01-01 2 3 3 2
1 2017-01-02 5 7 7 42
2 2017-01-03 4 6 6 12
3 2017-01-04 6 7 3 6
4 2017-01-05 6 8 8 3
5 2017-01-06 54 3 3 6
6 2017-01-07 3 4 6 3
7 2017-01-08 3 6 6 3
8 2017-01-09 2 2 8 7
9 2017-01-10 4 3 2 4
10 2017-01-11 12 10 10 3
11 2017-01-12 9 7 7 3
Gender C_Male Gender D_Female
0 3 3
1 3 5
2 2 7
3 4 8
4 5 3
5 3 5
6 6 5
7 6 4
8 5 2
9 5 5
10 1 6
11 4 1
I love @ScottBoston's approach. Here are some cosmetic alternatives. If you want your column headers to look pretty, especially when the second row contains the units for the quantity, you can do something like this:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: '{}\n({})'.format(h[0], h[1]))
df.rename_axis('Date')
And if you want to make sure your column names don't contain spaces (so you can access them as properties of the DataFrame):
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: ' '.join(h).replace(' ', '_'))
df.rename_axis('Date')
Which gives:
Gender_A__Male Gender_A__Female ... Gender_C__Male Gender_D__Female
Date ...
2017-01-01 00:00:00 2 3 ... 3 3
2017-01-02 00:00:00 5 7 ... 3 5
2017-01-03 00:00:00 4 6 ... 2 7
2017-01-04 00:00:00 6 7 ... 4 8
2017-01-05 00:00:00 6 8 ... 5 3
2017-01-06 00:00:00 54 3 ... 3 5
2017-01-07 00:00:00 3 4 ... 6 5
2017-01-08 00:00:00 3 6 ... 6 4
2017-01-09 00:00:00 2 2 ... 5 2
2017-01-10 00:00:00 4 3 ... 5 5
2017-01-11 00:00:00 12 10 ... 1 6
2017-01-12 00:00:00 9 7 ... 4 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With