Say I have the following Excel file:
A B C
0 - - -
1 Start - -
2 3 2 4
3 7 8 4
4 11 2 17
I want to read the file in a dataframe making sure that I start to read it below the row where the Start
value is.
Attention: the Start
value is not always located in the same row, so if I were to use:
import pandas as pd
xls = pd.ExcelFile('C:\Users\MyFolder\MyFile.xlsx')
df = xls.parse('Sheet1', skiprows=4, index_col=None)
this would fail as skiprows
needs to be fixed. Is there any workaround to make sure that xls.parse
finds the string value instead of the row number?
To read an excel file as a DataFrame, use the pandas read_excel() method. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts this to the DataFrame structure, which is a tabular like structure.
In order to perform this task, we will be using the Openpyxl module in python. Openpyxl is a Python library for reading and writing Excel (with extension xlsx/xlsm/xltx/xltm) files. The openpyxl module allows a Python program to read and modify Excel files.
To read Excel files in Python’s Pandas, use the read_excel () function. You can specify the path to the file and a sheet name to read, as shown below:
Pandas Solutions The simplest solution for this data set is to use the header and usecols arguments to read_excel (). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include. If you would like to follow along with these examples, the file is on github.
This allows you to quickly load the file to better be able to explore the different columns and data types. This can be done using the nrows= parameter, which accepts an integer value of the number of rows you want to read into your DataFrame. Let’s see how we can read the first five rows of the Excel sheet:
Now we can import the excel file using the read_excel function in pandas, as shown below: The second statement reads the data from excel and stores it into a pandas Data Frame which is represented by the variable newData. If there are multiple sheets in the excel workbook, the command will import data of the first sheet.
df = pd.read_excel('your/path/filename')
This answer helps in finding the location of 'start' in the df
for row in range(df.shape[0]):
for col in range(df.shape[1]):
if df.iat[row,col] == 'start':
row_start = row
break
after having row_start you can use subframe of pandas
df_required = df.loc[row_start:]
And if you don't need the row containing 'start', just u increment row_start by 1
df_required = df.loc[row_start+1:]
If you know the specific rows you are interested in, you can skip from the top using skiprow
and then parse only the row (or rows) you want using nrows
- see pandas.read_excel
df = pd.read_excel('myfile.xlsx', 'Sheet1', skiprows=2, nrows=3,)
You could use pd.read_excel('C:\Users\MyFolder\MyFile.xlsx', sheet_name='Sheet1')
as it ignores empty excel cells.
Your DataFrame should then look like this:
A B C
0 Start NaN NaN
1 3 2 4
2 7 8 4
3 11 2 17
Then drop the first row by using
df.drop([0])
to get
A B C
0 3 2 4
1 7 8 4
2 11 2 17
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With