I am trying to read some Excel files with pandas. In some files, the table of interest is not perfectly formatted: several logical rows are stored as a single row whose cells each contain multiple lines. The data looks fine when you view the Excel file, and after parsing it with pandas there is indeed a newline character (\n) separating the lines inside each such cell.
The problem is that read_excel() returns a DataFrame that does not treat these line breaks as row separators; each multi-line row stays a single row with \n embedded in its values. I would like to write code that converts each such row with N lines into N rows, using the line breaks as the indicator for a new row.
Is there a way to do it either while parsing the file or post-processing the dataframe in Python?
Here I provide a very simplified version of my dummy excel-file and some code to explain the problem.
Sample Excel-File:
Name | Price
-------------------------------
Coca Cola | 46.66
-------------------------------
Google | 1204.44
Facebook | 177.58
-------------------------------
Berkshire Hathaway | 306513.75
I simply use Pandas' read_excel in Python:
dataframe_parsed = pandas.read_excel(file_name)
print(dataframe_parsed.head())
I get the following DataFrame as output:
Name Price
0 Coca Cola 46.66
1 Google\nFacebook 1204.44\n177.58
2 Berkshire Hathaway 306513.75
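For anyone who wants to reproduce this without the Excel file, here is a minimal sketch that builds the same frame by hand. It assumes the multi-line cells come back as strings containing a real newline character and the single-line Price cells come back as floats; the actual dtypes may differ depending on the workbook.

```python
import pandas as pd

# Hand-built stand-in for the parsed sheet (assumed cell types, see above).
dataframe_parsed = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

# repr() makes the embedded line break visible.
print(repr(dataframe_parsed.loc[1, "Name"]))
print(repr(dataframe_parsed.loc[1, "Price"]))
```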
The desired output is:
Name Price
0 Coca Cola 46.66
1 Google 1204.44
2 Facebook 177.58
3 Berkshire Hathaway 306513.75
Any help will be highly appreciated.
Use the pandas.read_excel() function to read an Excel sheet into a pandas DataFrame. By default it loads the first sheet of the file and uses the first row as the DataFrame column names. Excel files typically have the .xlsx extension.
We can also add multiple rows using pandas.concat() by creating a new DataFrame of all the rows that we need to add and then concatenating it with the original DataFrame.
To read an Excel file as a DataFrame, use the pandas read_excel() function. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts the data into a DataFrame, which is a table-like structure.
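After splitting each cell on the line break, you can expand the resulting lists into rows with the unnesting helper defined below. Cast to str first, because single-line numeric cells such as 46.66 may be read as numbers rather than strings, and .str.split would turn those into NaN. Split on '\n' (a real newline); the pattern r'\\n' only matches a literal backslash followed by n:
yourdf = unnesting(df.apply(lambda x: x.astype(str).str.split('\n')), ['Name', 'Price'])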
yourdf
Out[50]:
Name Price
0 Coca Cola 46.66
1 Google 1204.44
1 Facebook 177.58
2 Berkshire Hathaway 306513.75
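import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per list element in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column and place the results side by side
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')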
Since you mentioned that the above does not work, here is an alternative that relies only on stack and unstack:
df.apply(lambda x: x.astype(str).str.split('\n')).stack().apply(pd.Series).stack().unstack(level=1).dropna(how='all').reset_index(drop=True)
Out[57]:
Name Price
0 Coca Cola 46.66
1 Google 1204.44
2 Facebook 177.58
3 Berkshire Hathaway 306513.75
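On newer pandas versions the same result can probably be had with DataFrame.explode, which accepts a list of columns from pandas 1.3 onward. The following is a minimal sketch rather than a drop-in solution: split_multiline_rows is a made-up helper name, the sample frame is built by hand instead of being read from Excel, and it assumes every multi-line cell in a row contains the same number of lines (explode raises a ValueError otherwise). Note that astype(str) also turns missing values into the string 'nan'.

```python
import pandas as pd

def split_multiline_rows(df, columns, sep="\n"):
    """Hypothetical helper: turn each row with N lines per cell into N rows."""
    out = df.copy()
    for col in columns:
        # Cast to str so numeric single-line cells become one-element lists.
        out[col] = out[col].astype(str).str.split(sep)
    # Multi-column explode needs pandas >= 1.3.
    return out.explode(columns, ignore_index=True)

# Hand-built stand-in for the frame returned by read_excel (assumed dtypes).
df = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

result = split_multiline_rows(df, ["Name", "Price"])
# The split leaves strings behind, so convert Price back to numbers.
result["Price"] = pd.to_numeric(result["Price"])
print(result)
```

Passing ignore_index=True gives the 0 to 3 index from the desired output directly, so no reset_index call is needed afterwards.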