
Pandas: How to read a DataFrame from excel-file where multiple rows are sometimes separated by line break (\n)

I am trying to read some Excel files with pandas. In some files the table of interest is not cleanly formatted: several logical rows are stored as a single Excel row whose cells each contain multiple lines. The data looks fine when you view the file in Excel, but after parsing it with pandas each such cell contains a newline character (\n) between the lines.

The problem is that read_excel() returns a DataFrame that does not treat these line breaks as row separators; it keeps everything in one row with \n embedded in the values. I would like to write code that converts each such row with N lines into N rows, using the line breaks as the indicator for a new row.

Is there a way to do this, either while parsing the file or by post-processing the DataFrame in Python?

Here I provide a very simplified version of my dummy excel-file and some code to explain the problem.

Sample Excel-File:

Name                | Price
-------------------------------
Coca Cola           |     46.66
-------------------------------
Google              |   1204.44
Facebook            |    177.58
-------------------------------
Berkshire Hathaway  | 306513.75

I simply use Pandas' read_excel in Python:

import pandas

dataframe_parsed = pandas.read_excel(file_name)
print(dataframe_parsed.head())

I get the following DataFrame as output:

                 Name            Price
0           Coca Cola            46.66
1    Google\nFacebook  1204.44\n177.58
2  Berkshire Hathaway        306513.75
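
For anyone who wants to reproduce this without the Excel file, the parsed frame can be approximated by hand. This is only a sketch: it assumes the multi-line cells arrive as strings containing a real newline while the remaining Price cells stay numeric, and the names df and multiline_rows are just for illustration.

```python
import pandas as pd

# Stand-in for pandas.read_excel(file_name), using the values from the sample above.
df = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

# Rows whose Name cell contains a line break are the ones that need splitting.
multiline_rows = df[df["Name"].astype(str).str.contains("\n", regex=False)]
print(multiline_rows.index.tolist())
```

Only the row at index 1 is flagged here, which is the Google/Facebook row.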

The desired output is:

                 Name           Price
0           Coca Cola           46.66
1              Google         1204.44
2            Facebook          177.58
3  Berkshire Hathaway       306513.75

Any help will be highly appreciated.

Frida Schenker asked Apr 10 '19 16:04




1 Answer

After splitting each cell on the line break, you can expand the resulting lists into rows with the unnesting helper (defined further down):

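# astype(str) keeps numeric cells from turning into NaN under .str.split;
# use r'\\n' as the pattern instead if the cells hold a literal backslash followed by n
yourdf = unnesting(df.astype(str).apply(lambda x: x.str.split('\n')), ['Name', 'Price'])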
yourdf
Out[50]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
1            Facebook     177.58
2  Berkshire Hathaway  306513.75

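import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element in that row's list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column and place the results side by side
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the columns that were not exploded
    return df1.join(df.drop(columns=explode), how='left')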

Since you mentioned that the approach above does not work for you, here is an alternative that does not need the helper:

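(df.astype(str)
   .apply(lambda x: x.str.split('\n'))  # each cell becomes a list of lines
   .stack()                             # one list per (row, column) pair
   .apply(pd.Series)                    # one column per line position
   .stack()                             # move line positions into the index
   .unstack(level=1)                    # bring Name/Price back as columns
   .reset_index(drop=True))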
Out[57]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
2            Facebook     177.58
3  Berkshire Hathaway  306513.75
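
On pandas 1.3 or newer, the built-in DataFrame.explode may be a simpler alternative to both approaches above, since it accepts several columns at once. The following is a minimal sketch of that swapped-in technique, not the method used earlier in this answer: the helper name split_multiline_rows and the hand-built df are made up for illustration, and it assumes every multi-line row has the same number of lines in each listed column.

```python
import pandas as pd

# Hand-built stand-in for the parsed sheet (values taken from the question).
df = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

def split_multiline_rows(frame, columns):
    """Turn each row whose cells hold N newline-separated values into N rows."""
    out = frame.copy()
    for col in columns:
        # astype(str) so numeric cells survive .str.split instead of becoming NaN
        out[col] = out[col].astype(str).str.split("\n")
    # Multi-column explode needs pandas >= 1.3 and equal list lengths per row
    return out.explode(columns, ignore_index=True)

result = split_multiline_rows(df, ["Name", "Price"])
# Prices are strings after the split; convert them back to numbers
result["Price"] = result["Price"].astype(float)
print(result)
```

Passing ignore_index=True gives a fresh 0..3 index, which matches the desired output in the question; leave it out to keep the original row labels instead.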
BENY answered Sep 28 '22 12:09