Pandas: How to read a DataFrame from excel-file where multiple rows are sometimes separated by line break (\n)

Tags: python, pandas, dataframe, parsing, excel

I am trying to read some Excel files with pandas. In some files the table of interest is not cleanly formatted: several logical rows are stored in a single Excel row, with each cell holding multiple lines. The data looks fine when you view the Excel file, but after parsing it with pandas each such cell contains a newline character (\n) between the lines.

The problem is that read_excel() does not treat these line breaks as row separators. It returns a DataFrame in which each such row is still a single row with \n inside its cells. I would like to write code that converts each such row with N lines into N rows, using the line breaks as the indicator for a new row.

Is there a way to do this, either while parsing the file or by post-processing the DataFrame in Python?

Here is a very simplified dummy Excel file and some code to illustrate the problem.

Sample Excel file:

Name                | Price
-------------------------------
Coca Cola           |     46.66
-------------------------------
Google              |   1204.44
Facebook            |    177.58
-------------------------------
Berkshire Hathaway  | 306513.75

I simply use pandas' read_excel in Python:

import pandas

# file_name is the path to the Excel file
dataframe_parsed = pandas.read_excel(file_name)
print(dataframe_parsed.head())

I get the following DataFrame as output:

                 Name            Price
0           Coca Cola            46.66
1    Google\nFacebook  1204.44\n177.58
2  Berkshire Hathaway        306513.75
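
To reproduce the problem without the Excel file, the parsed frame can be approximated in memory. This is only a sketch: the dtypes below (floats for single-line numeric cells, strings for multi-line cells) are an assumption about what read_excel returns for such a sheet, and `has_break` is a name introduced here for illustration.

```python
import pandas as pd

# Assumed in-memory equivalent of the parsed sheet: multi-line cells are
# strings containing "\n", single-line numeric cells are floats.
dataframe_parsed = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

# Flag the rows whose Name cell contains an embedded line break.
has_break = dataframe_parsed["Name"].astype(str).str.contains("\n")
print(dataframe_parsed[has_break])
```

Only the second row is flagged, which is the row that needs to be split in two.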

The desired output is:

                 Name           Price
0           Coca Cola           46.66
1              Google         1204.44
2            Facebook          177.58
3  Berkshire Hathaway       306513.75

Any help will be highly appreciated.


asked Apr 10 '19 16:04

Frida Schenker

1 Answer

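After splitting each cell on the line break, you can expand the resulting lists into rows with an unnesting helper (defined below):

# Cast to str so numeric cells survive .str.split, then split on the real newline character
yourdf=unnesting(df.astype(str).apply(lambda x : x.str.split('\n')),['Name','Price'])
yourdf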
Out[50]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
1            Facebook     177.58
2  Berkshire Hathaway  306513.75

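import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element in that row's list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten the lists of every exploded column and place them side by side
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the columns that were not exploded
    return df1.join(df.drop(columns=explode), how='left')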

Since you mentioned that the above does not work, here is an alternative that needs no helper function:

df.astype(str).apply(lambda x : x.str.split('\n')).stack().apply(pd.Series).stack().unstack(level=1).reset_index(drop=True)
Out[57]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
2            Facebook     177.58
3  Berkshire Hathaway  306513.75
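
On newer pandas versions the same split-then-expand idea can probably be written with the built-in DataFrame.explode. The following is a minimal sketch, not the method from the answer above: it assumes pandas 1.3 or later (needed for exploding several columns at once), it assumes every multi-line row has the same number of lines in each column, and `split_multiline_rows` is a hypothetical helper name.

```python
import pandas as pd

# Assumed stand-in for the frame returned by read_excel in the question.
df = pd.DataFrame({
    "Name": ["Coca Cola", "Google\nFacebook", "Berkshire Hathaway"],
    "Price": [46.66, "1204.44\n177.58", 306513.75],
})

def split_multiline_rows(frame, columns):
    """Turn each row whose cells hold N newline-separated values into N rows."""
    out = frame.copy()
    for col in columns:
        # Cast to str first so numeric cells do not become NaN under .str.split.
        out[col] = out[col].astype(str).str.split("\n")
    # Multi-column explode requires matching list lengths within each row.
    return out.explode(columns, ignore_index=True)

result = split_multiline_rows(df, ["Name", "Price"])
# The split leaves Price as strings; convert back to numbers.
result["Price"] = pd.to_numeric(result["Price"])
print(result)
```

Passing ignore_index=True renumbers the rows 0 to 3, matching the desired output in the question rather than repeating the original index labels.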

answered Sep 28 '22 12:09

BENY

