How to split a python dataframe based on new line characters?

I have a pandas dataframe in which one column contains paragraphs of text. I want to expand that column into separate columns by splitting each paragraph at its newline characters. A paragraph may contain multiple newlines.

Example dataframe (current output):
A
foo bar
foo bar\nfoo bar
foo bar
foo bar

Desired output:

   A         B                                                      
0 foo bar                                                  
1 foo bar   foo bar                                                 
2 foo bar                                                  
3 foo bar                                                  

I have tried using this:

df.A.str.split(expand=True)

But it splits at every whitespace character, not only at "\n" as I expected.

Adam Choy asked Sep 10 '25 15:09


2 Answers

As stated in the docs, you can specify the delimiter to split on through the optional pat parameter of the split method; if you omit it, the string is split on whitespace:

"String or regular expression to split on. If not specified, split on whitespace."

Therefore, you can do the following to split on newlines only:

df.A.str.split(pat="\n", expand=True)
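As a minimal sketch of the difference, assuming the sample data from the question (the column names A and B and the fillna('') step are illustrative choices made here to match the desired output, not something the docs prescribe):

```python
import pandas as pd

# Sample data assumed from the question: one row contains an embedded newline.
df = pd.DataFrame({"A": ["foo bar", "foo bar\nfoo bar", "foo bar", "foo bar"]})

# No pattern given: every run of whitespace (spaces AND newlines) is a delimiter.
by_whitespace = df["A"].str.split(expand=True)

# Explicit pattern: split only at newline characters.
by_newline = df["A"].str.split(pat="\n", expand=True)

# Optional tidy-up (an assumption about the desired layout):
# blank out missing cells and name the columns A and B.
result = by_newline.fillna("")
result.columns = ["A", "B"]
print(result)
```

Here by_whitespace ends up with four columns because the row containing the newline holds four whitespace-separated words, while by_newline has only two.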
Drumstick answered Sep 13 '25 11:09


You have to pass the pattern to split on as an argument to Series.str.split(). Here is a complete reproducible example that works on Windows systems:

import pandas as pd

df = pd.DataFrame({'A': ['foo bar', 
                         'foo bar\nfoo bar',
                         'foo bar',
                         'foo bar']})

df.A.str.split(pat='\n', expand=True)
    0           1
0   foo bar     None
1   foo bar     foo bar
2   foo bar     None
3   foo bar     None

For a platform-independent solution, I would do something similar to @ThePyGuy's answer, but with str.splitlines(), because this method will recognize line boundaries from various systems.

df.A.apply(str.splitlines).apply(pd.Series).fillna('')
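To illustrate why str.splitlines() is the more portable choice, here is a small plain-Python sketch. The sample strings are made up for illustration and are not taken from the question:

```python
# Hypothetical samples using Unix (\n), Windows (\r\n) and classic Mac (\r) line endings.
samples = ["foo bar\nfoo bar", "foo bar\r\nfoo bar", "foo bar\rfoo bar"]

# Splitting on "\n" leaves a stray "\r" behind for \r\n and misses a lone \r entirely.
split_on_newline = [s.split("\n") for s in samples]

# str.splitlines() recognizes all three conventions and drops the line endings.
split_on_lines = [s.splitlines() for s in samples]

for raw, a, b in zip(samples, split_on_newline, split_on_lines):
    print(repr(raw), a, b)
```

Note that str.splitlines() also treats a few other characters, such as form feed and vertical tab, as line boundaries, so it may split in more places than a plain "\n" pattern would.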
Arne answered Sep 13 '25 11:09