
Python splitting text file keeping newlines

I am trying to split up a text file into words, with \n being counted as a word.

My input is this text file:

War and Peace

by Leo Tolstoy/Tolstoi

And I want a list output like this:

['War','and','Peace','\n','\n','by','Leo','Tolstoy/Tolstoi']

Using .split() I get this:

['War', 'and', 'Peace\n\nby', 'Leo', 'Tolstoy/Tolstoi']

So I started writing a program to put each \n as a separate entry after the word. The code follows:

for oldword in text:
    counter = 0
    newword = oldword
    while "\n" in newword:
        newword = newword.replace("\n","",1)
        counter += 1

    text[text.index(oldword)] = newword

    while counter > 0:
        text.insert(text.index(newword)+1, "\n")
        counter -= 1

However, the program seems to hang on the line counter -= 1, and I can't for the life of me figure out why.

NOTE: I realise that were this to work, the result would be ['Peaceby',"\n","\n"]; that is a different problem to be solved later.

Christopher Riches asked Mar 15 '23 02:03



1 Answer

You don't need anything that complicated. You can simply use a regular expression with re.findall() to find all the words and newlines:

>>> import re
>>> s="""War and Peace
... 
... by Leo Tolstoy/Tolstoi"""
>>> 
>>> re.findall(r'\S+|\n',s)
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']

'\S+|\n' matches either a run of one or more non-whitespace characters (\S+) or a single newline (\n). A newline is itself whitespace, so \S+ never consumes it and each newline comes out as its own item.
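A quick way to check how the pattern treats messier input is to try it on a string with repeated spaces and a tab. The sample string below is made up for illustration. Whitespace other than \n matches neither alternative, so it is simply skipped:

```python
import re

# Hypothetical sample: a double space, a tab, and two consecutive newlines.
sample = "War  and\tPeace\n\nby Leo"

# Spaces and tabs match neither \S+ nor \n, so they are dropped;
# each newline is returned as a separate item.
tokens = re.findall(r'\S+|\n', sample)
print(tokens)
# ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo']
```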

If you want to get the text from a file you can do the following:

with open('file_name') as f:
    words = re.findall(r'\S+|\n', f.read())

Read more about regular expressions at http://www.regular-expressions.info/
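As for why the loop in the question appears to hang: as far as I can tell, it inserts "\n" entries into text while the for loop is still iterating over that same list. When the loop reaches an inserted "\n", that entry itself contains "\n", so it is stripped to an empty string and yet another "\n" is inserted. The list keeps growing and the loop never reaches its end. If you would rather avoid regex, one possible sketch is to build a new list instead of modifying the one being iterated. Note that split_keep_newlines is a hypothetical helper name, not something from the question:

```python
def split_keep_newlines(text):
    """Split text on whitespace, keeping each newline as its own item."""
    tokens = []
    for line in text.split("\n"):
        # Words on this line, split on any run of whitespace.
        tokens.extend(line.split())
        # One "\n" entry for the line break that followed this line.
        tokens.append("\n")
    # The last line has no line break after it, so drop the extra entry.
    tokens.pop()
    return tokens


s = "War and Peace\n\nby Leo Tolstoy/Tolstoi"
print(split_keep_newlines(s))
# ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']
```

For the sample text this gives the same list as the re.findall() call above.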

Mazdak answered Mar 24 '23 20:03
