I am trying to split a text file into words, with \n counted as a word.
My input is this text file:
War and Peace

by Leo Tolstoy/Tolstoi
And I want a list output like this:
['War','and','Peace','\n','\n','by','Leo','Tolstoy/Tolstoi']
Using .split(), I get this:
['War', 'and', 'Peace\n\nby', 'Leo', 'Tolstoy/Tolstoi']
So I started writing a program to put each \n as a separate entry after the word. The code follows:
for oldword in text:
    counter = 0
    newword = oldword
    while "\n" in newword:
        newword = newword.replace("\n","",1)
        counter += 1
    text[text.index(oldword)] = newword
    while counter > 0:
        text.insert(text.index(newword)+1, "\n")
        counter -= 1
However, the program seems to hang on the line counter -= 1, and I can't for the life of me figure out why.
NOTE: I realise that were this to work, the result would be ['Peaceby',"\n","\n"]; that is a different problem to be solved later.
You don't need anything that complicated. You can simply use a regular expression with re.findall() to find all the words and newlines:
>>> import re
>>> s="""War and Peace
...
... by Leo Tolstoy/Tolstoi"""
>>>
>>> re.findall(r'\S+|\n',s)
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']
The pattern '\S+|\n' matches either a run of one or more non-whitespace characters (\S+) or a single newline (\n).
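To see how the alternation behaves with other kinds of whitespace, here is a minimal sketch. The helper name split_keep_newlines is made up for illustration; only re.findall() and the pattern come from the answer itself. Runs of spaces and tabs are skipped, while each newline becomes its own entry.

```python
import re

def split_keep_newlines(text):
    # Hypothetical helper: returns words plus one '\n' entry per newline.
    # \S+ grabs a run of non-whitespace; \n grabs a single newline.
    # Spaces and tabs match neither alternative, so they are dropped.
    return re.findall(r'\S+|\n', text)

print(split_keep_newlines("a  b\n\nc\td"))
# ['a', 'b', '\n', '\n', 'c', 'd']
```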
If you want to get the text from a file you can do the following:
with open('file_name') as f:
    words = re.findall(r'\S+|\n', f.read())
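As for the loop in the question, it most likely never finishes because it inserts into text while iterating over it. Each inserted "\n" is later visited by the for loop, still contains "\n", and so triggers another insert. If you would rather avoid regex, a rough sketch of the same idea that builds a new list instead of mutating the old one could look like this (the function name is hypothetical, and it takes the whole file contents as one string):

```python
def split_words_and_newlines(text):
    # Hypothetical regex-free variant: build a fresh list rather than
    # inserting into a list that is being iterated over.
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        # Words on this line (an empty line contributes nothing).
        tokens.extend(line.split())
        # One "\n" entry for every newline that separated two lines.
        if i < len(lines) - 1:
            tokens.append("\n")
    return tokens

print(split_words_and_newlines("War and Peace\n\nby Leo Tolstoy/Tolstoi"))
# ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']
```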
Read more about regular expressions at http://www.regular-expressions.info/