I'd like to know how to write a regular expression that deletes the whitespace after a newline. For example, if my text looks like this:
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
how can I get:
["so","she","refused","to","exchange","the","feather","and","the","rock","because","she","was","afraid"]
I've tried using replace("-\n","") to join the halves back together, but I only get something like:
["be","cause"] and ["ex","change"]
Any suggestions? Thanks!
import re
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()
s = re.sub(r'-\n\s*', '', s) # join words hyphenated across a line break
s = re.sub(r'[^\w\s]', '', s) # remove punctuation
print(s.split())
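\s* matches zero or more whitespace characters (spaces, tabs, newlines), so any indentation at the start of the next line is removed along with the hyphen and the newline.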
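To see what the \s* part adds over a plain replace("-\n", ""), here is a minimal sketch. It assumes the continuation lines may start with spaces or tabs, which is only a guess at why the original attempt left the halves apart. The helper name join_hyphenated is made up for illustration.

```python
import re

def join_hyphenated(text):
    # Remove the hyphen, the newline after it, and any whitespace that follows.
    return re.sub(r'-\n\s*', '', text)

# Assumed input: continuation lines indented with spaces or a tab.
indented = "ex-\n    change be-\n\tcause"

print(join_hyphenated(indented).split())
# ['exchange', 'because']

# A plain replace leaves the indentation behind, so split() still separates the halves.
print(indented.replace("-\n", "").split())
# ['ex', 'change', 'be', 'cause']
```

If the text uses Windows line endings ("\r\n"), the pattern would need to allow for the carriage return as well, for example r'-\r?\n\s*'.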
From what I can tell, Alex Hall's answer addresses your question more fully (explicitly, in that it uses a regex, and implicitly, in that it adjusts capitalization and removes punctuation), but this problem jumped out at me as a good candidate for a generator.
Here is a generator that joins tokens popped from the front of a list:
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''
def condense(lst):
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-'):
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok
print(list(condense(s.split())))
# Result:
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
# 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
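As the result shows, the generator keeps the original capitalization and the trailing period. It would also raise an IndexError if the very last token ended with a hyphen. A possible variant that lowercases, strips surrounding punctuation, and tolerates a dangling hyphen is sketched below. The name condense_clean is hypothetical, and it assumes punctuation only needs removing from the ends of each token.

```python
import string

s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''

def condense_clean(tokens):
    tokens = iter(tokens)
    for tok in tokens:
        # Keep joining while the token ends with a hyphen and another token exists.
        while tok.endswith('-'):
            nxt = next(tokens, None)
            if nxt is None:
                break  # dangling hyphen at the end of the text
            tok = tok[:-1] + nxt
        # Normalize case and drop punctuation from both ends of the token.
        word = tok.lower().strip(string.punctuation)
        if word:
            yield word

print(list(condense_clean(s.split())))
# ['so', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
#  'and', 'the', 'rock', 'because', 'she', 'was', 'afraid']
```

Iterating with iter() and next() avoids the repeated lst.pop(0) calls, each of which shifts the whole list, and leaves the caller's list unmodified.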