Here's a line from a .txt file I'm reading in, and I'm assigning it to x:
import re
x = "Wild_lions live mostly in “Africa”"
result = re.split('[^a-zA-Z0-9]+', x)
I end up getting:
['Wild', 'lions', 'live', 'mostly', 'in', 'Africa', ''] # (the last element is an empty string)
Why is there an empty string at the end? I realize I can just do result.remove('') to get rid of it, but for large files I think this would be pretty inefficient.
You don't need such a complex regex for the split; a simpler approach is:
result = re.split(r'\s+', x)
result
# ['Wild_lions', 'live', 'mostly', 'in', '“Africa”']
The \s+ pattern matches one or more consecutive whitespace characters (tabs, spaces, newlines, etc.), so each run of whitespace counts as a single separator.
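To illustrate how a run of mixed whitespace is treated as one separator, here is a minimal sketch. The string `sample` is made up for this example and is not from the question.

```python
import re

# Hypothetical sample mixing a tab, a newline, and repeated spaces.
sample = "Wild_lions\tlive\nmostly   in Africa"

# Each run of one or more whitespace characters is a single separator.
tokens = re.split(r'\s+', sample)
print(tokens)
```

For plain whitespace splitting like this, sample.split() with no arguments returns the same list here, and it also ignores leading and trailing whitespace.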
If you only need the alphabetic parts, it's better to match them directly using re.compile with findall:
myre = re.compile('[a-zA-Z]+')
myre.findall(x)
# ['Wild', 'lions', 'live', 'mostly', 'in', 'Africa']
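As for the empty string in the question's output: the closing ” is not in [a-zA-Z0-9], so re.split treats it as a separator, and a separator at the very end of the string leaves an empty string after it. A small sketch comparing the approaches (the names parts, filtered, and matched are only for illustration):

```python
import re

x = "Wild_lions live mostly in “Africa”"

# The trailing ” matches the separator pattern, so split() returns
# the empty string that follows it as the last element.
parts = re.split(r'[^a-zA-Z0-9]+', x)

# Option 1: drop empty strings in a single pass over the list.
filtered = [p for p in parts if p]

# Option 2: match the tokens themselves instead of the separators.
matched = re.findall(r'[a-zA-Z0-9]+', x)

print(parts)
print(filtered)
print(matched)
```

Both options give the same tokens here. Matching with findall avoids producing the empty string in the first place, which is likely the simpler choice when processing a large file line by line.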
Try this, which splits on runs of whitespace and underscores:
x = "Wild_lions live mostly in 'Africa'"
result = re.split(r'[\s_]+', x)
You'll get:
['Wild', 'lions', 'live', 'mostly', 'in', "'Africa'"]