How can I split correctly a string containing a sentence with special chars using whitespaces as separator ? Using regex split method I cannot obtain the desired result.
Example code:
# -*- coding: utf-8 -*-
import re
s="La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)").split(s)
print " s> "+s
print " wordlist> "+str(l)
for i in l:
print " word> "+i
The output is :
s> La felicità è tutto
wordlist> ['La', ' ', 'felicit', '\xc3', '', '\xa0', '', ' ', '', '\xc3', '', '\xa8', '', ' ', 'tutto']
word> La
word>
word> felicit
word> Ã
word>
word> ?
word>
word>
word>
word> Ã
word>
word> ?
word>
word>
word> tutto
while I'm looking for an output like:
s> La felicità è tutto
wordlist> ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
word> La
word>
word> felicità
word>
word> è
word>
word> tutto
To be noted that s is a string that is returned from another method so I cannot force the encoding like
s=u"La felicità è tutto"
On official python documentation of Unicode and reg-ex I haven't found a satisfactory explanation.
Thanks.
Alessandro
Method 1: Split multiple characters from string using re. split() This is the most efficient and commonly used method to split multiple characters at once. It makes use of regex(regular expressions) in order to do this.
The re. split() function splits the given string according to the occurrence of a particular character or pattern. Upon finding the pattern, this function returns the remaining characters from the string in a list.
Your regex should be (\s)
instead of (\W)
like this:
l = re.compile("(\s)").split(s)
The code above will give you the exact output you requested. However the following line makes more sense:
l = re.compile("\s").split(s)
which splits on whitespace characters and doesn't give you all the spaces as matches. You may need them though so I posted both answers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With