
Python, regex split and special characters

How can I correctly split a string containing a sentence with special characters, using whitespace as the separator? With the regex split method I cannot get the desired result.

Example code:

# -*- coding: utf-8 -*-
import re


s="La felicità è tutto" # "Happiness is everything" in Italian
l=re.compile("(\W)").split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicit', '\xc3', '', '\xa0', '', ' ', '', '\xc3', '', '\xa8', '', ' ', 'tutto']
 word> La
 word>  
 word> felicit
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> 
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> tutto

while I'm looking for an output like:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Note that s is a string returned from another method, so I cannot force the encoding with a unicode literal like

s=u"La felicità è tutto"

I haven't found a satisfactory explanation in the official Python documentation on Unicode and regular expressions.

Thanks.

Alessandro

asked Mar 15 '09 11:03 by alexroat



1 Answer

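Your regex should be (\s) instead of (\W). Because s is a byte string, \W treats every byte of the UTF-8 encoded characters à and è as a non-word character and splits on it, whereas \s matches only whitespace: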

l = re.compile(r"(\s)").split(s)

The code above will give you the exact output you requested. However, the following line makes more sense:

l = re.compile(r"\s").split(s)

which splits on whitespace characters and doesn't return the spaces themselves as list items. You may need them, though, so I posted both versions.
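
For comparison, here is a rough sketch of both splits side by side, plus the original (\W) split to show where it goes wrong. It is written in Python 3 syntax, which is an assumption on my part since the question uses Python 2, and it assumes the other method hands back UTF-8 encoded bytes. The variable names (raw, broken, with_spaces, without_spaces, unicode_split) are only illustrative. The last step shows one possible alternative: decode the bytes first, after which (\W) behaves as the question expected.

```python
import re

# Assumption: the other method returns UTF-8 encoded bytes, like the
# Python 2 byte string in the question. Escapes keep the literal unambiguous.
raw = "La felicit\u00e0 \u00e8 tutto".encode("utf-8")

# On bytes, \W is ASCII-only, so each byte of "à" and "è" acts as a separator.
broken = re.split(rb"(\W)", raw)

# \s matches only ASCII whitespace here, so multi-byte characters survive.
with_spaces = re.split(rb"(\s)", raw)     # capturing group keeps the spaces
without_spaces = re.split(rb"\s", raw)    # no group: spaces are dropped

# Alternative: decode first; on str, \W is Unicode-aware by default.
text = raw.decode("utf-8")
unicode_split = re.split(r"(\W)", text)

print([part.decode("utf-8") for part in with_spaces])
print([part.decode("utf-8") for part in without_spaces])
print(unicode_split)
```

In Python 2 the decode step would look like s.decode("utf-8"), and the pattern would also need the re.UNICODE flag for \W to recognize accented letters as word characters.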

answered Nov 15 '22 18:11 by Andrew Hare