python, regex split and special character

Tags:

python

regex

split

unicode

How can I correctly split a string containing a sentence with special characters, using whitespace as the separator? Using the regex split method I cannot obtain the desired result.

Example code:

# -*- coding: utf-8 -*-
import re


s="La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)").split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicit', '\xc3', '', '\xa0', '', ' ', '', '\xc3', '', '\xa8', '', ' ', 'tutto']
 word> La
 word>  
 word> felicit
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> 
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> tutto

whereas I'm looking for output like this:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Note that s is a string returned from another method, so I cannot force the encoding like this:

s=u"La felicità è tutto"

I haven't found a satisfactory explanation in the official Python documentation on Unicode and regular expressions.

Thanks.

Alessandro

829

asked Mar 15 '09 11:03

alexroat

1 Answer

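Your regex should be (\s) instead of (\W). \W matches any non-word character, and in a byte string the individual bytes that encode à and è are not word characters, so the split happens inside those letters. Splitting on whitespace avoids that:

l = re.compile(r"(\s)").split(s)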

The code above will give you the exact output you requested. However, the following line makes more sense:

l = re.compile(r"\s").split(s)

which splits on whitespace characters but does not include the separators themselves in the result. You may need them, though, so I have posted both versions.
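
As a rough illustration of the difference, here is a minimal sketch written for Python 3 (the question itself uses Python 2). It assumes the string arrives as UTF-8 encoded bytes, which is roughly what a Python 2 str holds here; the variable names are made up for the example. The last step, decoding before splitting, is an alternative not mentioned above and only works if the encoding really is UTF-8.

```python
import re

# Assumption: the data arrives as UTF-8 bytes, similar to a Python 2 `str`.
# The escapes \u00e0 and \u00e8 are the letters "à" and "è".
raw = "La felicit\u00e0 \u00e8 tutto".encode("utf-8")

# (\W) on bytes is ASCII-only, so every byte of "à" and "è" is treated
# as a separator. This reproduces the broken word list from the question.
broken = re.split(rb"(\W)", raw)

# (\s) splits on whitespace only; the capturing group keeps the spaces.
with_spaces = [part.decode("utf-8") for part in re.split(rb"(\s)", raw)]

# \s without the group drops the spaces from the result.
words_only = [part.decode("utf-8") for part in re.split(rb"\s", raw)]

# Alternative (assumes the bytes are valid UTF-8): decode first.
# On Python 3 text strings \W is Unicode-aware, so accented letters
# count as word characters and (\W) behaves as the asker expected.
decoded = re.split(r"(\W)", raw.decode("utf-8"))

print(broken)
print(with_spaces)
print(words_only)
print(decoded)
```

In Python 2 the equivalent of the last step would be to decode the byte string and compile the pattern with the re.UNICODE flag.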

104

answered Nov 15 '22 18:11

Andrew Hare
