How to portably parse the (Unicode) degree symbol with regular expressions?

I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:

temp1:        +31.0°C  (crit = +107.0°C)

And here's the regex I'm using to match that (in Python):

temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+' 
                     r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')

This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:

(\+|-)(\d+\.\d+)\W\WC

which starts by matching the + or - sign and ends by matching the °C.

My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?

snim2 asked Jan 21 '12 10:01



1 Answer

Possible portable solution:

Convert the input data to Unicode and use the re.UNICODE flag in the regular expression.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


data = u'temp1:        +31.0°C  (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+' 
                     ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)

print temp_re.findall(data)

Output

[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
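As for why the original pattern needed two \W: in UTF-8 the degree sign (U+00B0) is encoded as the two bytes 0xC2 0xB0, so a pattern applied to undecoded bytes sees two non-word bytes where decoded text has one character. The following is a minimal sketch in Python 3 syntax (an assumption, since the code above is Python 2), with illustrative variable names:

```python
import re

line = "temp1:        +31.0°C  (crit = +107.0°C)"

# UTF-8 encodes the degree sign (U+00B0) as two bytes.
raw = line.encode("utf-8")
degree_bytes = "°".encode("utf-8")

# Bytes pattern: each byte of the degree sign is a separate non-word byte,
# so two \W are needed, and this only holds for UTF-8 input.
bytes_re = re.compile(rb"(\+|-)(\d+\.\d+)\W\WC")

# Text pattern: after decoding, the degree sign is a single character
# and can be matched literally.
text_re = re.compile(r"(\+|-)(\d+\.\d+)°C")

bytes_hits = bytes_re.findall(raw)
text_hits = text_re.findall(line)

# A single \W cannot bridge the digit and the "C" in the raw bytes.
single_w_match = re.search(rb"\d\WC", raw)
```

In a single-byte encoding such as Latin-1 or cp1251 the degree sign is one byte, so the \W\W version would fail there, which is why matching on decoded text is the more portable choice.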

EDIT

@netvope already pointed this out in the comments on the question.

Update

Notes from J.F. Sebastian's comments about input encoding:

check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.

So, to decode the input data to Unicode, you should basically* use the system locale's encoding, obtained from locale.getpreferredencoding(), e.g.:

data = subprocess.check_output(...).decode(locale.getpreferredencoding())
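The one-liner above can be wrapped in a small helper that also accepts an explicit encoding. This is only a sketch in Python 3 syntax; decode_output and read_command are hypothetical names, not part of any library:

```python
import locale
import subprocess


def decode_output(raw, encoding=None):
    """Decode raw bytes, falling back to the locale's preferred encoding."""
    if encoding is None:
        # Assumption: the producing program wrote its text in the locale encoding.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


def read_command(cmd, encoding=None):
    """Run cmd (a list of arguments) and return its output as text."""
    return decode_output(subprocess.check_output(cmd), encoding)
```

A call such as read_command(["sensors"]) would then return text that the Unicode pattern can be applied to, provided sensors really does write in the locale's encoding.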

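With the data decoded correctly, you'll get the same output even without re.UNICODE in this case, because the pattern matches ° literally and \s and \d only have to match ASCII characters here.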


Why "basically"? Because the locale's encoding is not always the encoding the program actually used. For example, on Russian Windows 7, where the preferred encoding is cp1251, suppose we have a script.py that encodes its output as UTF-8:

#!/usr/bin/env python
# -*- coding: utf8 -*-

print u'temp1: +31.0°C  (crit = +107.0°C)'.encode('utf-8')

And we need to parse its output:

subprocess.check_output(['python', 
                         'script.py']).decode(locale.getpreferredencoding())

This will produce wrong results: 'В°' instead of '°'.

So in some cases you need to know the actual encoding of the input data.
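The mismatch can be reproduced without Windows or a subprocess by decoding UTF-8 bytes with the wrong codec. A minimal sketch in Python 3 syntax, with illustrative variable names:

```python
# Bytes as a program writing UTF-8 would emit them.
raw = "temp1: +31.0°C  (crit = +107.0°C)".encode("utf-8")

# Decoding with cp1251 maps byte 0xC2 to the Cyrillic letter 'В'
# and byte 0xB0 to '°', so every degree sign turns into 'В°'.
wrong = raw.decode("cp1251")

# Decoding with the encoding the producer actually used restores the text.
right = raw.decode("utf-8")
```

The wrong decoding raises no error here, so a pattern expecting °C directly after the number simply stops matching, which makes this kind of mistake easy to miss.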

reclosedev answered Oct 06 '22 17:10
