I'm writing a simple regular expression parser for the output of the sensors
utility on Ubuntu. Here's an example of a line of text I'm parsing:
temp1: +31.0°C (crit = +107.0°C)
And here's the regex I'm using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?
Possible portable solution:
Convert the input data to Unicode and use the re.UNICODE flag in the regular expression.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
data = u'temp1: +31.0°C (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+'
ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)
print temp_re.findall(data)
Output
[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
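For anyone on Python 3, where str is already Unicode and the ur'' prefix is no longer valid syntax, here is a minimal sketch of the same idea. The helper name parse_temps and the conversion to float are my own additions for illustration, not part of the original answer.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so no flag is needed.
TEMP_RE = re.compile(
    r'(temp1:)\s+([+-])(\d+\.\d+)°C\s+'
    r'\(crit\s+=\s+([+-])(\d+\.\d+)°C\)'
)


def parse_temps(line):
    """Return (current, critical) as floats, or None if the line does not match."""
    m = TEMP_RE.match(line)
    if m is None:
        return None
    _, sign, value, crit_sign, crit_value = m.groups()
    return float(sign + value), float(crit_sign + crit_value)


print(parse_temps('temp1: +31.0°C (crit = +107.0°C)'))  # (31.0, 107.0)
```

The pattern matches the literal °C instead of \W\WC, so it only works on text that has already been decoded.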
EDIT
@netvope already pointed this out in the comments on the question.
Update
Notes from J.F. Sebastian comments about input encoding:
check_output() returns binary data that is sometimes text (in which case it has a known character encoding and you can convert it to Unicode). In any case, ord(u'°') == 176, so the degree sign cannot be encoded as ASCII.
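This is also why the original pattern needed two \W. A small Python 3 sketch of what happens at the byte level, assuming the sensors output arrives UTF-8 encoded (which the two-\W match in the question suggests):

```python
import re

degree = '\u00b0'                  # the degree sign, code point 176
raw = degree.encode('utf-8')       # the degree sign as UTF-8 bytes
print(ord(degree), raw)            # 176 b'\xc2\xb0'

# Against undecoded bytes, each of the two bytes is its own non-word character.
print(re.findall(rb'\W', raw))     # [b'\xc2', b'\xb0']

# Against decoded text, the degree sign is a single character.
print(len(re.findall(r'\W', degree)))   # 1

# In a single-byte encoding such as Latin-1 it is one byte, so \W\W would fail.
print(len(degree.encode('latin-1')))    # 1
```

So matching undecoded bytes with \W\W depends on the encoding of the terminal, while matching decoded text does not.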
So, to decode the input data to Unicode, basically* you should use the encoding from the system locale, via locale.getpreferredencoding(), e.g.:
data = subprocess.check_output(...).decode(locale.getpreferredencoding())
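As a rough Python 3 sketch of that step, the decoding can be wrapped in a small helper that falls back to the locale encoding only when no encoding is given. The name run_and_decode is hypothetical, and the child command below is just a stand-in for sensors so the example runs anywhere.

```python
import locale
import subprocess
import sys


def run_and_decode(cmd, encoding=None):
    """Run cmd and return its stdout as text.

    With encoding=None the locale's preferred encoding is used.
    """
    raw = subprocess.check_output(cmd)
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


# Stand-in for sensors: a child interpreter that writes UTF-8 bytes.
child = [sys.executable, '-c',
         'import sys; sys.stdout.buffer.write(b"+31.0\\xc2\\xb0C")']
text = run_and_decode(child, encoding='utf-8')
print(text == '+31.0\u00b0C')  # True
```

For the real utility the call would presumably look like run_and_decode(['sensors']), relying on the locale fallback.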
With the data decoded correctly, you'll get the same output without re.UNICODE in this case.
Why basically? Because on Russian Win7, with cp1251 as the preferred encoding, suppose we have, for example, a script.py that encodes its output as utf-8:
#!/usr/bin/env python
# -*- coding: utf8 -*-
print u'temp1: +31.0°C (crit = +107.0°C)'.encode('utf-8')
And we need to parse its output:
subprocess.check_output(['python',
'script.py']).decode(locale.getpreferredencoding())
will produce the wrong result: 'В°' instead of °.
So in some cases you need to know the actual encoding of the input data rather than relying on the locale.
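The mismatch can be reproduced on any machine without a Russian locale. A minimal Python 3 sketch that decodes the same UTF-8 bytes both ways:

```python
raw = '\u00b0'.encode('utf-8')   # b'\xc2\xb0', the bytes script.py emits

wrong = raw.decode('cp1251')     # decoded with the locale encoding
right = raw.decode('utf-8')      # decoded with the encoding actually used

# ascii() is used so the output does not depend on the terminal encoding.
print(ascii(wrong))              # '\u0412\xb0'  (Cyrillic Ve + degree sign)
print(ascii(right))              # '\xb0'
```

Byte 0xC2 maps to the Cyrillic letter В in cp1251, which is where the stray character comes from.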