I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

The question "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes" explains the issue. The solution there was to filter out certain bytes, but I'm confused about how to go about doing this. Any help?

Edit: Sorry if I didn't give enough info about the problem. The string data comes from an external API query, and I have no control over how the data is formatted.


Filtering out certain bytes in python

3 Answers

As the answer to the linked question said, the XML standard defines a valid character as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Translating that into Python:

def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
        )

You can then use that function however you need to, e.g.

cleaned_string = ''.join(c for c in input_string if valid_xml_char_ordinal(c))
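As a minimal sketch of how the pieces above fit together, assuming Python 3 (where every str is Unicode), the following wraps the filter in a helper and runs it on a made-up payload. The name clean_for_xml and the sample string are hypothetical and only serve the illustration.

```python
def valid_xml_char_ordinal(c):
    """Return True if the single character c matches the XML 1.0 Char production."""
    codepoint = ord(c)
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def clean_for_xml(input_string):
    """Drop every character that XML 1.0 does not allow (hypothetical helper name)."""
    return ''.join(c for c in input_string if valid_xml_char_ordinal(c))


# Made-up API payload containing a NUL byte (0x00) and a vertical tab (0x0B).
raw = "price:\x00 10\x0b EUR\n"
cleaned = clean_for_xml(raw)
print(repr(cleaned))  # 'price: 10 EUR\n'
```

Tab, newline, and carriage return survive because the XML grammar lists them explicitly; the other C0 control characters are dropped.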


answered Sep 21 '22 08:09

John Machin

Another approach that's much faster than the answer above is to use regular expressions, like so:

re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
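For repeated use, the same pattern could be compiled once and wrapped in a small function. This is a sketch assuming Python 3; the names _INVALID_XML_CHARS and strip_invalid_xml_chars, and the optional replacement parameter, are hypothetical additions for illustration.

```python
import re

# Matches runs of characters outside the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def strip_invalid_xml_chars(text, replacement=''):
    """Replace each run of XML-invalid characters with `replacement`."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Made-up sample: two leading control characters, an emoji, and a trailing 0x1F.
sample = "ok\x00\x01 text \U0001F600\x1f"
print(repr(strip_invalid_xml_chars(sample)))
print(repr(strip_invalid_xml_chars(sample, '?')))
```

Because the pattern ends in `+`, a run of adjacent invalid characters is replaced as a single unit, so the second call inserts one `?` per run rather than one per character.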

Compared to the answer above, it comes out more than 10X faster in my testing:

# Python 2 benchmark; needs the third-party requests package.
import timeit

# Statement under test: define the filter function and apply it per character.
func_test = """
def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
    )
''.join(c for c in r.content if valid_xml_char_ordinal(c))
"""

# Setup runs once per timer: download the page to be cleaned.
func_setup = """
import requests
r = requests.get("https://stackoverflow.com/questions/8733233/")
"""

regex_test = """re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', r.content)"""
regex_setup = """
import requests, re
r = requests.get("https://stackoverflow.com/questions/8733233/")
"""

func_timer = timeit.Timer(func_test, setup=func_setup)
regex_timer = timeit.Timer(regex_test, setup=regex_setup)

# Run each statement 100 times and print the total time in seconds.
print func_timer.timeit(100)
print regex_timer.timeit(100)

Output:

> 2.63773989677
> 0.221401929855

To make sense of that: the benchmark downloads this webpage once (the page you're currently reading), then runs the function-based technique and the regex technique over its contents 100 times each.

The function-based method takes about 2.6 seconds in total.
The regex method takes about 0.2 seconds in total.
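The comparison can be approximated without a network request. The sketch below assumes Python 3 and uses a synthetic string in place of the downloaded page, so the absolute timings will differ from the figures quoted above. The function names and the sample text are hypothetical.

```python
import re
import timeit

INVALID = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def clean_with_generator(text):
    """Per-character filtering in Python code."""
    return ''.join(c for c in text if valid_xml_char_ordinal(c))


def clean_with_regex(text):
    """Filtering delegated to the regex engine."""
    return INVALID.sub('', text)


# Synthetic stand-in for the page contents (an assumption, not the original data).
text = "Some <b>markup</b> with a stray \x00 byte and \x0b tab. " * 500

# Both techniques should agree before their speed is compared.
assert clean_with_generator(text) == clean_with_regex(text)

generator_seconds = timeit.timeit(lambda: clean_with_generator(text), number=20)
regex_seconds = timeit.timeit(lambda: clean_with_regex(text), number=20)
print("generator: %.4f s" % generator_seconds)
print("regex:     %.4f s" % regex_seconds)
```

The ratio between the two depends on the Python version and on how many invalid characters the input contains, so it is worth measuring on representative data.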

Update: As identified in the comments, the regex in this answer previously deleted some characters that should have been allowed in XML. These characters include anything in the Supplementary Multilingual Plane, which includes ancient scripts like cuneiform, hieroglyphics, and (weirdly) emojis.

The correct regex is now above. A quick test for this in the future is using re.DEBUG, which prints:

In [52]: re.compile(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', re.DEBUG)
max_repeat 1 4294967295
  in
    negate None
    range (32, 55295)
    literal 9
    literal 10
    literal 13
    range (57344, 65533)
    range (65536, 1114111)
Out[52]: re.compile(ur'[^ -\ud7ff\t\n\r\ue000-\ufffd\U00010000-\U0010ffff]+', re.DEBUG)

My apologies for the error. I can only offer that I found this answer elsewhere and put it in here. It was somebody else's error, but I propagated it. My sincere apologies to anybody this affected.
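The corrected character class can also be checked directly with a handful of boundary characters, instead of reading the re.DEBUG dump. This is a sketch assuming Python 3; the choice of sample characters is mine.

```python
import re

pattern = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)

# Characters the XML 1.0 Char production allows, including range boundaries,
# a cuneiform sign (U+12000), and an emoji (U+1F600).
allowed = [
    "\t", "\n", "\r", " ", "\uD7FF", "\uE000", "\uFFFD",
    "\U00010000", "\U00012000", "\U0001F600", "\U0010FFFF",
]

# Characters it forbids: C0 controls, a lone surrogate, and U+FFFE / U+FFFF.
forbidden = ["\x00", "\x08", "\x0b", "\x0c", "\x1f", "\ud800", "\uFFFE", "\uFFFF"]

for ch in allowed:
    assert pattern.sub('', ch) == ch, hex(ord(ch))
for ch in forbidden:
    assert pattern.sub('', ch) == '', hex(ord(ch))

print("character class matches the XML Char production for all samples")
```

If the supplementary-plane range were missing from the pattern, the cuneiform and emoji samples would fail the first loop.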

Update 2, 2017-12-12: I've learned from some OSX users that this code won't work on so-called narrow builds of Python, which apparently OSX sometimes has. You can check this by running import sys; sys.maxunicode. If it prints 65535, the code here won't work until you install a "wide build". See more about this here.
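The build check mentioned above can be written as a small guard. The helper name is_wide_build is hypothetical. Since Python 3.3 the narrow/wide distinction no longer exists and sys.maxunicode is always 1114111, so this check only matters on Python 2 and early Python 3 releases.

```python
import sys


def is_wide_build():
    """True if this interpreter can represent code points above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF


print(sys.maxunicode, is_wide_build())
```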

answered Sep 20 '22 08:09

mlissner

I think this is harsh/overkill and it seems painfully slow, but my program is still quick. After struggling to comprehend what was going wrong (even after I attempted to implement @John's cleaned_string implementation), I just adapted his answer to replace every character that is not printable ASCII with '?', using the following (Python 2.7):

from curses import ascii

def clean(text):
    # Keep printable ASCII characters; replace everything else with '?'.
    return str(''.join(
            c if ascii.isprint(c) else '?' for c in text
            ))

I'm not sure what I did wrong with the better option, but I just wanted to move on...
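A rough Python 3 equivalent of this approach does not need the curses module, which is not available on every platform. This is a sketch under that assumption; the placeholder parameter is a hypothetical addition. Printable ASCII is taken to be the range from space (0x20) to tilde (0x7E), which is the range curses.ascii.isprint accepts.

```python
def clean(text, placeholder='?'):
    """Replace every character outside printable ASCII (0x20-0x7E) with a placeholder."""
    return ''.join(c if ' ' <= c <= '~' else placeholder for c in text)


# Made-up sample with an accented letter and a NUL byte.
print(clean("caf\xe9\x00!"))  # caf??!
```

Note that this is much more aggressive than the XML filter: it also replaces tabs, newlines, and all non-ASCII text, so it loses data that XML would accept.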

answered Sep 23 '22 08:09

sage


Tags: python, text, xml, unicode, lxml

y3di
