Python: Convert utf-8 string to byte string

I have the following function to parse a UTF-8 string from a sequence of bytes:

Note: 'length_size' is the number of bytes it takes to represent the length of the UTF-8 string.

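def parse_utf8(self, bytes, length_size):
    # The first length_size bytes hold the length of the string (little-endian).
    length = bytes2int(bytes[0:length_size])
    # The next `length` bytes are the string itself, returned as a raw byte string.
    value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
    return value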


def bytes2int(raw_bytes, signed=False):
    """
    Convert a string of bytes to an integer (assumes little-endian byte order)
    """
    if len(raw_bytes) == 0:
        return None
    fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
    if signed:
        fmt = fmt.lower()
    return struct.unpack('<'+fmt, raw_bytes)[0]

I'd like to write the function in reverse, i.e. a function that takes a UTF-8 string and returns its representation as a byte string.

So far, I have the following:

def create_utf8(self, utf8_string):
    return utf8_string.encode('utf-8')

I run into the following error when attempting to test it:

  File "writer.py", line 229, in create_utf8
    return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)

If possible, I'd like to adopt a structure for the code similar to the parse_utf8 example. What am I doing wrong?

Thank you for your help!

UPDATE: test driver, now correct

def random_utf8_seq(self, length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"

    utf8_seq = u""

    for i in range(length):
        utf8_seq += random.choice(test_charset)

    return utf8_seq

I get the following error:

    input_str = self.random_utf8_seq(200)
  File "writer.py", line 226, in random_utf8_seq
    print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte

asked Feb 09 '14 22:02

mythander889

1 Answer

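In Python 2, calling encode on a plain str first decodes it implicitly with the ascii codec, which is what raises the UnicodeDecodeError. If a UTF-8 byte string is what you want, call encode on a unicode object instead, which means prefixing the source string literal in your example with u: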

# coding: utf-8
import random

def random_utf8_seq(length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"

    # Build a unicode string of `length` randomly chosen characters.
    utf8_seq = u''

    for i in range(length):
        utf8_seq += random.choice(test_charset)

    # Encode the unicode string to a UTF-8 byte string (type str in Python 2).
    print utf8_seq.encode('utf-8')
    return utf8_seq.encode('utf-8')

print( type(random_utf8_seq(200)) )

-- output --

õ3×sÔP{Ć.s(Ë°˙ě÷xÓ@bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇn×œ¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył"MťÆ‹ČČ4þ!»šåŮ@Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
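
To get the structure the question asks for (the reverse of parse_utf8), one possible approach is to encode the text and prepend its byte length. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The int2bytes helper and the extra length_size parameter on create_utf8 are assumptions for illustration and do not appear in the original code. The functions are plain functions rather than methods, and bytes2int is reduced to the unsigned case.

```python
import struct

# struct format codes for unsigned integers, keyed by size in bytes
# (the same table the question uses in bytes2int).
_FORMATS = {1: 'B', 2: 'H', 4: 'I', 8: 'Q'}


def int2bytes(value, size):
    """Hypothetical inverse of bytes2int: pack an unsigned int, little-endian."""
    return struct.pack('<' + _FORMATS[size], value)


def bytes2int(raw_bytes):
    """Unsigned, little-endian unpack, simplified from the question."""
    return struct.unpack('<' + _FORMATS[len(raw_bytes)], raw_bytes)[0]


def create_utf8(text, length_size):
    """Encode text as UTF-8 and prepend the payload length."""
    payload = text.encode('utf-8')
    # The prefix counts encoded bytes, not characters.
    return int2bytes(len(payload), length_size) + payload


def parse_utf8(data, length_size):
    """Read the length prefix, then decode that many bytes as UTF-8."""
    length = bytes2int(data[:length_size])
    return data[length_size:length_size + length].decode('utf-8')


text = u"h\u00e9llo \u20ac"
framed = create_utf8(text, 2)
print(len(text), len(framed))        # 7 characters, 12 bytes (2 prefix + 10 payload)
print(parse_utf8(framed, 2) == text)  # True
```

The length prefix has to be computed after encoding, because non-ASCII characters take more than one byte in UTF-8. If the payload is too long for the chosen length_size, struct.pack raises struct.error, so a real writer would probably want to check for that.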

answered Oct 28 '22 11:10

David Unric

Tags: python, string, encoding, utf-8
