Following the documentation here, I am trying to create features from unicode strings. Here is what the feature creation method looks like,
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
This will raise an exception,
File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 512, in init
copy.extend(field_value)
File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/containers.py", line 275, in extend
new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/type_checkers.py", line 108, in CheckValue
raise TypeError(message)
TypeError: u'Gross' has type <type 'unicode'>, but expected one of: (<type 'str'>,)
Naturally if I wrap the value
in a str
, it fails on the first actual unicode character it encounters.
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
JSON data always uses the Unicode character set. In this respect, JSON data is simpler to use than XML data. This is an important part of the JSON Data Interchange Format (RFC 4627).
The Difference Between Unicode and UTF-8Unicode is a character set. UTF-8 is encoding. Unicode is a list of characters with unique decimal numbers (code points).
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)
BytesList definition is in feature.proto and it is of type repeated bytes
, this means that you need to pass it something that's convertible to a list of byte sequences.
There's more than one way to turn unicode
into list of bytes, hence ambiguity. You could do it manually instead. IE, to use UTF-8
encoding
value.encode("utf-8")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With