
Unicode in the standard TensorFlow format

Following the documentation here, I am trying to create features from unicode strings. Here is what the feature creation method looks like,

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

When the value is a unicode string, this raises an exception:

  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 512, in init
    copy.extend(field_value)
  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/containers.py", line 275, in extend
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/type_checkers.py", line 108, in CheckValue
    raise TypeError(message)
TypeError: u'Gross' has type <type 'unicode'>, but expected one of: (<type 'str'>,)

Naturally, if I wrap the value in str(), it fails on the first non-ASCII character it encounters.
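For illustration, the failure can be reproduced without TensorFlow. This is a minimal sketch that assumes Python 2's default codec is ASCII (the usual default) and uses a hypothetical sample string; calling `encode("ascii")` explicitly mimics what `str()` attempts on a unicode value in Python 2.

```python
# Hypothetical sample value containing one non-ASCII character (U+00DF).
text = u"Gro\u00df"

try:
    # Python 2's str(text) implicitly tries this ASCII encoding by default.
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit codec that can represent the character succeeds.
utf8_bytes = text.encode("utf-8")
```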

asked Aug 15 '16 at 19:08 by Russell



1 Answer

BytesList is defined in feature.proto as a repeated bytes field, which means you need to pass it a list of byte strings.

There is more than one way to turn a unicode string into bytes, so the conversion is ambiguous and is not done for you. You can encode the value yourself instead. For example, to use UTF-8:

value.encode("utf-8")
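As a rough sketch of how this might fit into the helper from the question, the encoding step can be pulled into a small function. The name `to_utf8_bytes` is hypothetical and not part of TensorFlow, and the choice to pass existing byte strings through unchanged is an assumption about what the caller wants.

```python
def to_utf8_bytes(value):
    """Return value as a UTF-8 byte string; byte strings pass through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


# Hypothetical sample value with a non-ASCII character.
encoded = to_utf8_bytes(u"Gro\u00df")

# Decoding with the same codec recovers the original text when reading back.
decoded = encoded.decode("utf-8")

# With TensorFlow installed, the question's helper could then be written as:
# def _bytes_feature(value):
#     return tf.train.Feature(
#         bytes_list=tf.train.BytesList(value=[to_utf8_bytes(value)]))
```

Whatever reads the records back has to decode with the same codec, since the stored bytes carry no encoding information.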
answered Oct 12 '22 at 18:10 by Yaroslav Bulatov