How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?

Tags: python, c, character-encoding, embedding

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode; for instance, a 0x93 byte in Windows-1252 encoded input becomes u'\u201c'.
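To make the expected mapping concrete, here is a minimal sketch in plain C of what decoding a single Windows-1252 byte involves. The helper name cp1252_to_codepoint and the choice to return U+FFFD for the five undefined byte values are assumptions made for illustration; neither is part of the Python C API.

```c
#include <stdint.h>

/* Code points for Windows-1252 bytes 0x80..0x9F. 0xFFFD marks the five
   byte values that the encoding leaves undefined. */
static const uint32_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Hypothetical helper: map one Windows-1252 byte to a Unicode code point. */
uint32_t cp1252_to_codepoint(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return byte;  /* 0x00..0x7F and 0xA0..0xFF match ISO 8859-1 */
}
```

With this table, cp1252_to_codepoint(0x93) yields 0x201C, the left double quotation mark that the decoded Python string is expected to contain.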

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     raw = PyString_FromString(c_string);
     if (!raw) {
          PyErr_Print();
          return 1;
     }
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     if (!decoded) {
          PyErr_Print();
          return 1;
     }
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     Py_DECREF(decoded);
     return 0;
}

asked Oct 17 '08 19:10

Vebjorn Ljosa

3 Answers

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
                          Py_ssize_t size,
                          const char *encoding,
                          const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what you're doing in your second example: it converts the bytes to a string object, then decodes that string. The problem arises because it calls PyString_AsDecodedString rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (for you, it looks like that's ASCII). That's where it fails.
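The failing step can be modeled without Python at all. The sketch below is only a toy model of an ASCII encoder (encode_ascii is a hypothetical name, not a CPython function): once the decode step has produced U+201C, any attempt to squeeze it back into ASCII has to fail.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of an ASCII encoder. Writes one byte per code point into out
   and returns 0, or returns -1 as soon as a code point is above 0x7F
   (the "ordinal not in range(128)" case). */
int encode_ascii(const uint32_t *codepoints, size_t count, char *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (codepoints[i] > 0x7F)
            return -1;
        out[i] = (char)codepoints[i];
    }
    return 0;
}
```

Passing a single code point 0x201C makes the function return -1, which mirrors the UnicodeEncodeError reported in the question even though the decode itself succeeded.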

I believe you'll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

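     /* Wrap the raw bytes in a str object without interpreting them. */
     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     /* Decode to a unicode object; nothing re-encodes the result. */
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }
     Py_DECREF(py_unicode);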

     return 0;
}

I'm not entirely sure why PyString_Decode works this way. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.

answered Oct 19 '22 23:10

Tony Meyer

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

Just use PyString_FromString:

const char *cstring = "\x93";  /* any NUL-terminated byte string */
PyObject *pystring = PyString_FromString(cstring);

That's all. Now you have a Python str() object. See docs here: https://docs.python.org/2/c-api/string.html

I'm a little confused about whether you want a "str" or a "unicode" object. They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.
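To illustrate how different the two are: the character that Windows-1252 stores as the single byte 0x93 is the code point U+201C, and the same character takes three bytes in UTF-8. A rough sketch of a UTF-8 encoder in plain C (utf8_encode is a hypothetical helper, not part of the Python C API) shows this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the UTF-8 encoding of one code point into
   out (which must hold at least 4 bytes). Returns the number of bytes
   written, or 0 for surrogates and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+201C the function writes the bytes 0xE2 0x80 0x9C, so a "str" holding this character has length 1 or 3 depending on the encoding, while the "unicode" object always has length 1.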

answered Oct 19 '22 23:10

Dan Lenski

Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.

answered Oct 20 '22 00:10

Alex Coventry
