I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode
—for instance, a 0x93
in Windows-1252 encoded input becomes a u'\u0201c'
.
I have attempted to use PyString_Decode
, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
#include <Python.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
char c_string[] = { (char)0x93, 0 };
PyObject *py_string;
Py_Initialize();
py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
if (!py_string) {
PyErr_Print();
return 1;
}
return 0;
}
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
, which indicates that the ascii
encoding is used even though we specify windows_1252
in the call to PyString_Decode
.
The following code works around the problem by using PyString_FromString
to create a Python string of the undecoded bytes, then calling its decode
method:
#include <Python.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
char c_string[] = { (char)0x93, 0 };
PyObject *raw, *decoded;
Py_Initialize();
raw = PyString_FromString(c_string);
printf("Undecoded: ");
PyObject_Print(raw, stdout, 0);
printf("\n");
decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
Py_DECREF(raw);
printf("Decoded: ");
PyObject_Print(decoded, stdout, 0);
printf("\n");
return 0;
}
Use str. join() to convert a list of characters into a string. Call str. join(iterable) with the empty string "" as str and a list as iterable to join each character into a string.
In order to use non-ASCII characters, Python requires explicit encoding and decoding of strings into Unicode. In IBM® SPSS® Modeler, Python scripts are assumed to be encoded in UTF-8, which is a standard Unicode encoding that supports non-ASCII characters.
Use str. isprintable() to strip non-printable characters from a string.
PyString_Decode does this:
PyObject *PyString_Decode(const char *s,
Py_ssize_t size,
const char *encoding,
const char *errors)
{
PyObject *v, *str;
str = PyString_FromStringAndSize(s, size);
if (str == NULL)
return NULL;
v = PyString_AsDecodedString(str, encoding, errors);
Py_DECREF(str);
return v;
}
IOW, it does basically what you're doing in your second example - converts to a string, then decode the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString does PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a string object with the default encoding (for you, looks like that's ASCII). That's where it fails.
I believe you'll need to do two calls - but you can use PyString_AsDecodedObject rather than calling the python "decode" method. Something like:
#include <Python.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
char c_string[] = { (char)0x93, 0 };
PyObject *py_string, *py_unicode;
Py_Initialize();
py_string = PyString_FromStringAndSize(c_string, 1);
if (!py_string) {
PyErr_Print();
return 1;
}
py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
Py_DECREF(py_string);
return 0;
}
I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?
Just use PyString_FromString
:
char *cstring;
PyObject *pystring = PyString_FromString(cstring);
That's all. Now you have a Python str()
object. See docs here: https://docs.python.org/2/c-api/string.html
I'm a little bit confused about how to specify "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_DecodeString
is a good place to start.
Try calling PyErr_Print()
in the "if (!py_string)
" clause. Perhaps the python exception will give you some more information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With