I am writing python script which is generating C++ code based on the data.
I have python variable string
which contains a string which can be composed
of characters like "
or newlines.
What is the best way to escape this string for code generation?
The way I use is based on the observation that C++ strings basically obey the same rules regarding charactes and escaping as Javascript/JSON string.
Python since version 2.6 has a built-in JSON library which can serialize Python data into JSON. Therefore, the code is, assuming we don't need enclosing quotes, just as follows:
import json
string_for_printing = json.dumps(original_string).strip('"')
I use this code, so far without problems:
def string(s, encoding='ascii'):
if isinstance(s, unicode):
s = s.encode(encoding)
result = ''
for c in s:
if not (32 <= ord(c) < 127) or c in ('\\', '"'):
result += '\\%03o' % ord(c)
else:
result += c
return '"' + result + '"'
It uses octal escapes to avoid all potentially problematic characters.
We can do better using specifics of C found here (https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Character-Constants) and Python's built-in printable function:
def c_escape():
import string
mp = []
for c in range(256):
if c == ord('\\'): mp.append("\\\\")
elif c == ord('?'): mp.append("\\?")
elif c == ord('\''): mp.append("\\'")
elif c == ord('"'): mp.append("\\\"")
elif c == ord('\a'): mp.append("\\a")
elif c == ord('\b'): mp.append("\\b")
elif c == ord('\f'): mp.append("\\f")
elif c == ord('\n'): mp.append("\\n")
elif c == ord('\r'): mp.append("\\r")
elif c == ord('\t'): mp.append("\\t")
elif c == ord('\v'): mp.append("\\v")
elif chr(c) in string.printable: mp.append(chr(c))
else:
x = "\\%03o" % c
mp.append(x if c>=64 else (("\\%%0%do" % (1+c>=8)) % c, x))
return mp
This has the advantage of now being a mapping from ordinal value of a character ord(c)
to the exact string. Using +=
for strings is slow and bad practice, so this allows for "".join(...)
which is far more performant in Python. Not to mention, indexing a list/table is much faster than doing computations on characters each time through. Also do not waste octal characters either by checking if less characters are needed. However, to use this, you must verify the next character is not a 0
through 7
otherwise you must use the 3 digit octal format.
The table looks like:
[('\\0', '\\000'), ('\\1', '\\001'), ('\\2', '\\002'), ('\\3', '\\003'), ('\\4', '\\004'), ('\\5', '\\005'), ('\\6', '\\006'), '\\a', '\\b', '\\t', '\\n', '\\v', '\\f', '\\r', ('\\16', '\\016'), ('\\17', '\\017'), ('\\20', '\\020'), ('\\21', '\\021'), ('\\22', '\\022'), ('\\23', '\\023'), ('\\24', '\\024'), ('\\25', '\\025'), ('\\26', '\\026'), ('\\27', '\\027'), ('\\30', '\\030'), ('\\31', '\\031'), ('\\32', '\\032'), ('\\33', '\\033'), ('\\34', '\\034'), ('\\35', '\\035'), ('\\36', '\\036'), ('\\37', '\\037'), ' ', '!', '\\"', '#', '$', '%', '&', "\\'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '\\?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\\177', '\\200', '\\201', '\\202', '\\203', '\\204', '\\205', '\\206', '\\207', '\\210', '\\211', '\\212', '\\213', '\\214', '\\215', '\\216', '\\217', '\\220', '\\221', '\\222', '\\223', '\\224', '\\225', '\\226', '\\227', '\\230', '\\231', '\\232', '\\233', '\\234', '\\235', '\\236', '\\237', '\\240', '\\241', '\\242', '\\243', '\\244', '\\245', '\\246', '\\247', '\\250', '\\251', '\\252', '\\253', '\\254', '\\255', '\\256', '\\257', '\\260', '\\261', '\\262', '\\263', '\\264', '\\265', '\\266', '\\267', '\\270', '\\271', '\\272', '\\273', '\\274', '\\275', '\\276', '\\277', '\\300', '\\301', '\\302', '\\303', '\\304', '\\305', '\\306', '\\307', '\\310', '\\311', '\\312', '\\313', '\\314', '\\315', '\\316', '\\317', '\\320', '\\321', '\\322', '\\323', '\\324', '\\325', '\\326', '\\327', '\\330', '\\331', '\\332', '\\333', '\\334', '\\335', '\\336', '\\337', '\\340', '\\341', '\\342', '\\343', '\\344', '\\345', '\\346', '\\347', '\\350', '\\351', '\\352', '\\353', '\\354', '\\355', '\\356', '\\357', '\\360', '\\361', '\\362', '\\363', '\\364', '\\365', '\\366', '\\367', '\\370', '\\371', '\\372', '\\373', '\\374', '\\375', '\\376', '\\377']
Example usage encoding some 4-byte integers as C-strings in little-endian byte order with new lines inserted every 50 characters: v
mp = c_escape()
vals = [30,50,100]
bytearr = [z for i, x in enumerate(vals) for z in x.to_bytes(4, 'little', signed=x<0)]
"".join(mp[x] if not type(mp[x]) is tuple else mp[x][1 if not i == len(bytearr)-1 and bytearr[i+1] in list(range(ord('0'), ord('7')+1)) else 0] + ("\"\n\t\"" if (i % 50) == 49 else "") for i, x in enumerate(bytearr))
#output: '\\36\\0\\0\\0002\\0\\0\\0d\\0\\0\\0'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With