While trying to post the data extracted from a pdf file to a amazon cloud search domain for indexing, the indexing failed due to invalid chars in the data.
How can i remove these invalid charecters before posting to the search end point?
I tried escaping and replacing the chars, but didn't work.
I was getting an error like this when uploading document to CloudSearch (using aws sdk / json):
Error with source for field content_stemmed: Validation error for field 'content_stemmed': Invalid codepoint B
The solution for me, as documented by AWS (reference below), was to remove invalid characters from the document prior to uploading:
For example this is what I did using javascript:
const cleaned = someFieldValue.replace(
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g,
''
)
ref:
Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors.
You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With