
Removing invalid characters from Amazon CloudSearch SDF

While trying to post data extracted from a PDF file to an Amazon CloudSearch domain for indexing, the indexing failed because of invalid characters in the data.

How can I remove these invalid characters before posting to the search endpoint?

I tried escaping and replacing the characters, but that didn't work.

Quicksilver asked Sep 17 '25 18:09


1 Answer

I was getting an error like this when uploading a document to CloudSearch (using the AWS SDK with JSON):

Error with source for field content_stemmed: Validation error for field 'content_stemmed': Invalid codepoint B

The solution for me, as documented by AWS (reference below), was to remove invalid characters from the document prior to uploading.

For example, this is what I did in JavaScript:

const cleaned = someFieldValue.replace(
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g, 
  ''
)
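If it is unclear which characters are causing the rejection, it can help to list them before stripping anything. The following is a minimal sketch, not part of the original fix: `findInvalidCodeUnits` is a hypothetical helper name, and it assumes the "B" in the error message is a hexadecimal code point (U+000B, vertical tab), which is common in text extracted from PDFs.

```javascript
// Same character class as the AWS-documented pattern; the `g` flag is
// required by String.prototype.matchAll.
const INVALID = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Hypothetical helper: returns the distinct invalid UTF-16 code units
// found in `text`, as uppercase hex strings, in order of first appearance.
function findInvalidCodeUnits(text) {
  const found = new Set();
  for (const match of text.matchAll(INVALID)) {
    found.add(match[0].charCodeAt(0).toString(16).toUpperCase());
  }
  return [...found];
}

// Sample input containing a vertical tab (U+000B) and a form feed (U+000C).
const report = findInvalidCodeUnits('line one\u000Bline two\u000C');
console.log(report); // [ 'B', 'C' ]
```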

Reference (from the AWS documentation):

Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors.

You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
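The same pattern can be applied to every text field in a batch before it is uploaded. This is a rough sketch under a few assumptions: the batch is a JSON array of operations shaped like `{ type, id, fields }`, and `stripInvalidChars` and `cleanBatch` are illustrative names, not part of the AWS SDK.

```javascript
// Pattern quoted from the AWS documentation, with the `g` flag added so
// that every occurrence is replaced rather than only the first.
const INVALID_XML_CHARS = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Hypothetical helper: strip invalid characters from a single string.
function stripInvalidChars(value) {
  return value.replace(INVALID_XML_CHARS, '');
}

// Hypothetical helper: return a copy of the batch in which every string
// field (and every string inside an array field) has been cleaned.
function cleanBatch(batch) {
  return batch.map((op) => {
    if (!op.fields) return op; // e.g. delete operations carry no fields
    const fields = {};
    for (const [name, value] of Object.entries(op.fields)) {
      if (typeof value === 'string') {
        fields[name] = stripInvalidChars(value);
      } else if (Array.isArray(value)) {
        fields[name] = value.map((v) =>
          typeof v === 'string' ? stripInvalidChars(v) : v
        );
      } else {
        fields[name] = value; // numbers and other types pass through
      }
    }
    return { ...op, fields };
  });
}

// Sample batch: one document containing a vertical tab, one delete.
const batch = [
  { type: 'add', id: 'doc1', fields: { content_stemmed: 'page one\u000Bpage two' } },
  { type: 'delete', id: 'doc2' },
];
const cleanedBatch = cleanBatch(batch);
console.log(cleanedBatch[0].fields.content_stemmed); // "page onepage two"
```

One thing to be aware of: without the `u` flag, a JavaScript regular expression works on UTF-16 code units, so this pattern also removes characters outside the Basic Multilingual Plane (most emoji, for example), because they are stored as surrogate pairs.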

joshweir answered Sep 20 '25 09:09