Removing invalid characters from Amazon CloudSearch SDF

Question

I am trying to post data extracted from a PDF file to an Amazon CloudSearch domain for indexing, but indexing fails because of invalid characters in the data.

How can I remove these invalid characters before posting to the search endpoint?

I tried escaping and replacing the characters, but that didn't work.

joshweir · Accepted Answer

I was getting an error like this when uploading a document to CloudSearch (using the AWS SDK with JSON):

Error with source for field content_stemmed: Validation error for field 'content_stemmed': Invalid codepoint B

The solution for me, as documented by AWS (reference below), was to remove invalid characters from the document prior to uploading.

For example, this is what I did in JavaScript:

const cleaned = someFieldValue.replace(
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g, 
  ''
)
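If several fields in a batch can contain extracted text, the same replacement can be applied to every string value before upload. The following is a minimal sketch, not part of the original answer: `stripInvalid` and the sample `batch` are hypothetical names, and the batch shape (`type`, `id`, `fields`) is assumed to follow the CloudSearch JSON document format. The sample value contains U+000B (vertical tab), which is presumably the "codepoint B" reported in the error above.

```javascript
// Same character class as above: matches anything outside the allowed ranges.
const INVALID_CHARS = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Hypothetical helper: returns a cleaned copy of the input.
// Strings are stripped; arrays and plain objects are walked recursively;
// numbers, booleans and null are returned unchanged.
function stripInvalid(value) {
  if (typeof value === 'string') {
    return value.replace(INVALID_CHARS, '');
  }
  if (Array.isArray(value)) {
    return value.map(stripInvalid);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, stripInvalid(inner)])
    );
  }
  return value;
}

// Illustrative batch; the id and field names are made up for this example.
const batch = [
  {
    type: 'add',
    id: 'doc-1',
    fields: {
      title: 'Report',
      content_stemmed: 'page one\u000Bpage two',
      keywords: ['alpha\u0000', 'beta'],
    },
  },
];

const cleanedBatch = stripInvalid(batch);
```

The helper builds a new structure instead of mutating the input, so the original extracted text stays available for logging or debugging.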

Reference, quoted from the AWS documentation:

Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors.

You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
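One side effect worth knowing when using this expression in JavaScript: strings are UTF-16, so characters above U+FFFF (for example emoji) are stored as surrogate pairs, and the expression removes both halves. The sketch below compares it with a variant that uses the `u` flag and additionally allows U+10000 to U+10FFFF, which are valid XML characters. The variant is an assumption on my part, not something taken from the AWS text; whether CloudSearch accepts those characters should be verified against your own domain before relying on it.

```javascript
// Expression from the AWS documentation, with the global flag added.
// Without the `u` flag it works on UTF-16 code units, so both halves
// of a surrogate pair are matched and removed.
const bmpOnly = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Assumed variant: the `u` flag makes the regex work on code points,
// so well-formed pairs are kept and only lone surrogates are removed.
const keepSupplementary =
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// U+1F600 is stored as a surrogate pair; \uD800 here is a lone surrogate.
const sample = 'ok \u{1F600} \uD800 end';

const strippedAll = sample.replace(bmpOnly, '');
const keptEmoji = sample.replace(keepSupplementary, '');

console.log(JSON.stringify(strippedAll)); // emoji and lone surrogate both removed
console.log(JSON.stringify(keptEmoji));   // emoji kept, lone surrogate removed
```

If dropping such characters is acceptable for the index, the original expression is the safer choice, since it is the one AWS documents.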

Tags:

amazon-web-services

amazon-cloudsearch

Quicksilver
