So you're working with Solr, reading data from it, Doing Things to that data, and saving the updates. It Works! Ship it! Then (in testing, thank the FSM) you start to get some odd failures. Sometimes it works, sometimes the Solr server returns a 400 or 500 error. Whisky Tango Foxtrot?
Say it's a bookstore app. An international bookstore. So you've got multiple code pages here. Some titles in Spanish, some in Hebrew. The app itself is in American English. So your field names are English, with titles and other text in Cyrillic or, for character-ordering fun, in Hebrew. You notice that one (but not all) of the titles in Hebrew is causing problems.
The process you are following is: Query Solr to get a record, update the record, and write the entire record back to Solr. You're updating the "Count" field from "5" to "4". Some titles update, some fail. Googling reveals all sorts of possible red herrings: Is it a byte order mark problem? UTF8 control characters? Misconfiguration? Maybe. But.
Given a document update that looks like this:
<add>
<doc>
<field name="StockNumber">1</field>
<field name="Count">5</field>
<field name="Title">רוקד עם זאבים</field>
<field name="Translated_Title">Dances With Smurfs</field>
<field name="Summary">Our Hero goes to another place, bonds with the Odd Looking Natives, & saves the day.</field>
</doc>
</add>
The problem is in the "Summary" field. Specifically, the "&". It has to be escaped as the XML entity "&amp;"; otherwise the XML parser treats whatever follows it as the start of an entity reference, not as part of the field value, and the update fails. Note that it was returned by a query to Solr as "&", not as "&amp;". So you can't just accept data returned from a query to Solr as being in the proper form for updating Solr. Of course, if you URL encode (percent-encode) every field you read from Solr before you write it back, you will mangle it badly, as the Hebrew (in our example) will be stored in its hexadecimal form, and then returned in that form (not as Hebrew) on future queries.
Solr will, however, store the "&amp;" as "&".
"<" and ">" have the same problems: send them as "&lt;" and "&gt;".
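As a rough illustration of the escaping described above, here is a minimal Python sketch that builds an update document from values read back from a query. The helper name build_add_xml and the record dict are made up for this example and are not part of any Solr client library; the only library calls used are escape from xml.sax.saxutils and quote from urllib.parse in the Python standard library. It also shows why percent-encoding is the wrong tool for the Hebrew title.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def build_add_xml(fields):
    """Build a Solr <add> document, XML-escaping every field value.

    `fields` is a hypothetical dict of field name -> value, as read back
    from a query. It is an assumption for this sketch, not a Solr API.
    """
    lines = ["<add>", "<doc>"]
    for name, value in fields.items():
        # escape() turns & into &amp;, < into &lt; and > into &gt;.
        # The extra mapping also escapes double quotes in the name,
        # because the name sits inside an attribute.
        safe_name = escape(str(name), {'"': "&quot;"})
        safe_value = escape(str(value))
        lines.append(f'<field name="{safe_name}">{safe_value}</field>')
    lines += ["</doc>", "</add>"]
    return "\n".join(lines)


record = {
    "StockNumber": 1,
    "Count": 4,
    "Title": "רוקד עם זאבים",
    "Summary": "bonds with the Odd Looking Natives, & saves the day.",
}
update_xml = build_add_xml(record)

# Percent-encoding, by contrast, rewrites the Hebrew itself as hex bytes.
mangled_first_letter = quote("ר")
```

The Hebrew passes through untouched because XML escaping only rewrites the markup characters, while quote() replaces every non-ASCII byte with a %XX sequence.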
Alternatively, try wrapping every field value in a CDATA section in your client application. Like:
<add>
<doc>
<field name="StockNumber"><![CDATA[1]]></field>
<field name="Count"><![CDATA[5]]></field>
<field name="Title"><![CDATA[רוקד עם זאבים]]></field>
<field name="Translated_Title"><![CDATA[Dances With Smurfs]]></field>
<field name="Summary"><![CDATA[Our Hero goes to another place, bonds with the Odd Looking Natives, & saves the day.]]></field>
</doc>
</add>
Of course it's not necessary for integer fields, but if you are dynamically constructing the document from an application, using it always is easier.
The only warning is to make sure the text does not already contain a CDATA section, or more precisely the closing sequence "]]>". CDATA sections do not nest, so a double CDATA will cause trouble everywhere.
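One way to guard against that case, sketched in Python: split any "]]>" found in the text so that the first section closes after "]]" and a new one opens before ">". The function name cdata is invented for this example, and the splitting trick is a common XML idiom rather than anything specific to Solr.

```python
def cdata(value):
    """Wrap text in a CDATA section, splitting any embedded "]]>".

    Hypothetical helper for this sketch, not part of any Solr client.
    """
    text = str(value)
    # "]]>" would end the section early, so close the section after "]]"
    # and reopen a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


summary = "Natives, & saves the day. ]]> <b>really</b>"
field_xml = '<field name="Summary">' + cdata(summary) + "</field>"
```

A parser reading field_xml joins the adjacent sections back together, so the field value comes out as the original string, "&" and "]]>" included.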