Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL - Escaping ampersand (&) in fulltext searches

We are using a fulltext search to search for the name of a company and all is going well until we have a company with an ampersand in its name, e.g. 'M&S'.

SELECT name FROM company WHERE MATCH (name) against ('M&S' IN BOOLEAN MODE);

This fails to return any results as MySQL is treating the ampersand as a boolean operator. The boolean mode is desired so it can't simply be turned off.

What I'm looking for is a way to escape the ampersand so that MySQL treats it correctly and finds the record.

Ditching fulltext search in favour of LIKEs isn't exactly an option either

Thanks for your help

like image 413
Rob Forrest Avatar asked Jun 16 '14 07:06

Rob Forrest


1 Answers

Seems like & isn't considered a word character in the collation you use for your fulltext search.

so you have to create your own collation (or recompile your MySQL server) where you add & to the list of word characters like i found out in the MySQL docs ( http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html) :

If you want to change the set of characters that are considered word characters, you can do so in several ways, as described in the following list. After making the modification, you must rebuild the indexes for each table that contains any FULLTEXT indexes. Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:

Modify the MySQL source: In myisam/ftdefs.h, see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.

Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. . You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.3.1, “Character Definition Arrays”.

Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.4, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.9.7, “Adding a Collation for Full-Text Indexing”.

UPDATE: in case you are using latin1 collation, open your XML file which is at mysql/share/charsets/latin1.xml. and find the corresponding character code in a map - in this case you can take the map for lower case or upper case because this doesn't matter for the ampersand symbol:

<lower>
<map>
 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
 40 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
 70 71 72 73 74 75 76 77 78 79 7A 5B 5C 5D 5E 5F
 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
 F0 F1 F2 F3 F4 F5 F6 D7 F8 F9 FA FB FC FD FE DF
 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
</map>
</lower>

the ampersand's unicode is U+0026 and in utf-8 encoding it's 0x26, so search for 26 in the map - which is in the 3rd row, 7th column.

then in the ctype-map change the type of the character from 10 which means punctuation to 01 which means small letter:

<ctype>
<map>
 00
 20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 48 10 10 10 10 10 01 10 10 10 10 10 10 10 10 10
 84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
 10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10
 10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20
 10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00
 00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01
 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02
 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02
</map>
</ctype>

restart your MySQL server and the corresponding collation is handling & like it was a small letter.

of course it's better to first copy and rename your new collation XML-file and to also copy and paste the corresponding lines in the Index.xml (don't forget to use a new unused id in the XML tags there) and link them to your new collation XML-file so you don't lose your original collation.

you can find the full documentation where i got most of the information from here: http://dev.mysql.com/doc/refman/5.0/en/full-text-adding-collation.html

Note - For all those working with Mysql 5.7 version use an unused collation id. The mysql article http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html is for Mysql 5.5 version. To get maximum collation Id use following Query -

   SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
like image 92
low_rents Avatar answered Nov 14 '22 00:11

low_rents