Is there a best practice for normalizing British and American English in Elasticsearch?
Using a Synonym Token Filter requires an incredibly long configuration file. There are several thousand words that are spelled differently in UK and US English, and it's almost impossible to find a truly comprehensive list. Here's a list of almost 2,000 words, but it's far from complete.
Preferably, I'd like to create an ES Analyzer/Filter with rules to transform US to UK English. Maybe that's the better approach, but I don't know where to start - which type of filters do I need for that? It doesn't have to cover everything - it should merely normalize most search terms. E.g. "grey" - "gray", "colour" - "color", "center" - "centre", etc.
Here's the approach I went for after fiddling around a while. It's a combination of basic rules, "fixes", and synonyms: First, apply a char_filter to enforce a set of basic spelling rules. It's not 100% correct, but it does the job pretty well:
"char_filter": {
"en_char_filter": { "type": "mapping", "mappings": [
# fixes
"aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse",
# whole words
"armour=>armor", "behaviour=>behavior", "centre=>center", "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor",
"humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor",
# generic transformations
"ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing"
] }
}
The "fixes" entries are there to prevent other rules from being applied incorrectly. E.g. "prise=>prixse" prevents "prise" from getting changed into "prize", which has a different meaning. You may need to adapt this according to your own needs.
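To see why a "fix" entry shields a word from the generic rules, here is a rough Python simulation of how a mapping char filter picks its replacements. It assumes the documented behaviour that the longest key matching at a given position wins and that replaced text is not scanned again; `apply_mappings` is a hypothetical helper written for illustration, not part of Elasticsearch.

```python
def apply_mappings(text, mappings):
    """Approximate a 'mapping' char filter in a single left-to-right pass.

    Assumption: at each position the longest matching key wins, and the
    replacement text is emitted as-is without being matched again.
    """
    rules = dict(m.split("=>") for m in mappings)
    keys = sorted(rules, key=len, reverse=True)  # try longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# A small subset of the mappings above.
MAPPINGS = ["prise=>prixse", "colour=>color", "ise=>ize", "ll=>l"]

print(apply_mappings("prise", MAPPINGS))       # prixse (shielded from "ise=>ize")
print(apply_mappings("realise", MAPPINGS))     # realize
print(apply_mappings("travelling", MAPPINGS))  # traveling
```

Because the same analyzer runs at index and search time, the artificial token "prixse" never has to be a real word; it only has to be the same on both sides.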
Next, include a synonym filter for catching the most frequently used exceptions:
"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }
Here's our list of synonyms that includes the most important keywords for our use case. You may wish to adapt this list to your needs:
EN_SYNONYMS = (
"accolade, prize => award",
"accoutrement => accouterment",
"aching, pain => hurt",
"acw, anticlockwise, counterclockwise, counter-clockwise => ccw",
"adaptor => adapter",
"advocate, attorney, barrister, procurator, solicitor => lawyer",
"ageing => aging",
"agendas, agendum => agenda",
"almanack => almanac",
"aluminium => aluminum",
"america, united states, usa",
"amphitheatre => amphitheater",
"anti-aliased, anti-aliasing => antialiased",
"arbour => arbor",
"ardour => ardor",
"arse => ass",
"artefact => artifact",
"aubergine => eggplant",
"automobile, motorcar => car",
"axe => ax",
"bannister => banister",
"barbecue => bbq",
"battleaxe => battleax",
"baulk => balk",
"beetroot => beet",
"biassed => biased",
"biassing => biasing",
"biscuit => cookie",
"black american, african american, afro-american, negro",
"bobsleigh => bobsled",
"bonnet => hood",
"bulb, electric bulb, light bulb, lightbulb",
"burned => burnt",
"bussines, bussiness => business",
"business man, business people, businessman",
"business woman, business people, businesswoman",
"bussing => busing",
"cactus, cactuses => cacti",
"calibre => caliber",
"candour => candor",
"candy floss, candyfloss, cotton candy",
"car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
"carburettor => carburetor",
"castor => caster",
"cataloguing => cataloging",
"catboat, sailboat, sailing boat",
"champion, gainer, victor, win, winner => victory",
"chat => talk",
"chequebook => checkbook",
"chequer => checker",
"chequerboard => checkerboard",
"chequered => checkered",
"christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
"christmas, x-mas => xmas",
"cinema => movies",
"clangour => clangor",
"clarinettist => clarinetist",
"conditioning => conditioner",
"conference => meeting",
"coriander => cilantro",
"corporate => company",
"cosmos, universe => outer space",
"cosy, cosiness => cozy",
"criminal => crime",
"curriculums => curricula",
"cypher => cipher",
"daddy, father, pa, papa => dad",
"defence => defense",
"defenceless => defenseless",
"demeanour => demeanor",
"departure platform, station platform, train platform, train station",
"dishrag => dish cloth",
"dishtowel, dishcloth => dish towel",
"doughnut => donut",
"downspout => drainpipe",
"drugstore => pharmacy",
"e-mail => email",
"enamoured => enamored",
"england => britain",
"english => british",
"epaulette => epaulet",
"exercise, excercise, training, workout => fitness",
"expressway, motorway, highway => freeway",
"facebook => facebook, social media",
"fanny => buttocks",
"fanny pack => bum bag",
"farmyard => barnyard",
"faucet => tap",
"fervour => fervor",
"fibre => fiber",
"fibreglass => fiberglass",
"flashlight => torch",
"flautist => flutist",
"flier => flyer",
"flower fly, hoverfly, syrphid fly, syrphus fly",
"foot-walk, sidewalk, sideway => pavement",
"football, soccer",
"forums => fora",
"fourth => 4",
"freshman => fresher",
"chips, fries, french fries",
"gaol => jail",
"gaolbird => jailbird",
"gaolbreak => jailbreak",
"gaoler => jailer",
"garbage, rubbish => trash",
"gasoline => petrol",
"gases, gasses",
"gauge => gage",
"gauged => gaged",
"gauging => gaging",
"gipsy, gipsies, gypsies => gypsy",
"glamour => glamor",
"glueing => gluing",
"gravesite, sepulchre, sepulture => sepulcher",
"grey => gray",
"greyish => grayish",
"greyness => grayness",
"groyne => groin",
"gryphon, griffon => griffin",
"hand shake, shake hands, shaking hands, handshake",
"haulier => hauler",
"hobo, homeless, tramp => bum",
"new year, new year's eve, hogmanay, silvester, sylvester",
"holiday => vacation",
"holidaymaker, holiday-maker, vacationer, vacationist => tourist",
"homosexual, fag => gay",
"inbox, letterbox, outbox, postbox => mailbox",
"independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
"infant, suckling, toddler => baby",
"infeasible => unfeasible",
"inquire, inquiry => enquire",
"insure => ensure",
"internet, website => www",
"jelly => jam",
"jewelery, jewellery => jewelry",
"jogging => running",
"journey => travel",
"judgement => judgment",
"kerb => curb",
"kiwifruit => kiwi",
"laborer => worker",
"lacklustre => lackluster",
"ladybeetle, ladybird, ladybug => ladybird beetle",
"larrikin, scalawag, rascal, scallywag => naughty boy",
"leaf => leaves",
"licence, licenced, licencing => license",
"liquorice => licorice",
"lorry => truck",
"loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
"louvred => louvered",
"louvres => louver",
"lustre => luster",
"mail => post",
"mailman => postman",
"marriage, married, marry, marrying, wedding => wed",
"mayonaise => mayo",
"meagre => meager",
"misdemeanour => misdemeanor",
"mitre => miter",
"mom, momma, mummy, mother => mum",
"moonlight => moon light",
"moult => molt",
"moustache, moustached => mustache",
"nappy => diaper",
"nightlife => night life",
"normalcy => normality",
"octopus => kraken",
"odour => odor",
"odourless => odorless",
"offence => offense",
"omelette => omelet",
"# fix torres del paine",
"paine => painee",
"pajamas => pyjamas",
"pantyhose => tights",
"parenthesis, parentheses => bracket",
"parliament => congress",
"parlour => parlor",
"persnickety => pernickety",
"philtre => filter",
"phoney => phony",
"popsicle => iced-lolly",
"porch => veranda",
"pretence => pretense",
"pullover, jumper => sweater",
"pyjama => pajama",
"railway => railroad",
"rancour => rancor",
"rappel => abseil",
"row house, serial house, terrace house, terraced house, terraced housing, town house",
"rigour => rigor",
"rumour => rumor",
"sabre => saber",
"saltpetre => saltpeter",
"sanitarium => sanatorium",
"santa, santa claus, st nicholas, st nicholas day",
"sceptic, sceptical, scepticism, sceptics => skeptic",
"sceptre => scepter",
"shaikh, sheikh => sheik",
"shivaree => charivari",
"silverware, flatware => cutlery",
"simultaneous => simultanous",
"sleigh => sled",
"smoulder, smouldering => smolder",
"sombre => somber",
"speciality => specialty",
"spectre => specter",
"splendour => splendor",
"spoilt => spoiled",
"street => road",
"streetcar, tramway, tram => trolley-car",
"succour => succor",
"sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
"super hero, superhero => hero",
"surname => last name",
"sweets => candy",
"syphon => siphon",
"syphoning => siphoning",
"tack, thumb-tack, thumbtack => drawing pin",
"tailpipe => exhaust pipe",
"taleban => taliban",
"teenager => teen",
"television => tv",
"thank you, thanks",
"theatre => theater",
"tickbox => checkbox",
"ticked => checked",
"timetable => schedule",
"tinned => canned",
"titbit => tidbit",
"toffee => taffy",
"tonne => ton",
"transportation => transport",
"trapezium => trapezoid",
"trousers => pants",
"tumour => tumor",
"twitter => twitter, social media",
"tyre => tire",
"tyres => tires",
"undershirt => singlet",
"university => college",
"upmarket => upscale",
"valour => valor",
"vapour => vapor",
"vigour => vigor",
"waggon => wagon",
"windscreen, windshield => front shield",
"world championship, world cup, worldcup",
"worshipper, worshipping => worshiping",
"yoghourt, yoghurt => yogurt",
"zip, zip code, postal code, postcode",
"zucchini => courgette"
)
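For completeness, this is roughly how the two pieces could be wired into a custom analyzer. Treat it as a sketch: the analyzer name `en_analyzer`, the `standard` tokenizer and the `lowercase` filter are assumptions for illustration, and the two lists are shortened stand-ins for the full ones above.

```python
import json

# Shortened stand-ins for the full mappings and synonym list shown above.
EN_MAPPINGS = ["prise=>prixse", "colour=>color", "ise=>ize"]
EN_SYNONYMS = ("grey => gray", "aluminium => aluminum")


def build_settings(mappings, synonyms):
    """Build the index 'settings' body with a custom English analyzer."""
    return {
        "analysis": {
            "char_filter": {
                "en_char_filter": {"type": "mapping", "mappings": list(mappings)}
            },
            "filter": {
                "en_synonym_filter": {"type": "synonym", "synonyms": list(synonyms)}
            },
            "analyzer": {
                # "en_analyzer" is a hypothetical name; pick your own.
                "en_analyzer": {
                    "type": "custom",
                    "char_filter": ["en_char_filter"],
                    "tokenizer": "standard",  # assumed tokenizer
                    "filter": ["lowercase", "en_synonym_filter"],
                }
            },
        }
    }


settings = build_settings(EN_MAPPINGS, EN_SYNONYMS)
print(json.dumps(settings, indent=2))
```

One thing to watch: char filters see the raw text before the lowercase token filter runs, so capitalised input such as "Colour" will most likely not match a lowercase mapping like "colour=>color". Lowercasing the text before it reaches the analyzer is one way around that.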
I realize that this answer departs somewhat from the OP's initial question, but if you just want to normalize American vs. British English spelling variants, you can look here for a manageably sized list (~1,700 replacements): http://www.tysto.com/uk-us-spelling-list.html. I'm sure there are others out there too that you could use to create a consolidated master list.
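As a rough idea of how several such lists could be consolidated, here is a small Python sketch. It assumes each list has already been parsed into (UK, US) pairs; `to_synonym_rules` is a hypothetical helper, and the sample pairs are just the examples from the question.

```python
def to_synonym_rules(*pair_lists):
    """Merge (uk, us) spelling pairs into 'uk => us' synonym rules.

    On conflicting entries for the same UK spelling, the first list wins.
    """
    merged = {}
    for pairs in pair_lists:
        for uk, us in pairs:
            uk, us = uk.strip().lower(), us.strip().lower()
            if uk and us and uk != us:
                merged.setdefault(uk, us)
    return [f"{uk} => {us}" for uk, us in sorted(merged.items())]


list_a = [("colour", "color"), ("grey", "gray")]
list_b = [("Centre", "center"), ("colour", "color")]  # overlaps with list_a

for rule in to_synonym_rules(list_a, list_b):
    print(rule)
# centre => center
# colour => color
# grey => gray
```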
Apart from spelling variation, you must be very careful not to blithely replace words in isolation with their (assumed!) counterparts in American English. I would advise against all but the most solid of lexical replacements. E.g., I can't see anything bad happening from this one
"anticlockwise, counterclockwise, counter-clockwise => counter-clockwise"
but this one
"hobo, homeless, tramp => bum"
would index "A homeless man" => *"A bum man", which is nonsense. (Not to mention that hobos, the homeless and "tramps" are quite distinct -- http://knowledgenuts.com/2014/11/26/the-difference-between-hobos-tramps-and-bums/.)
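The effect is easy to reproduce outside of ES. The following Python sketch mimics an explicit "a, b => c" rule on a list of tokens; it handles single-word terms only and `apply_synonym_rule` is a hypothetical helper, not how the synonym filter is actually implemented.

```python
def apply_synonym_rule(tokens, rule):
    """Replace every token on the left-hand side of 'a, b => c' with c."""
    lhs, rhs = rule.split("=>")
    sources = {term.strip() for term in lhs.split(",")}
    target = rhs.strip()
    return [target if tok in sources else tok for tok in tokens]


tokens = "A homeless man".lower().split()
print(apply_synonym_rule(tokens, "hobo, homeless, tramp => bum"))
# ['a', 'bum', 'man']
```

The rule has no notion of part of speech, so the adjective "homeless" is swapped for the noun "bum" regardless of context.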
In summary, apart from spelling variation, the American vs. British dialect divide is complicated and cannot be reduced to simple list look-ups.
P.S. If you really want to do this right (i.e., account for grammatical context, etc.), you would probably need a context-sensitive paraphrase model to "translate" British to American English (or the inverse, depending on your needs) before it ever hits the ES index. This could be done (with sufficient parallel data) using an off-the-shelf statistical translation model, or maybe even some custom, in-house software that uses natural language parsing, POS tagging, chunking, etc.