
Searching for companies with elasticsearch

Imagine I have two sources of data. One source calls Mærsk "A.P. Møller - Mærsk A" while the other calls it "A.P. Møller - Mærsk A/S". I have a lot of companies and I want to streamline the naming of these.

Both sources are indexed in elasticsearch, but I am too much of a newbie with this technology to come up with a proper search query. My initial thought was to use common, which gives decent results, but I figure there are better ways.

Any suggestions?

EDIT

A little clarification. My two sources are just data sources that deliver company names. I've stored the names from each source in its own index - a document is just the name.

So I have two indices with company names (nothing else there). Now for each company name in index A I want to find the corresponding company in index B. The challenge is that there are various ways to write a company name - it is not standardized. I want to create this link with as little manual labour as possible and with minimal risk of errors.

asked Sep 18 '25 08:09 by mr.bjerre


1 Answer

The OP has probably moved on from this question, given it was asked a while ago. And, for example, the common query has since been deprecated. But in case it helps others, here are some guidelines:

The Problem

As I understand it from the question, the problem is exemplified by this: I have two company names in two different data sources. One is:

A.P. Møller - Mærsk A

The other is:

A.P. Møller - Mærsk A/S

Assuming these represent the same company, the problem is how to resolve these to a single canonical name (for example, "Mærsk" if that is an appropriate name in this case).

Furthermore, how can we perform this matching process across a large set of company names in as automated a way as possible?

One warning - it usually pays to make such tasks repeatable - even if you think it's going to be a one-time-only clean-up exercise, it often doesn't end up that way (IMHO).

One Solution

Getting to a fully-automated matching solution is typically not possible in cases like this - some manual intervention is usually needed. But you may be able to get close.

I will take some liberties - for example, I will ignore the "two different data sources" aspect. Instead, I will assume we have one overall list, the union of both sources (because maybe there are name variants within each list).

Here is what has broadly worked for me in a similar domain (film titles).

FULL DISCLOSURE: I did not use Elasticsearch in my case. I used Lucene and some custom Java. But in this context, there are many similarities. My references below are all to Elasticsearch v7.5 functionality.

Tokenization

The question indicates that data has already been indexed - but using what tokenization steps? Some suggestions (which may already have been implemented in the OP's case):

  • Consider leaving in stop-words. Not a hard-and-fast rule, but consider what would happen to the band The The if stop-words were removed. There would be nothing to index. In relatively short text such as names, stop-words may be too important to remove.

  • Consider ASCII folding, etc. to normalize text (removal of diacritics, such as é to e; expansion of ligatures, such as æ to ae; and so on). This assumes you are using Latin-based text. It is less relevant for other scripts (Chinese, etc.).

  • Consider customizations specific to your problem domain. For example, there may be nomenclature variations such as "LTD", "Ltd", etc. representing the word "Limited" in company names. Or the use of ampersands (&) in some examples, but "and" in others. "Smith & Sons, Ltd" versus "Smith and Sons Limited".

  • Other transformations such as lowercasing and removal of punctuation are more straightforward.
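
To make the suggestions above concrete, here is a minimal Python sketch of the same normalization steps, applied outside the index. It is not an Elasticsearch analyzer and not what I actually ran; the fold table and the synonym table (SPECIAL_FOLDS, TOKEN_SYNONYMS) are assumptions you would extend after studying your own data. Stop-words are deliberately kept.

```python
import re
import unicodedata
from typing import List

# Letters that Unicode decomposition does not split into ASCII letters,
# so they need an explicit mapping (assumed, incomplete list).
SPECIAL_FOLDS = {"æ": "ae", "ø": "o", "œ": "oe", "ß": "ss"}

# Hypothetical domain rules for company names; extend for your data.
TOKEN_SYNONYMS = {"ltd": "limited", "&": "and"}


def fold_to_ascii(text: str) -> str:
    """Expand ligatures and strip diacritics (é -> e, æ -> ae)."""
    for source, target in SPECIAL_FOLDS.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name: str) -> List[str]:
    """Lowercase, fold, drop punctuation, and apply domain synonyms."""
    text = fold_to_ascii(name.lower())
    # Remove periods so initials such as "A.P." collapse to "ap".
    text = text.replace(".", "")
    # Keep "&" as its own token; all other punctuation acts as a separator.
    tokens = re.findall(r"[a-z0-9]+|&", text)
    return [TOKEN_SYNONYMS.get(token, token) for token in tokens]


if __name__ == "__main__":
    print(normalize_name("A.P. Møller - Mærsk A/S"))
    print(normalize_name("Smith & Sons, Ltd"))
    print(normalize_name("The The"))
```

With these rules, "Smith & Sons, Ltd" and "Smith and Sons Limited" produce the same token list, and "The The" still yields two tokens because no stop-words are removed.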

Supporting Metadata

The OP may not have access to any of this - but supporting metadata can be vital in determining if two name variants refer to the same entity. An example from the world of film titles: There are two movies in IMDb called "Kicking and Screaming" - and numerous TV episodes. They can be distinguished from each other by comparing related metadata such as:

  • type of release (movie, TV episode, etc).
  • year of initial release (perhaps with a +/- tolerance threshold).

I don't know what the equivalent might be for companies.

A fairly crude technique would be to append such data to each company name, thus increasing the number of tokens available in each indexable term.

Or, the metadata can be used downstream to further verify whether two terms match or not.
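
As a rough illustration of the downstream check, here is a small Python sketch using the film example, since I don't know the company equivalent. The field names ("type", "year") and the default tolerance are assumptions made for illustration only.

```python
def metadata_compatible(a: dict, b: dict, year_tolerance: int = 1) -> bool:
    """Return False only when the metadata clearly contradicts a match.

    Missing metadata is treated as "no evidence", not as a mismatch.
    The keys "type" and "year" are hypothetical field names.
    """
    type_a, type_b = a.get("type"), b.get("type")
    if type_a and type_b and type_a != type_b:
        return False
    year_a, year_b = a.get("year"), b.get("year")
    if year_a is not None and year_b is not None:
        return abs(year_a - year_b) <= year_tolerance
    return True


if __name__ == "__main__":
    # Illustrative records: same title, same type, release years far apart.
    first = {"title": "Kicking and Screaming", "type": "movie", "year": 1995}
    second = {"title": "Kicking and Screaming", "type": "movie", "year": 2005}
    print(metadata_compatible(first, second))
```

A check like this can only veto a name match; it should not be used on its own to declare that two records are the same entity.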

Matching & Score Thresholds

Let's assume we have simple word-boundary indexed terms (although there are plenty of other ways to go - ngrams, shingles, etc.).

Now we perform a search on each company name (plus additional metadata, if we added it).

Let's assume we have defined a threshold score that must be reached for a search result to be considered a match. The score should be easily adjustable to tune matching behavior.
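
For Elasticsearch specifically, one way to express this might be a match query with the min_score search parameter, so that the threshold is a single adjustable number. The sketch below only builds the request body as a Python dict; the field name "name" and the default result size are assumptions, and the right threshold value has to be found by experiment.

```python
def build_search_body(company_name: str, min_score: float, size: int = 5) -> dict:
    """Build a search request body for one company name.

    Assumes each document stores the company name in a field called
    "name" (hypothetical). Hits scoring below min_score are excluded.
    """
    return {
        "min_score": min_score,
        "size": size,
        "query": {"match": {"name": {"query": company_name}}},
    }


if __name__ == "__main__":
    # Example: search index B with a name taken from index A.
    print(build_search_body("A.P. Møller - Mærsk A", min_score=5.0))
```

Keeping the body in one small function makes it easy to rerun the whole matching process with a different threshold.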

If we get only one match which exceeds this score, we can assume we have an automated match: the two names represent the same underlying company.

If we get zero matches which exceed this score, then we can assume the company name is unique in our data set.

If we get multiple matches, then that is the point at which manual intervention may be needed, to determine if the names are equivalent or not.
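
The three outcomes above can be sketched as a small decision function. This is only an outline of the logic: the hits are assumed to be (name, score) pairs already returned by whatever search you run, and the label strings are my own.

```python
from typing import List, Tuple


def classify(hits: List[Tuple[str, float]], threshold: float) -> Tuple[str, List[str]]:
    """Decide what to do with the search results for one company name.

    hits: (candidate_name, score) pairs from a search (assumed format).
    Returns a label plus the candidates that exceeded the threshold.
    """
    above = [name for name, score in hits if score > threshold]
    if len(above) == 1:
        return ("match", above)    # automated match
    if not above:
        return ("unique", [])      # no counterpart found
    return ("review", above)       # needs manual intervention


if __name__ == "__main__":
    hits = [("A.P. Møller - Mærsk A/S", 9.2), ("Møller Transport", 3.1)]
    print(classify(hits, threshold=5.0))
```

Writing the "review" cases to a file for a human to resolve keeps the process repeatable.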

Test Cases

The aim is to minimize false positive matches, while also minimizing match misses.

How do you know whether you have achieved that?

The only good answer I have for this is to generate a set of test cases. And the best way to do that is to study the data, so you can find suitably cunning & devious cases to test.
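
A minimal sketch of such a test harness in Python follows. The matcher is any function that says whether two names refer to the same company; naive_matcher and the handful of cases are placeholders I made up for illustration, and real cases should come from your own data.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str, bool]  # (name_a, name_b, expected_same_company)

# Placeholder cases; replace with cunning examples found in your data.
CASES: List[Case] = [
    ("A.P. Møller - Mærsk A", "A.P. Møller - Mærsk A/S", True),
    ("Smith & Sons, Ltd", "Smith and Sons Limited", True),
    ("The The", "The", False),
    ("smith and sons limited", "Smith and Sons Limited", True),
]


def naive_matcher(a: str, b: str) -> bool:
    """Baseline only: case-insensitive exact comparison."""
    return a.lower() == b.lower()


def evaluate(matcher: Callable[[str, str], bool], cases: List[Case]) -> Dict[str, List[Case]]:
    """Collect false positives and misses for a matcher."""
    report: Dict[str, List[Case]] = {"false_positives": [], "misses": []}
    for case in cases:
        name_a, name_b, expected = case
        actual = matcher(name_a, name_b)
        if actual and not expected:
            report["false_positives"].append(case)
        elif expected and not actual:
            report["misses"].append(case)
    return report


if __name__ == "__main__":
    result = evaluate(naive_matcher, CASES)
    print(len(result["false_positives"]), "false positives")
    print(len(result["misses"]), "misses")
```

Rerunning the same cases after each change to tokenization or the score threshold shows whether the change helped or hurt.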

Conclusion

This all sounds like a lot of work. How much of it you actually do, and how rigorously, is up to you. It depends on your context, of course.

answered Sep 21 '25 10:09 by andrewJames