Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).
pdata <- data.frame(parents_name = c("peter pan + marta steward",
                                     "pieter pan + marta steward",
                                     "armin dolgner + jane johanna dough",
                                     "jack jackson + sombody else"))
The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth observations each form a separate family. E.g.:
person_id  parents_name                          family_id
1          "peter pan + marta steward"           1
2          "pieter pan + marta steward"          1
3          "armin dolgner + jane johanna dough"  2
4          "jack jackson + sombody else"         3
Current approach: I am flexible regarding the distance metric. Currently I use Levenshtein edit distance to match observations, allowing for two-character differences. Other variants, such as longest common substring, would also be fine if they run faster.
For smaller subsamples I use stringdist::stringdist in a loop, or stringdist::stringdistmatrix, but this is getting increasingly inefficient as the sample size increases.
The matrix version runs out of memory once the sample size passes a certain point. My terribly inefficient attempt at looping is here:
# create data of the same complexity using random last names
# (4 million obs and ~1-3 kids per set of parents)
pdata <- data.frame(parents_name = paste0(
  rep(c("peter pan + marta ",
        "pieter pan + marta ",
        "armin dolgner + jane johanna ",
        "jack jackson + sombody "), 1e6),
  stringi::stri_rand_strings(4e6, 5)))
for (i in 1:nrow(pdata)) {
  similar_fatersname0 <- stringdist::stringdist(pdata$parents_name[i],
                                                pdata$parents_name[i:nrow(pdata)],
                                                nthread = 4) < 2
  # [create grouping indicator]
}
My question: There should be substantial efficiency gains to be had, e.g. because I could stop comparing strings once I have found them to be sufficiently different in something that is easier to assess, e.g. string length or the first word. The string-length variant already works and reduces complexity by a factor of ~3, but that is by far too little. Any suggestions to reduce computation time are appreciated.
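For illustration, a minimal sketch of that length pre-filter (a distance < 2 implies the two strings can differ in length by at most one character):

len <- nchar(as.character(pdata$parents_name))
for (i in 1:nrow(pdata)) {
  # cheap filter first: only strings whose length differs by at most 1 can match
  candidates <- which(abs(len - len[i]) <= 1)
  candidates <- candidates[candidates >= i]   # keep the i:nrow(pdata) logic
  similar <- candidates[
    stringdist::stringdist(pdata$parents_name[i],
                           pdata$parents_name[candidates],
                           nthread = 4) < 2
  ]
  # [create grouping indicator from 'similar']
}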
Remarks:
There are two challenges:
A. The parallel execution of the Levenshtein distance - instead of a sequential loop
B. The number of comparisons: if our source list has 4 million entries, we would theoretically have to run 16 trillion Levenshtein distance measurements, which is unrealistic even if we resolve the first challenge.
To make my use of language clear, here are our definitions:
Technically, the goal is to find sets of matching expressions in the 4-million-expression list. Two expressions are considered matching if their Levenshtein distance is less than 2.
Practically, we create two lists which are exact copies of the initial 4-million-expression list. We call them the Left list and the Right list. Each expression is assigned an expression id before duplicating the list. Our goal is to find entries in the Right list which have a Levenshtein distance of less than 2 to entries in the Left list, excluding identical entries (same expression id).
I suggest a two-step approach to resolve the two challenges separately. The first step reduces the list of possible matching expressions, the second simplifies the Levenshtein distance measurement since we only look at very close expressions. The technology used is any traditional database server, because we need to index the data sets for performance.
CHALLENGE A
Challenge A consists of reducing the number of distance measurements. We start from a maximum of roughly 16 trillion (4 million squared) and we should not exceed a few tens or hundreds of millions. The technique here is to search for at least one similar word in the complete expression. Depending on how the data is distributed, this will dramatically reduce the number of possible matching pairs. Alternatively, depending on the required accuracy of the result, we can also search for pairs with at least two similar words, or with at least half of the words similar.
Technically, I suggest putting the expression list in a table, adding an identity column to create a unique id per expression, and creating 12 character columns. Then parse the expressions and put each word of each section in a separate column. It will look like this (I have not shown all 12 columns, but the idea is below):
| id | expression                | sect_a_w_1 | sect_a_w_2 | sect_b_w_1 | sect_b_w_2 |
| 1  | peter pan + marta steward | peter      | pan        | marta      | steward    |
Most rows will have empty columns (since very few expressions have 12 words), but that does not matter.
Then we replicate the table and create an index on every sect... column. We run 12 joins which look for identical words, something like:
SELECT L.id, R.id
FROM left_table L JOIN right_table R
  ON L.sect_a_w_1 = R.sect_a_w_1
 AND L.id <> R.id
We collect the output in 12 temp tables and run a union query over the 12 tables to get a short list of all expressions which have a potential matching expression with at least one identical word. This is the solution to challenge A: we now have a short list of the most likely matching pairs. The list will contain millions of records (pairs of Left and Right entries), but not billions.
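Since the question is in R, here is a hedged data.table sketch of the same candidate-reduction idea, using one row per word instead of 12 wide columns and a self-join on the word (object names are illustrative, not part of the SQL solution above):

library(data.table)

expr  <- as.character(pdata$parents_name)
parts <- strsplit(expr, " ", fixed = TRUE)

# long word table: one row per (expression id, word position, word)
words <- data.table(
  id       = rep(seq_along(expr), lengths(parts)),
  word_seq = unlist(lapply(parts, seq_along)),
  word     = unlist(parts)
)[word != "+"]          # drop the separator between the two sections
setkey(words, word)

# candidate pairs: expressions sharing at least one identical word
# (very frequent words may need special handling to keep this list small)
cand <- unique(
  words[words, on = "word", allow.cartesian = TRUE][i.id < id,
        .(left_id = i.id, right_id = id)]
)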
CHALLENGE B
The goal of challenge B is to process a simplified Levenshtein distance in batch (instead of running it in a loop). First we should agree on what a simplified Levenshtein distance is. We start by defining the Levenshtein distance of two expressions as the sum of the Levenshtein distances of all the words of the two expressions which have the same index; that is, the distance of their two first words, plus the distance of their two second words, etc. Secondly, we need an approximation of the Levenshtein distance itself. I suggest using an n-gram approach with grams of 2 characters whose index absolute difference is less than 2.
For example, the distance between peter and pieter is calculated as below:
Peter
1 = pe
2 = et
3 = te
4 = er
5 = r_
Pieter
1 = pi
2 = ie
3 = et
4 = te
5 = er
6 = r_
Peter and Pieter have 4 common 2-grams with an index absolute difference of less than 2: 'et', 'te', 'er', 'r_'. There are 6 possible 2-grams in the longer of the two words, so the distance is 6 - 4 = 2. (The exact Levenshtein distance is 1, a single insertion of 'i', so the approximation overestimates it slightly here.)
This is an approximation which will not work in all cases, but I think in this situation it will work very well. If we're not satisfied with the quality of the results, we can try 3-grams or 4-grams, or allow a gram sequence difference larger than 2. But the idea is to execute far fewer calculations per pair than the traditional Levenshtein algorithm.
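As a sanity check of the idea, here is a small R sketch of this simplified distance (my own reading of the rule above, not part of any package):

gram2 <- function(w) {
  w <- paste0(w, "_")                     # pad so the last letter forms a gram
  substring(w, 1:(nchar(w) - 1), 2:nchar(w))
}

approx_dist <- function(a, b) {
  ga <- gram2(a); gb <- gram2(b)
  # count grams of 'a' found in 'b' at an index differing by less than 2
  common <- sum(vapply(seq_along(ga), function(i) {
    idx <- intersect(seq_along(gb), (i - 1):(i + 1))
    any(gb[idx] == ga[i])
  }, logical(1)))
  max(length(ga), length(gb)) - common
}

approx_dist("peter", "pieter")   # 2, as in the worked example above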
Then we need to convert this into a technical solution. What I have done before is the following. First, isolate the words: since we only need to measure the distance between words, and then sum these distances per expression, we can further reduce the number of calculations by running a distinct select on the list of words (we already prepared the list of words in the previous section).
This approach requires a mapping table which keeps track of the expression id, the section id, the word id and the word sequence number for each word, so that the original expression distance can be calculated at the end of the process.
We then have a new list which is much shorter and contains only the word pairs for which the 2-gram distance measure is relevant. We want to batch-process this 2-gram distance measurement, and I suggest doing it in a SQL join. This requires a pre-processing step: create a new temporary table which stores every 2-gram in a separate row and keeps track of the word id, the gram sequence and the section type.
Technically this is done by slicing the list of words with a series (or a loop) of substring selects, like this (assuming the word list tables - there are two copies, one Left and one Right - contain the two columns word_id and word):
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 1 AS gram_seq, SUBSTRING(word,1,2) AS gram
FROM left_word_table
And then
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 2 AS gram_seq, SUBSTRING(word,2,2) AS gram
FROM left_word_table
Etc.
This will make "steward" look like the following (assume the word id is 152):
| pk | word_id | gram_seq | gram |
| 1 | 152 | 1 | st |
| 2 | 152 | 2 | te |
| 3 | 152 | 3 | ew |
| 4 | 152 | 4 | wa |
| 5 | 152 | 5 | ar |
| 6 | 152 | 6 | rd |
| 7 | 152 | 7 | d_ |
Don't forget to create an index on the word_id, gram and gram_seq columns. The distance can then be calculated with a join of the left and the right gram lists, where the ON clause looks like:
ON L.gram = R.gram
AND ABS(L.gram_seq - R.gram_seq) < 2
AND L.word_id <> R.word_id
The distance is the length of the longer of the two words minus the number of matching grams. SQL is extremely fast at such a query, and I think a simple computer with 8 GB of RAM would easily handle several hundred million rows in a reasonable time frame.
And then it's only a matter of joining the mapping table to sum the word-to-word distances in every expression and obtain the total expression-to-expression distance.
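To close the loop in R terms: once the candidate list is as short as the one produced in challenge A, the exact stringdist call from the question becomes affordable again, and family ids can be derived from the matched pairs. A hedged sketch, reusing the cand and expr objects from the data.table sketch above and assuming the igraph package for the connected components:

cand[, dist := stringdist::stringdist(expr[left_id], expr[right_id],
                                      method = "lv", nthread = 4)]
matches <- cand[dist < 2]

# each connected component of the match graph is one family
g <- igraph::graph_from_data_frame(matches[, .(left_id, right_id)],
                                   directed = FALSE,
                                   vertices = data.frame(name = seq_along(expr)))
pdata$family_id <- igraph::components(g)$membership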
You are using the stringdist package anyway; does stringdist::phonetic() suit your needs? It computes the soundex code for each string, e.g.:
phonetic(pdata$parents_name)
[1] "P361" "P361" "A655" "J225"
Soundex is a tried-and-true method (almost 100 years old) for hashing names, and that means you don't need to compare every single pair of observations.
You might want to go further and compute the soundex of the first name and last name separately for father and mother.
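A hedged sketch of turning these codes into the family id from the question (assuming soundex equality is accepted as the grouping rule):

library(stringdist)
codes <- phonetic(pdata$parents_name)
# family id in order of first appearance of each soundex code
pdata$family_id <- as.integer(factor(codes, levels = unique(codes)))
# gives 1, 1, 2, 3 for the four example names, matching the expected output above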