I have a table with two columns, email and id. I need to find emails that are closely related. For example:
john.smith12@mail.example.com
and
john.smith12@example.com
These should be considered the same because the username (john.smith12) and the base domain (example.com) are the same. They are currently two different rows in my table. I've written the expression below, which should do that comparison, but it takes hours to execute (possibly/probably because of the regex). Is there a better way to write this:
select c1.email, c2.email
from table as c1
join table as c2
on (
c1.leadid <> c2.leadid
and
c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))
The explain of this query comes back as:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 577532, NULL
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 577532, Using where; Using join buffer (Block Nested Loop)
The create table is:
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
I guess the indices aren't being used because of the regexp.
The regex comes out as:
john[.]smith12@[^@]*example[.]com
which should match both addresses.
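For reference, the pattern construction can be checked by hand (using the bare-domain address as the c2 side):
select replace(replace('john.smith12@example.com', '.', '[.]'), '@', '@[^@]*');
-- inner replace: john[.]smith12@example[.]com
-- outer replace: john[.]smith12@[^@]*example[.]com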
Update:
I've modified the ON clause to be:
on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid
    and substr(c1.email, 1, locate('@', c1.email) - 1) = substr(c2.email, 1, locate('@', c2.email) - 1)
    and substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))
and the explain with this approach is at least using the indices.
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)
So far this has been executing for 5 minutes; I will update if there is a vast improvement.
Update 2:
I've split the email so that the username is one column and the domain is another. I've stored the domain in reverse order so that its index can be used with a trailing wildcard.
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
`domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`),
KEY `domain` (`domain`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
Query to populate new columns:
update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)),
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
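For reference, here is what those expressions produce for the two example addresses above (the mail. subdomain is just an illustration):
SELECT reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX('john.smith12@mail.example.com', '@', -1), '.', -3)));
-- 'moc.elpmaxe.liam'
SELECT reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX('john.smith12@example.com', '@', -1), '.', -3)));
-- 'moc.elpmaxe'
So the join condition c1.domain like concat(c2.domain, '%') becomes 'moc.elpmaxe.liam' like 'moc.elpmaxe%', i.e. a pattern with only a trailing wildcard.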
New query:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
New Explain Results:
1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
From that explain it looks like the domain index is not being used. I also tried to force its usage with USE INDEX, but that didn't work either; it resulted in no indices being used at all:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
Explain with USE INDEX:
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
You told us that the table has 700K rows.
This is not much, but you are joining it to itself, so in the worst case the engine would have to process 700K * 700K = 490 000 000 000 = 490B rows.
An index can definitely help here.
The best index depends on the data distribution.
What does the following query return?
SELECT COUNT(DISTINCT username)
FROM table
If the result is a large fraction of 700K, say 100K or more, it means that there are a lot of different usernames and you'd better focus on them rather than on the domain. If the result is low, say 100, then indexing username is unlikely to be useful.
I hope that there are a lot of different usernames, so I'd create an index on username, since the query joins on that column using a simple equality comparison, and that join would greatly benefit from the index.
Another option to consider is a composite index on (username, domain), or even a covering index on (username, domain, leadid, email). The order of columns in the index definition is important.
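For example, something along these lines (the index names are placeholders; with utf8 columns of these lengths the key may exceed InnoDB's index-length limit, hence the prefix length, and a covering variant would append leadid and email subject to the same limit):
alter table `table`
  add index idx_username (username(100)),
  add index idx_username_domain (username(100), domain);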
I'd delete all other indexes, so that the optimiser can't make another choice, unless there are other queries that may need them.
Most likely it won't hurt to define a primary key on the table as well.
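For example (assuming the existing ID values are unique):
alter table `table` add primary key (ID);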
There is one more, less important, thing to consider. Does your data really have NULLs? If not, define the columns as NOT NULL. In many cases it is better to have empty strings rather than NULLs, unless you have very specific requirements and need to distinguish between NULL and ''.
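Something like this, as a sketch (backfilling empty strings first, assuming the NULLs don't carry meaning you need to keep):
update `table`
set Email = coalesce(Email, ''),
    domain = coalesce(domain, ''),
    username = coalesce(username, '');
alter table `table`
  modify Email varchar(100) not null default '',
  modify domain varchar(100) character set utf8 not null default '',
  modify username varchar(500) character set utf8 not null default '';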
The query would be slightly simpler:
select
c1.email, c2.email,
c1.domain, c2.domain,
c1.username, c2.username,
c1.leadid, c2.leadid
from
table as c1
join table as c2
on c1.username = c2.username
and c1.domain like concat(c2.domain, '%')
and c1.leadid <> c2.leadid