Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving filepath search in mysql

I have several million filenames I need to search through. They look like this:

LG_MARGINCALL_HD2CH_127879834_EN.mov

If someone searches for any of the following, it should match:

  • margin
  • margin call
  • margin call mov
  • margin call hd en
  • margin call hd en mov

What I am currently using is a mysql %LIKE% search. Something like:

SELECT filename FROM path WHERE filename LIKE '%margin%' AND filename LIKE '%mov%'

It is deathly slow (can take up to ten seconds for a search). Note that it does work though.

What would be a better way to do the above search? Either using mysql or another program.

like image 824
David542 Avatar asked Mar 23 '14 17:03

David542


Video Answer


1 Answers

Your search strategy is, as you noticed, slow. It's slow because

 LIKE '%something%'

has to scan the table to find matches. Leading % signs on LIKE searches are an excellent way to wreck performance.

I don't know how many columns are in your path table. If there are a lot of columns you could do two quick things to improve performance:

  1. get rid of SELECT * and list the names of the columns you want in your resultset.
  2. create a compound index consisting of your filename column followed by the other columns you need to retrieve.

(This won't help if you only have a few columns in your table.)

You can't use straight-out-of-the-software-package FULLTEXT searching for this stuff, because that's designed for language text.

If I had to make this work fast for production, I would do this:

First, create a new table called "searchterm" containing

 filename_id INT   the id number of a row in your path table
 searchterm  VARCHAR(20)  a fragment of a filename.

Second, write a program that reads the filename_id and filename values, and inserts a bunch of different rows for each one into searchterm. For the item you've shown the values should be:

LG_MARGINCALL_HD2CH_127879834_EN.mov   (original)
LG  MARGINCALL  HD2CH  127879834  EN  mov   (split on punctuation)
 HD 2 CH                                    (split on embedded numerics)
 MARGIN CALL                                (split on an app-specific list of words)

So, you'd have a bunch of entries in your searchterm table, all with the same filename_id value and lots of different little chunks of text.

Finally, when searching you could do this.

 SELECT path.id, path.filename, path.whatever,
        COUNT(DISTINCT searchterms.term) AS termcount
   FROM path
   JOIN searchterm ON path.filenanme_id = search.filename_id
  WHERE searchterm.term  IN ('margin','call','hd','en', 'mov')
  GROUP BY path.id, path.filename, path.whatever
  ORDER BY path.filename, COUNT(DISTINCT searchterms.term) DESC

This little query finds all the matching fragments to what you're search for. It returns multiple file names, and it presents them in order of what matches the most terms.

What I'm suggesting is that you create your own application-specific kinda- sorta- full text search system. If your really have several million multimedia files, this is surely worth your effort.

like image 105
O. Jones Avatar answered Sep 22 '22 01:09

O. Jones