I have several million filenames I need to search through. They look like this:
LG_MARGINCALL_HD2CH_127879834_EN.mov
If someone searches for any of the following, it should match:
What I am currently using is a mysql %LIKE% search. Something like:
SELECT filename FROM path WHERE filename LIKE '%margin%' AND filename LIKE '%mov%'
It is deathly slow (can take up to ten seconds for a search). Note that it does work though.
What would be a better way to do the above search? Either using mysql or another program.
Your search strategy is, as you noticed, slow. It's slow because
LIKE '%something%'
has to scan the table to find matches. Leading % signs on LIKE
searches are an excellent way to wreck performance.
I don't know how many columns are in your path
table. If there are a lot of columns you could do two quick things to improve performance:
SELECT *
and list the names of the columns you want in your resultset.filename
column followed by the other columns you need to retrieve.(This won't help if you only have a few columns in your table.)
You can't use straight-out-of-the-software-package FULLTEXT
searching for this stuff, because that's designed for language text.
If I had to make this work fast for production, I would do this:
First, create a new table called "searchterm" containing
filename_id INT the id number of a row in your path table
searchterm VARCHAR(20) a fragment of a filename.
Second, write a program that reads the filename_id
and filename
values, and inserts a bunch of different rows for each one into searchterm
. For the item you've shown the values should be:
LG_MARGINCALL_HD2CH_127879834_EN.mov (original)
LG MARGINCALL HD2CH 127879834 EN mov (split on punctuation)
HD 2 CH (split on embedded numerics)
MARGIN CALL (split on an app-specific list of words)
So, you'd have a bunch of entries in your searchterm table, all with the same filename_id
value and lots of different little chunks of text.
Finally, when searching you could do this.
SELECT path.id, path.filename, path.whatever,
COUNT(DISTINCT searchterms.term) AS termcount
FROM path
JOIN searchterm ON path.filenanme_id = search.filename_id
WHERE searchterm.term IN ('margin','call','hd','en', 'mov')
GROUP BY path.id, path.filename, path.whatever
ORDER BY path.filename, COUNT(DISTINCT searchterms.term) DESC
This little query finds all the matching fragments to what you're search for. It returns multiple file names, and it presents them in order of what matches the most terms.
What I'm suggesting is that you create your own application-specific kinda- sorta- full text search system. If your really have several million multimedia files, this is surely worth your effort.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With