Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL LIKE, how to sort results by weighted occurrence count?

I have my search term:

"Yellow large widgets"

I split the terms into its 3 words:

1 = "Yellow";
2 = "Large";
2 = "Widgets";

I then search with:

SELECT * FROM widgets
    WHERE (description LIKE '%yellow%' OR description LIKE '%large%' OR description LIKE 'widgets') 
    OR (title LIKE '%yellow%' OR title LIKE '%large%' OR title LIKE '%widgets%')

How can I sort the results with these biases?

  • Title takes presidence, if any of the terms appear in title they should be considered more important
  • Occurrence counts, results with higher total occurances should appear first

Ideal Methodoloy

  • Count occurrences in description.
  • Each occurrence here is worth 1 point.
  • Count occurrences in title.
  • Each title occurrence is worth 5 points.
  • Sort by points.

But I wouldn't know where to start doing that in SQL.

like image 933
Tom Gullen Avatar asked Feb 24 '11 11:02

Tom Gullen


2 Answers

Okay, let's have your search terms in a temp table:

CREATE TABLE #SearchTerms (Term varchar(50) not null)
insert into #SearchTerms (Term)
select 'yellow' union all
select 'large' union all
select 'widgets'

And let's do something silly:

select
    widgets.ID,
    (LEN(description) - LEN(REPLACE(description,Term,''))) / LEN(Term) as DescScore
    (LEN(title) - LEN(REPLACE(title,Term,''))) / LEN(Term) as TitleScore
from
    widgets,#SearchTerms

We've now counted each occurrence of each term, in both the description and the title.

So now we can sum and weight those occurrences:

select
    widgets.ID,
    SUM((LEN(description) - LEN(REPLACE(description,Term,''))) / LEN(Term) +
    ((LEN(title) - LEN(REPLACE(title,Term,''))) / LEN(Term) *5)) as CombinedScore
from
    widgets,#SearchTerms
group by
    Widgets.ID

And if we need to do more with this, I'd recommend putting the above in a subselect

select
    w.*,CombinedScore
from
    widgets.w
       inner join
    (select
        widgets.ID,
        SUM((LEN(description) - LEN(REPLACE(description,Term,''))) / LEN(Term) +
        ((LEN(title) - LEN(REPLACE(title,Term,''))) / LEN(Term) *5)) as CombinedScore
    from
        widgets,#SearchTerms
    group by
        Widgets.ID
    ) t
        on
            w.ID = t.ID
where
    CombinedScore > 0
order by
    CombinedScore desc

(Note that I've assumed there's an ID column in all these examples, but that can be expanded into as many columns as are necessary to define the PK in the widgets table)


The real trick here is counting occurrences of a word within a larger body of text, which is done by:

(LEN(text) - LEN(text with each occurrence of term removed)) / LEN(term)
like image 194
Damien_The_Unbeliever Avatar answered Dec 21 '22 23:12

Damien_The_Unbeliever


Choose one of the following options:

  • use a fulltext search engine (Lucene or similiar) to do weighted full text search;
  • see this other StackOverflow question
  • do multiple SELECTs, each one with a relevance field, merge them with UNION and sort the result;
  • sort your results in your application, after having retrieved the results.
like image 36
Alessandro Avatar answered Dec 21 '22 23:12

Alessandro