Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

determining most used set of words php mysql

Tags:

string

mysql

I'm trying to figure out how to go about determining the most used words on a mysql dataset.

Not sure how to go about this or if there's a simpler approach. Read a couple posts where some suggests an algorithm.

Example:

From 24,500 records, find out the top 10 used words.

like image 335
Codex73 Avatar asked Nov 02 '12 02:11

Codex73


1 Answers

Right, this runs like a dog and is limited to working with a single delimiter, but hopefully will give you an idea.

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM SomeTable
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more (but will slow down even more).

like image 61
Kickstart Avatar answered Oct 25 '22 02:10

Kickstart