
Getting most used words from a column of strings in SQL

So we have this database filled with a bunch of strings, in this case post titles.

What I want to do is:

  1. Split each string into words
  2. Count how many times each word appears across all strings
  3. Return the top 50 words
  4. Not have this time out in a data.se query

I tried using the info from this SO question adapted to data.se as follows:

select word, count(*) from (
select (case when instr(substr(p.Title, nums.n+1), ' ') then substr(p.Title, nums.n+1)
             else substr(p.Title, nums.n+1, instr(substr(p.Title, nums.n+1), ' ') - 1)
        end) as word
from (select ' '||Title as string
      from Posts p
     )Posts cross join
     (select 1 as n union all select 2 union all select 10
     ) nums
where substr(p.Title, nums.n, 1) = ' ' and substr(p.Title, nums.n, 1) <> ' '
) w
group by word
order by count(*) desc

Unfortunately, this gives me a slew of errors:

'substr' is not a recognized built-in function name. Incorrect syntax near '|'. Incorrect syntax near 'nums'.

So given a column of strings in SQL with a variable amount of text in each string, how can I get a list of the most frequently used X words?

jmac asked May 26 '16 01:05


1 Answer

Query solution (No Split Function Required)

PostgreSQL

select word, count(*) from 
(
    -- get 1st words
    select split_part(title, ' ', 1) as word
    from posts

    union all

    -- get 2nd words
    select split_part(title, ' ', 2) as word
    from posts

    union all

    -- get 3rd words
    select split_part(title, ' ', 3) as word
    from posts

    -- add one more branch per word position, up to the number of words in the longest title

) words
where word is not null
and word NOT IN ('', 'and', 'for', 'of', 'on')
group by word
order by count desc
limit 50;
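
The repeated branches above can also be generated instead of written out by hand. The following is a minimal sketch of that idea, still in PostgreSQL and still based on split_part. The upper bound of 30 words per title and the alias names n and uses are assumptions made for illustration, not part of the original query:

```sql
-- Sketch: one split_part call per word position, generated by generate_series.
-- Assumption: no title has more than 30 space-separated words.
select word, count(*) as uses
from (
    select split_part(title, ' ', n) as word
    from posts
    cross join generate_series(1, 30) as n
) words
-- split_part returns '' for positions past the end of the title
where word not in ('', 'and', 'for', 'of', 'on')
group by word
order by uses desc
limit 50;
```

If titles can be longer than the assumed bound, the words past that position are silently ignored, so the bound should be raised to match the data.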

For a more concise version, see: https://dba.stackexchange.com/a/82456/95929
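
The queries above are PostgreSQL, while the errors quoted in the question come from SQL Server, which data.se runs on: T-SQL has SUBSTRING and CHARINDEX rather than substr and instr, and concatenates strings with + rather than ||. A rough T-SQL equivalent might look like the sketch below. It assumes the server supports STRING_SPLIT (SQL Server 2016 or later, compatibility level 130), which may or may not hold for a given environment, and the alias names s and uses are illustrative:

```sql
-- Sketch for SQL Server: split each title on spaces and count the pieces.
-- Assumption: STRING_SPLIT is available (SQL Server 2016+, compatibility level 130).
SELECT TOP (50)
       s.value  AS word,
       COUNT(*) AS uses
FROM Posts AS p
CROSS APPLY STRING_SPLIT(p.Title, ' ') AS s
WHERE s.value NOT IN ('', 'and', 'for', 'of', 'on')
GROUP BY s.value
ORDER BY uses DESC;
```

This splits on single spaces only, so punctuation stays attached to words ("SQL?" and "SQL" are counted separately) unless the titles are cleaned first.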

Ali Saeed answered Sep 30 '22 19:09