Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to create a query in sql to chop sentences into words and add them to new table with their frequency

I'm trying to do a query that I'm not sure if it's possible I have a table called sentencess which contain ID, Sentences, and verify as shown in the picture bellow.

enter image description here

I have another table called, word count which contains ID, words, and there frequency. so I want when ever if a sentence entered updated, or deleted for this table to be updated accordingly or updated ones a day because there might be a lot of sentences

my expected output is something like the picture bellow.

enter image description here

any ideas is this doable can anyone help please.

like image 875
programming freak Avatar asked May 07 '20 08:05

programming freak


2 Answers

If you are running MySQL 8.0, I would recommend a recursive common table expression for this. The idea is to iteratively walk each message, splitting it into words along the way. All that is then left to do is to aggregate.

with recursive cte as (
    select 
        substring(concat(sent, ' '), 1, locate(' ', sent)) word,
        substring(concat(sent, ' '), locate(' ', sent) + 1) sent
    from messages
    union all
    select 
        substring(sent, 1, locate(' ', sent)) word,
        substring(sent, locate(' ', sent) + 1) sent
    from cte
    where locate(' ', sent) > 0
)
select row_number() over(order by count(*) desc, word) wid, word, count(*) freq
from cte 
group by word
order by wid

In earlier versions, you could emulate the same behavior with a numbers table.

Demo on DB Fiddle

Sample data:

sent                       | verif
:------------------------- | ----:
hello my name is alex      |  null
hey alin and alex I'm tom  |  null
hello alex my name is alin |  null

Results:

wid | word   | freq
--: | :----- | ---:
  1 | alex   |    3
  2 | alin   |    2
  3 | hello  |    2
  4 | is     |    2
  5 | my     |    2
  6 | name   |    2
  7 | and    |    1
  8 | hey    |    1
  9 | I'm    |    1
 10 | tom    |    1

When it comes to maintaining the results of the query in a separate table, it is probably more complicated than you think: you need to be able to insert, delete or update the target table depending on the changes in the original table, which cannot be done in a single statement in MySQL. Also, keeping a flag up to date in the original table creates a race condition, where changes might occur while your are updating the target target table.

A simpler option would be to put the query in a view, so you get an always-up-to-date perspective on your data. For this, you can just wrap the above query in a create view statement, like:

create view words_view as < above query >;

If performance becomes a problem, then you could also truncate and refill the words table periodically.

truncate table words;
insert into words < above query >;
like image 84
GMB Avatar answered Oct 13 '22 11:10

GMB


Perl and PHP and others have a much more robust regexp engine for splitting. I would use one of them, not SQL.

I would use batch inserts, using

INSERT INTO words (word, ct)
    VALUES ('this', 1), ('that', 1), ...   -- about 100 words at a time
    ON DUPLICATE KEY UPDATE ct = VALUES(ct) + 1;

CREATE TABLE words (
    word VARCHAR(66) NOT NULL,
    ct MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY(word)
) ENGINE=InnoDB;

I see no need for having words and counts in separate tables, nor any need for an AUTO_INCREMENT for a "word_id". The word is a perfectly good "natural PK". However, you should decide what to do about case folding and accent stripping.

As for splitting into words... double-quotes and some other characters are clearly word boundaries. But some characters are ambiguous:

' -- part of a contraction or a quote?
. -- abbreviation or end of a sentence

Etc.

like image 42
Rick James Avatar answered Oct 13 '22 10:10

Rick James