 

SQL optimisation - Word count in string - Postgresql

I am trying to update a large table (about 1M rows) in PostgreSQL, setting a field to the count of words in a text column. This query works, and sets the token_count field to the number of words (tokens) in longtext in table my_table:

UPDATE my_table mt SET token_count = 
    (select count(token) from 
      (select unnest(regexp_matches(t.longtext, E'\\w+','g')) as token
      from my_table as t where mt.myid = t.myid)
    as tokens);

myid is the primary key of the table. \\w+ is necessary because I want to count words while ignoring special characters. For example, A test . ; ) would return 5 with a space-based count, while 2 is the right value. The issue is that it's horribly slow: 2 days were not enough to complete it on 1M rows. What would you do to optimise it? Is there a way to avoid the self-join?
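The counting semantics can be checked outside the database. A minimal Python sketch (not the poster's code) showing why matching on \w+ gives 2 for this example while splitting on whitespace gives 5:

```python
import re

def token_count(text: str) -> int:
    """Count word tokens, ignoring punctuation,
    like regexp_matches(longtext, '\\w+', 'g')."""
    return len(re.findall(r"\w+", text))

sample = "A test . ; )"
print(len(sample.split()))   # space-based count: 5
print(token_count(sample))   # word-based count: 2
```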

How can I split the update into batches, using for example LIMIT and OFFSET?
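Rather than LIMIT/OFFSET (which rescans the skipped rows on every batch), a common pattern is to batch on primary-key ranges: each batch is one bounded UPDATE with `WHERE myid > low AND myid <= high`. A hedged Python sketch of the range generator; the step size and the SQL template in the comment are illustrative assumptions, not from the original post:

```python
def id_ranges(max_id: int, step: int):
    """Yield (low, high] primary-key ranges covering 1..max_id."""
    low = 0
    while low < max_id:
        high = min(low + step, max_id)
        yield low, high
        low = high

# Each range becomes one bounded UPDATE, committed separately, e.g.:
# UPDATE my_table SET token_count = ... WHERE myid > %s AND myid <= %s
for low, high in id_ranges(1_000_000, 250_000):
    print(low, high)
```

This assumes myid values are roughly dense; sparse ids just make some batches smaller, which is harmless.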

Thanks for any tips,

Mulone

UPDATE: I measured the performance of the array_split approach, and the update is going to be slow anyway. So maybe a solution would be to parallelise it. But if I run different queries from psql, only one query runs and the others wait for it to finish. How can I parallelise an update?
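PostgreSQL can run several UPDATEs concurrently as long as they touch disjoint rows; sessions that appear to queue are usually blocked on the same row locks or on one long-running transaction. So one approach is to give each session (or worker) a disjoint myid range. A sketch of the idea, with an in-memory dict standing in for the table; the row data, chunking, and pool size are assumptions for illustration:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in for my_table: myid -> longtext
rows = {1: "A test . ; )", 2: "hello world", 3: "one"}

def update_range(ids):
    """Simulate one session's bounded UPDATE over a disjoint id chunk."""
    return {i: len(re.findall(r"\w+", rows[i])) for i in ids}

# Disjoint id chunks, one per session/worker.
chunks = [[1, 2], [3]]
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = {}
    for part in pool.map(update_range, chunks):
        counts.update(part)
print(counts)  # {1: 2, 2: 2, 3: 1}
```

In a real run, each worker would open its own connection and issue its bounded UPDATE; the key point is that the id ranges do not overlap, so the updates do not block each other.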

Asked Jun 19 '13 by Mulone



2 Answers

Have you tried using array_length?

UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1);

(Note: regexp_split_to_array takes no 'g' flag; splitting is always global.)

http://www.postgresql.org/docs/current/static/functions-array.html

# select array_length(regexp_split_to_array(trim(' some long text  '), E'\\W+'), 1);
 array_length 
--------------
            3
(1 row)
Answered Oct 18 '22 by Denis de Bernardy


UPDATE my_table
SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1)
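One caveat with splitting on \s+: leading or trailing whitespace produces empty array elements, inflating the count, which is why the other answer wraps longtext in trim(). The same effect can be seen with Python's re.split:

```python
import re

text = "  some long text  "
print(len(re.split(r"\s+", text)))          # 5: ['', 'some', 'long', 'text', '']
print(len(re.split(r"\s+", text.strip())))  # 3, after trimming
```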

Or your original query without the correlated self-join:

UPDATE my_table
SET token_count = (
    select count(*)
    from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s
    );
Answered Oct 18 '22 by Clodoaldo Neto