I am trying to update a large table (about 1M rows) on PostgreSQL with the count of words in a field. This query works, and sets the token_count field to the number of words (tokens) in longtext in table my_table:
UPDATE my_table mt SET token_count =
  (select count(token)
   from (select unnest(regexp_matches(t.longtext, E'\\w+', 'g')) as token
         from my_table as t
         where mt.myid = t.myid) as tokens);
myid is the primary key of the table. The \\w+ pattern is necessary because I want to count words, ignoring special characters.
For example, A test . ; ) would return 5 with a space-based count, while 2 is the correct value.
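To illustrate the difference in psql (a sketch, using the example string above):

```sql
-- \w+ matches only the word tokens 'A' and 'test':
select count(*)
from (select unnest(regexp_matches('A test . ; )', E'\\w+', 'g'))) t;
-- returns 2

-- splitting on single spaces counts every chunk, including '.', ';' and ')':
select array_length(regexp_split_to_array('A test . ; )', ' '), 1);
-- returns 5
```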
The issue is that it's horribly slow: 2 days were not enough to complete it on 1M rows. What would you do to optimise it? Are there ways to avoid the join? How can I split the batch into blocks, using for example limit and offset?
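One sketch of batching (the range bounds and step size here are illustrative assumptions): UPDATE has no LIMIT/OFFSET clause in PostgreSQL, so batches are usually carved out with a WHERE clause on the primary key instead.

```sql
-- Process one slice of the table per statement, keyed on the primary key.
-- Re-run with the next range (100001-200000, ...) until the table is covered.
UPDATE my_table mt
SET token_count =
  (select count(*)
   from (select unnest(regexp_matches(mt.longtext, E'\\w+', 'g'))) as tokens)
WHERE mt.myid BETWEEN 1 AND 100000;
```

Each batch can be committed separately, which keeps transactions short and lets you resume after an interruption.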
Thanks for any tips,
Mulone
UPDATE: I measured the performance of the array_split approach, and the update is going to be slow anyway. So maybe a solution would consist of parallelising it. If I run different queries from psql, only one query runs and the others wait for it to finish. How can I parallelise an update?
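A sketch of manual parallelisation (assuming myid is split into disjoint ranges, with illustrative bounds): concurrent UPDATEs block on row locks only when they touch the same rows, so sessions that each take a disjoint key range can proceed in parallel.

```sql
-- Session 1 (first psql connection):
UPDATE my_table mt
SET token_count =
  (select count(*)
   from (select unnest(regexp_matches(mt.longtext, E'\\w+', 'g'))) as tokens)
WHERE mt.myid BETWEEN 1 AND 500000;

-- Session 2 (second psql connection), disjoint range, so no row-lock conflict:
UPDATE my_table mt
SET token_count =
  (select count(*)
   from (select unnest(regexp_matches(mt.longtext, E'\\w+', 'g'))) as tokens)
WHERE mt.myid BETWEEN 500001 AND 1000000;
```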
Have you tried using array_length?
UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1);
(regexp_split_to_array always splits globally, so it does not accept the 'g' flag.)
http://www.postgresql.org/docs/current/static/functions-array.html
# select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
array_length
--------------
3
(1 row)
UPDATE my_table
SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1)
Or your original query without a correlation:
UPDATE my_table
SET token_count = (
  select count(*)
  from (select unnest(regexp_matches(longtext, E'\\w+', 'g'))) s
);