Is there any way to get information about positions of a lexeme in a sentence and number of occurrences from tsvector?
Something like this
SELECT *
FROM get_position(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats'));
will return 3
and
SELECT *
FROM get_occurrences(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats'));
will return 1.
Textual representation of a tsvector contains a list of occurrences for particular lexeme:
test=# select to_tsvector ( 'english', 'new bar in New York' );
to_tsvector
----------------------------
'bar':2 'new':1,4 'york':5
Below is exemplary function relying on that. It accepts text parameters and converts them to ts_vector internally, but could be easily rewritten to accept ts_vector.
CREATE OR REPLACE FUNCTION lexeme_occurrences (
IN _document text
, IN _word text
, IN _config regconfig
, OUT lexeme_count int
, OUT lexeme_positions int[]
) RETURNS RECORD
AS $$
DECLARE
_lexemes tsvector := to_tsvector ( _config, _document );
_searched_lexeme tsvector := strip ( to_tsvector ( _config, _word ) );
_occurences_pattern text := _searched_lexeme::text || ':([0-9,]+)';
_occurences_list text := substring ( _lexemes::text, _occurences_pattern );
BEGIN
SELECT
count ( a )
, array_agg ( a::int )
FROM regexp_split_to_table ( _occurences_list, ',' ) a
WHERE _searched_lexeme::text != '' -- preventing false positives
INTO
lexeme_count
, lexeme_positions;
RETURN;
END $$ LANGUAGE plpgsql;
Example usage:
select * from lexeme_occurrences ( 'The Fat Rats', 'rat', 'english' );
lexeme_count | lexeme_positions
--------------+-----------------
1 | {3}
(1 row)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With