Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting and counting unique word frequency from a range

I have a column where each row is a sentence. For example:

COLUMN1

R1: -Do you think they'll come, sir?

R2: -Oh they'll come, they'll come all right.

R3: Here. Stamp those and mail them.

R4: It's ringing.

R5: Would you walk Myron the other way?

From this range, I want to extract a list of unique words (COLUMN2), and a count of how often they appeared in the range (COLUMN3).

The trick is to remove punctuation marks like commas, periods, etc..

So the desired result for the above would be:

COLUMN2    COLUMN3

Do          1

you         2

think       1

they'll     3

come        2

sir         1

Oh          1

all         1

right       1

Here        1

Stamp       1

those       1

and         1

mail        1

them        1

It's        1

ringing     1

Would       1

walk        1

Myron       1

the         1

other       1

way         1

I tried parsing each row with the SPLIT function, separating each word into their own cells, but I'm stuck removing the punctuation, and building the list of unique words (which I know will involve the UNIQUE function). The count I'm guessing will also involve the COUNTUNIQUE function.

Any guidance will be appreciated!

like image 349
Shian Han Avatar asked Oct 21 '25 12:10

Shian Han


1 Answers

You could try something like

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 label Count(Col1)''")

Change range to suit.

Example

If you want to exclude a list of words (ex. in the range J1:J20) you can try

=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''"))

Alternatively, you can also add the list of exclusions to the regex pattern...

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']|\b((?i)the|oh|or|and)\b" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>''  group by Col1 order by Count(Col1) desc label Count(Col1)''")

UPDATED:

=ArrayFormula(substitute(query(transpose(split(query(regexreplace(substitute(C11:C, char(39), "_"), "[^A-Za-z\s_]" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(substitute(G11:G,char(39),"_")))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''", 0), "_", char(39)))

or, using a different approach

=query(filter(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",), isna(match(upper(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",)), upper(filter(G11:G, len(G11:G))),0))), "Select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''", 0)
like image 124
JPV Avatar answered Oct 23 '25 07:10

JPV