I have a column where each row is a sentence. For example:
COLUMN1
R1: -Do you think they'll come, sir?
R2: -Oh they'll come, they'll come all right.
R3: Here. Stamp those and mail them.
R4: It's ringing.
R5: Would you walk Myron the other way?
From this range, I want to extract a list of unique words (COLUMN2), and a count of how often they appeared in the range (COLUMN3).
The trick is to remove punctuation marks like commas, periods, etc..
So the desired result for the above would be:
COLUMN2    COLUMN3
Do          1
you         2
think       1
they'll     3
come        2
sir         1
Oh          1
all         1
right       1
Here        1
Stamp       1
those       1
and         1
mail        1
them        1
It's        1
ringing     1
Would       1
walk        1
Myron       1
the         1
other       1
way         1
I tried parsing each row with the SPLIT function, separating each word into their own cells, but I'm stuck removing the punctuation, and building the list of unique words (which I know will involve the UNIQUE function). The count I'm guessing will also involve the COUNTUNIQUE function.
Any guidance will be appreciated!
You could try something like
=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 label Count(Col1)''")
Change range to suit.

If you want to exclude a list of words (ex. in the range J1:J20) you can try
=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''"))
Alternatively, you can also add the list of exclusions to the regex pattern...
=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']|\b((?i)the|oh|or|and)\b" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>''  group by Col1 order by Count(Col1) desc label Count(Col1)''")
UPDATED:
=ArrayFormula(substitute(query(transpose(split(query(regexreplace(substitute(C11:C, char(39), "_"), "[^A-Za-z\s_]" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(substitute(G11:G,char(39),"_")))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''", 0), "_", char(39)))
or, using a different approach
=query(filter(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",), isna(match(upper(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",)), upper(filter(G11:G, len(G11:G))),0))), "Select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''", 0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With