Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most Frequent Words in List

I have a list in Excel, a subset of which looks like this:

Food and Human Nutrition
Food and Human Nutrition with Placement
Food and Nutrition with Professional Experience
Food Marketing and Nutrition
Food Marketing and Nutrition with Placement
Food, Nutrition and Health

I'd like to find the n most common words within this list. I tried with this formula to find the most common word:

=INDEX(rng,MODE(MATCH(rng,rng,0)))

The issue with this is that it views each cell as a single string, and as each of the 6 rows is different does not find a most common word. What I'd like it to do is output 'Food', 'Nutrition' and 'and' as the most common words, followed by 'Marketing', 'Placement', 'with' etc.

like image 435
Nat Aes Avatar asked Oct 18 '25 16:10

Nat Aes


2 Answers

Here is a VBA macro that provides what you seem to want.

  • Uniqueness is tested by using a dictionary object
  • Counts are done in the dictionary
  • Results are then sorted

Read the comments closely in the code for assumptions that need to be made. And reference that needs to be set

Also, note that punctuation can cause the same word to be counted in different categories. If this might be a problem, we merely need to split the source data differently, either eliminating all punctuation before splitting on spaces, or by using Regular Expressions to do the split.

'Set Reference to Microsoft Scripting Runtime

Option Explicit
Sub UniqueWordCounts()
    Dim wsSrc As Worksheet, wsRes As Worksheet
    Dim rSrc As Range, rRes As Range
    Dim vSrc As Variant, vRes As Variant
    Dim vWords As Variant
    Dim dWords As Dictionary
    Dim I As Long, J As Long
    Dim V As Variant, vKey As Variant

'Assume source data is in column 1, starting at A1
'  Could easily be anyplace
Set wsSrc = Worksheets("sheet2")
With wsSrc
    Set rSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With

'Results to go a few columns over
Set wsRes = Worksheets("sheet2")
    Set rRes = rSrc(1, 1).Offset(0, 2)

'Read source data into vba array (for processing speed)
vSrc = rSrc

'Collect individual words and counts into dictionary
Set dWords = New Dictionary
    dWords.CompareMode = TextCompare

For I = 1 To UBound(vSrc, 1)

    'Split the sentence into individual words
    For Each vKey In Split(vSrc(I, 1))
        If Not dWords.Exists(vKey) Then
            dWords.Add Key:=vKey, Item:=1
        Else
            dWords(vKey) = dWords(vKey) + 1
        End If
    Next vKey
Next I

'Size results array
ReDim vRes(0 To dWords.Count, 1 To 2)

'Column headers
    vRes(0, 1) = "Word"
    vRes(0, 2) = "Count"

'Populate the columns
    I = 0
    For Each V In dWords.Keys
        I = I + 1
        vRes(I, 1) = V
        vRes(I, 2) = dWords(V)
    Next V

'Size results range
Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))

'Populate, format and sort the Results range
With rRes
    .EntireColumn.Clear
    .Value = vRes
    With .Rows(1)
        .Font.Bold = True
        .HorizontalAlignment = xlCenter
    End With
    .EntireColumn.AutoFit
    .Sort key1:=.Columns(2), order1:=xlDescending, key2:=.Columns(1), order2:=xlAscending, MatchCase:=False, Header:=xlYes
End With

End Sub

enter image description here

like image 136
Ron Rosenfeld Avatar answered Oct 20 '25 08:10

Ron Rosenfeld


If you know & want to use VBA, then it would be quite an interestring task. Thus, some custom formula like this =MostCommonWords(Range;Optional WordsNumber) would give you this result:

enter image description here

This is the code, behind the formula:

Public Function MostCommonWords(inputRange As Range, _
                        Optional NumberOfWords As Long = 1) As String

    Dim myCell      As Range
    Dim inputString As String, tempString As String, myResult As String
    Dim myArr       As Variant, myKey As Variant
    Dim cnt         As Long, topNumber As Long
    Dim myColl      As Object

    Set myColl = CreateObject("Scripting.Dictionary")        
    For Each myCell In inputRange
        tempString = LCase(Replace(myCell, ",", ""))
        inputString = inputString & " " & tempString
    Next myCell        
    myArr = Split(inputString)        
    For cnt = LBound(myArr) To UBound(myArr)
        If myColl.exists(myArr(cnt)) Then
            myColl(myArr(cnt)) = myColl(myArr(cnt)) + 1
        Else
            myColl.Add myArr(cnt), 1
        End If
    Next cnt 
    For cnt = 1 To NumberOfWords
        topNumber = 0
        myResult = vbNullString            
        For Each myKey In myColl
            If topNumber < myColl(myKey) Then
                topNumber = myColl(myKey)
                myResult = myKey
            End If
        Next myKey            
        MostCommonWords = MostCommonWords & " " & myResult
        myColl.Remove myResult
    Next cnt

End Function

How does it work?

  • It adds all of the cells from the input range to a string, named inputString.
  • Commas are removed and all words are transfered to lower case.
  • Using a dictionary structure, every word is added as a key and the times it is used as a value.
  • Based on the number of words required to be displayed, a check for the biggest value on the dictionary is carried out and its key is presented.
  • Once the biggest value is found, the key is removed from the dictionary, thus the second biggest value can be found - myColl.Remove myResult.
like image 43
Vityata Avatar answered Oct 20 '25 06:10

Vityata



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!