Combining Lists of Word Frequency Data

Question

This seems like it should have an obvious answer, but I haven't found one in the tutorials and documentation on lists. My difficulties stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something my system can manage. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be loaded through Import[ ] or exist as variables generated in Mathematica. Each list looks like the following and was generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

Szabolcs · Accepted Answer

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2. Then use:

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This generalizes easily to any number of lists, not just two: pass all of them to Join.
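As a rough sketch of the many-list case, assuming the frequency lists have been collected into a single list (the names lists, data3, gathered, and combined are hypothetical, and the toy data only stands in for the real files), the same GatherBy idea might look like this:

```mathematica
(* Toy data standing in for the real frequency lists; all names are illustrative *)
data1 = {{"the", 5}, {"of", 3}, {"abated", 1}};
data2 = {{"the", 4}, {"de", 2}, {"of", 1}};
data3 = {{"de", 6}, {"the", 1}};
lists = {data1, data2, data3};

(* Join @@ lists merges every list into one long list of {word, count} pairs;
   GatherBy then groups the pairs that share the same word *)
gathered = GatherBy[Join @@ lists, First];

(* For each group, keep the word once and total its counts *)
combined = {#[[1, 1]], Total[#[[All, 2]]]} & /@ gathered;

(* Re-sort by descending count, since GatherBy keeps order of first appearance *)
SortBy[combined, -Last[#] &]
(* {{"the", 10}, {"de", 8}, {"of", 4}, {"abated", 1}} *)
```

If Mathematica 10 or later is available, GroupBy[Join @@ lists, First -> Last, Total] should give the same totals as an Association keyed by word, which may be more convenient for lookups.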

Tags:

wolfram-mathematica

word-frequency

canadian_scholar
