I am effectively trying to solve the same problem as this question:
Finding related words (specifically physical objects) to a specific word
minus the requirement that the words represent physical objects. The answers and the edited question seem to indicate that a good start is building a list of n-gram frequencies using Wikipedia text as a corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know if such a list already exists?
PS if the original poster of the previous question sees this, I would love to know how you went about solving the problem, as your results seem excellent :-)
An n-gram is a sequence of n successive items in a text document; the items may be words, numbers, symbols, or punctuation. N-gram models are useful in many text-analytics applications where sequences of words matter, such as sentiment analysis, text classification, and text generation.
N-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
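For example, here is a minimal sketch of pulling word n-grams out of a piece of text, using a naive whitespace split rather than a real tokenizer:

# Minimal sketch: extract word n-grams from text with a naive whitespace split.
def ngrams(text, n):
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the quick brown fox jumps", 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]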
N-gram indexing is a powerful method for getting fast, "search as you type" functionality like iTunes offers. It is also useful for quick and effective indexing of languages such as Chinese and Japanese, which are written without word breaks.
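As a rough illustration of the indexing idea (a sketch, not how any particular product implements it): map every short character sequence to the terms that contain it, then intersect those sets for whatever the user has typed so far.

from collections import defaultdict

# Sketch of a character-bigram index for "search as you type" lookups.
def build_index(terms, n=2):
    index = defaultdict(set)
    for term in terms:
        for i in range(len(term) - n + 1):
            index[term[i:i + n]].add(term)
    return index

def lookup(index, query, n=2):
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()
    # A term is a candidate only if it contains every character bigram of the query.
    candidates = set.intersection(*(index[g] for g in grams))
    return {t for t in candidates if query in t}

index = build_index(["nirvana", "nina simone", "santana"])
print(sorted(lookup(index, "ana")))   # ['nirvana', 'santana']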
In its simplest definition, an n-gram is just a string of words, and an n-gram model uses the counts of those strings to estimate the probability of the word that comes next. Described that way it may sound a little difficult, but it is actually quite simple and straightforward.
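For instance, given unigram and bigram counts, the probability of the next word can be estimated by dividing the bigram count by the count of the first word (a plain maximum-likelihood estimate, with no smoothing):

from collections import defaultdict

# Toy maximum-likelihood estimate of P(next word | word) from bigram counts.
corpus = "the cat sat on the mat the cat ran".split()
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    unigrams[w1] += 1
    bigrams[(w1, w2)] += 1

# P("cat" | "the") = count("the cat") / count("the") = 2/3
print(bigrams[("the", "cat")] / unigrams["the"])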
Google has a publicly available terabyte n-gram database (with n-grams up to length 5).
You can order it on 6 DVDs or find a torrent that hosts it.
You can find the June 2008 Wikipedia n-grams here. In addition to the n-grams it also has headwords and tagged sentences. I tried to create my own n-grams, but ran out of memory (32 GB) on the bigrams (the current English Wikipedia is massive). It also took about 8 hours to extract the XML, 5 hours for the unigrams, and 8 hours for the bigrams.
The linked n-grams also have the benefit of having been cleaned up somewhat, since MediaWiki markup and Wikipedia leave a lot of junk mixed in with the text.
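If you do build the bigrams yourself, one way to keep memory bounded (at the cost of undercounting rare pairs) is to prune low-count entries periodically while counting. A rough sketch; count_bigrams_bounded is just an illustrative helper and the thresholds are not tuned:

from collections import defaultdict

def count_bigrams_bounded(documents, prune_every=100000, min_keep=2):
    # Count word bigrams over an iterable of document strings, pruning rare
    # pairs every prune_every documents so the dict does not grow unbounded.
    d2s = defaultdict(int)
    for doc_no, content in enumerate(documents, start=1):
        words = content.split()   # naive split; the answer below uses NLTK tokenizers
        for w1, w2 in zip(words, words[1:]):
            d2s[(w1, w2)] += 1
        if doc_no % prune_every == 0:
            d2s = defaultdict(int, {k: v for k, v in d2s.items() if v >= min_keep})
    return d2s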
Here's my Python code:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from datetime import datetime
from collections import deque
from collections import defaultdict
from collections import OrderedDict
import operator
import os

# Loop through all the English Wikipedia article files and store their paths in a list. 4 minutes.
dir = r'D:\Downloads\Wikipedia\articles'
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

t1 = datetime.now()
# For each article (file) loop through all the words and generate unigrams. 1175MB memory use spotted.
# 12 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d1s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        # Fall back to Latin-1 for files that are not valid UTF-8.
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
    words = wordpunct_tokenize(content)
    # Take all the words from the article and count them.
    for word in words:
        d1s[word] = d1s[word] + 1
    c = c + 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))
t2 = datetime.now()
print('After unigram: ' + str(t2 - t1))

t1 = datetime.now()
# Sort the defaultdict in descending order and write the unigrams to a file.
# 0:00:27.740082 was output. 3285Mb memory. 165Mb output file.
d1ss = OrderedDict(sorted(d1s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d1ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After unigram write: ' + str(t2 - t1))
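# Optional: the en_ngram1.txt file written above can later be read back into a
# dict by splitting on the same '┼' separator, e.g.:
# d1s = {}
# with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", encoding="utf-8") as f_in:
#     for line in f_in:
#         word, _, count = line.rstrip("\n").partition('┼')
#         d1s[word] = int(count)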
# Determine the lowest 1gram count we are interested in.
low_count = 20 - 1
d1s = {}
# Keep only the 1grams whose count is above the threshold, as a plain dict.
for word, count in d1ss.items():
    # The counts are sorted in descending order, so stop once we reach the threshold.
    if count <= low_count:
        break
    # Add the count to the dict.
    d1s[word] = count

t1 = datetime.now()
# For each article (file) loop through all the sentences and generate 2grams. 13GB memory use spotted.
# 17 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d2s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        # Fall back to Latin-1 for files that are not valid UTF-8.
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    # Extract the sentences in the file content.
    sentences = deque()
    sentences.extend(sent_tokenize(content))
    # Get all the words for one sentence at a time.
    for sentence in sentences:
        # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
        words = wordpunct_tokenize(sentence)
        # Count adjacent pairs of words that both have a high 1gram count.
        for i, word in enumerate(words):
            if word in d1s and i + 1 < len(words):
                word2 = words[i + 1]
                if word2 in d1s:
                    gram2 = word + ' ' + word2
                    d2s[gram2] = d2s[gram2] + 1
    c = c + 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))
t2 = datetime.now()
print('After bigram: ' + str(t2 - t1))
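The script stops after counting the bigrams; writing them out can follow the same sort-and-write pattern (and '┼' separator) as the unigram step. A minimal sketch, where the en_ngram2.txt filename is only an assumption:

t1 = datetime.now()
# Sort the bigrams in descending order and write them out, mirroring the unigram step.
# The en_ngram2.txt filename is an assumption; the original post does not name one.
d2ss = OrderedDict(sorted(d2s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram2.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d2ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After bigram write: ' + str(t2 - t1))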