Looking for a database of n-grams taken from wikipedia

I am effectively trying to solve the same problem as this question:

Finding related words (specifically physical objects) to a specific word

minus the requirement that the words represent physical objects. The answers and the edited question seem to indicate that a good start is building a frequency list of n-grams using Wikipedia text as a corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know if such a list already exists?
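For reference, a frequency list of n-grams is just a count of each n-token sequence in the corpus. Here is a minimal sketch using NLTK and collections.Counter; the file name wikipedia_dump.txt is a placeholder, not a real dataset:

from collections import Counter
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

# Hypothetical example: count bigrams in a single plain-text dump file.
with open('wikipedia_dump.txt', encoding='utf-8') as f:
    words = wordpunct_tokenize(f.read())

bigram_counts = Counter(ngrams(words, 2))
for (w1, w2), count in bigram_counts.most_common(10):
    print(w1, w2, count)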

PS if the original poster of the previous question sees this, I would love to know how you went about solving the problem, as your results seem excellent :-)

asked Feb 24 '10 by mojones


2 Answers

Google has a publicly available terabyte n-gram database (up to 5-grams).
You can order it on 6 DVDs or find a torrent that hosts it.
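If you go that route, the data is plain text. Assuming the usual layout of one n-gram per line followed by a tab-separated count (the shard file name below is hypothetical), a minimal reader sketch:

from collections import defaultdict

# Hypothetical shard name; the real corpus ships as many compressed files.
counts = defaultdict(int)
with open('5gm-0000.txt', encoding='utf-8') as f:
    for line in f:
        ngram, sep, count = line.rstrip('\n').rpartition('\t')
        if sep:  # skip malformed lines without a tab
            counts[ngram] += int(count)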

answered by Shay Erlichmen

You can find the June 2008 Wikipedia n-grams here. In addition to the n-grams, it also has headwords and tagged sentences. I tried to create my own n-grams, but ran out of memory (32 GB) on the bigrams (the current English Wikipedia is massive). Extracting the XML took about 8 hours, generating the unigrams another 5 hours, and the bigrams 8 hours.

The linked n-grams also have the benefit of having been cleaned up somewhat, since MediaWiki markup and Wikipedia text contain a lot of junk mixed in with the article text.
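For a sense of what that cleaning involves, here is a deliberately simplified sketch that strips a few common MediaWiki constructs with regular expressions. Real dumps need a proper extractor; the patterns below ignore nesting and many edge cases:

import re

def strip_wiki_markup(text):
    # Drop non-nested templates like {{Infobox ...}}.
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)
    # Reduce [[target|label]] and [[target]] links to their visible text.
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)
    # Remove <ref>...</ref> citations and any leftover tags.
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', text, flags=re.DOTALL)
    text = re.sub(r'<[^>]+>', '', text)
    return text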

Here's my Python code:

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from datetime import datetime
from collections import defaultdict
from collections import OrderedDict
import operator
import os

# Loop through all the English Wikipedia Article files and store their path and filename in a list. 4 minutes.
article_dir = r'D:\Downloads\Wikipedia\articles'
l = [os.path.join(root, name) for root, _, files in os.walk(article_dir) for name in files]

t1 = datetime.now()

# For each article (file) loop through all the words and generate unigrams. 1175MB memory use spotted.
# 12 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d1s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    words = wordpunct_tokenize(content)    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
    # Count every token in the article.
    for word in words:
        d1s[word] += 1
    c += 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))

t2 = datetime.now()
print('After unigram: ' + str(t2 - t1))

t1 = datetime.now()
# Sort the defaultdict in descending order and write the unigrams to a file.
# 0:00:27.740082 was output. 3285Mb memory. 165Mb output file.
d1ss = OrderedDict(sorted(d1s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d1ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After unigram write: ' + str(t2 - t1))

# Keep only unigrams seen at least 20 times; drop the long tail.
low_count = 20 - 1
d1s = {}
# Rebuild d1s as a plain dict of the high-frequency unigram counts.
for word, count in d1ss.items():
    # d1ss is sorted by descending count, so stop once we reach the threshold.
    if count <= low_count:
        break
    d1s[word] = count

t1 = datetime.now()

# For each article (file) loop through all the sentences and generate 2grams. 13GB memory use spotted.
# 17 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d2s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    # Extract the sentences in the file content.
    sentences = sent_tokenize(content)
    # Get all the words for one sentence.
    for sentence in sentences:        
        words = wordpunct_tokenize(sentence)    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
        # Count adjacent word pairs where both words passed the unigram count threshold.
        for i, word in enumerate(words):
            if word in d1s and i + 1 < len(words):
                word2 = words[i + 1]
                if word2 in d1s:
                    gram2 = word + ' ' + word2
                    d2s[gram2] += 1
    c += 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))

t2 = datetime.now()
print('After bigram: ' + str(t2 - t1))
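The script above only counts the bigrams; a write-out step mirroring the unigram write (the output file name is my choice) would be:

t1 = datetime.now()
# Sort bigrams by descending count and write them with the same '┼' separator.
d2ss = OrderedDict(sorted(d2s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram2.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d2ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After bigram write: ' + str(t2 - t1))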
answered by Superdooperhero