I am effectively trying to solve the same problem as this question:
Finding related words (specifically physical objects) to a specific word
minus the requirement that the words represent physical objects. The answers and the edited question seem to indicate that a good start is building a list of n-gram frequencies using Wikipedia text as a corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know if such a list already exists?
PS if the original poster of the previous question sees this, I would love to know how you went about solving the problem, as your results seem excellent :-)
An n-gram is a sequence of n successive items in a text document; the items may be words, numbers, symbols, or punctuation. N-gram models are useful in many text-analytics applications where sequences of words matter, such as sentiment analysis, text classification, and text generation.
N-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
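For example, here is a minimal sketch of pulling word n-grams out of a piece of text, using a naive whitespace split rather than a real tokenizer:

# Minimal sketch: extract word n-grams from text with a naive whitespace split.
def ngrams(text, n):
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the quick brown fox jumps", 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]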
N-gram indexing is a powerful method for getting fast, "search as you type" functionality like iTunes offers. It is also useful for quick and effective indexing of languages such as Chinese and Japanese, which are written without word breaks.
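As a rough illustration of the indexing idea (a sketch, not how any particular product implements it): map every short character sequence to the terms that contain it, then intersect those sets for whatever the user has typed so far.

from collections import defaultdict

# Sketch of a character-bigram index for "search as you type" lookups.
def build_index(terms, n=2):
    index = defaultdict(set)
    for term in terms:
        for i in range(len(term) - n + 1):
            index[term[i:i + n]].add(term)
    return index

def lookup(index, query, n=2):
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()
    # A term is a candidate only if it contains every character bigram of the query.
    candidates = set.intersection(*(index[g] for g in grams))
    return {t for t in candidates if query in t}

index = build_index(["nirvana", "nina simone", "santana"])
print(sorted(lookup(index, "ana")))   # ['nirvana', 'santana']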
In its simplest definition, an n-gram is just a string of words, and an n-gram model uses the counts of those strings to estimate the probability of the word that comes next. Described that way it may sound a little difficult, but it is actually quite simple and straightforward.
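For instance, given unigram and bigram counts, the probability of the next word can be estimated by dividing the bigram count by the count of the first word (a plain maximum-likelihood estimate, with no smoothing):

from collections import defaultdict

# Toy maximum-likelihood estimate of P(next word | word) from bigram counts.
corpus = "the cat sat on the mat the cat ran".split()
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    unigrams[w1] += 1
    bigrams[(w1, w2)] += 1

# P("cat" | "the") = count("the cat") / count("the") = 2/3
print(bigrams[("the", "cat")] / unigrams["the"])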
Google has a publicly available terabyte n-gram database (with n-grams up to length 5).
You can order it on 6 DVDs or find a torrent that hosts it.
You can find the June 2008 Wikipedia n-grams here. In addition to the n-grams it also has headwords and tagged sentences. I tried to create my own n-grams, but ran out of memory (32 GB) on the bigrams (the current English Wikipedia is massive). It also took about 8 hours to extract the XML, 5 hours for the unigrams, and 8 hours for the bigrams.
The linked n-grams also have the benefit of having been cleaned up somewhat, since MediaWiki markup and Wikipedia leave a lot of junk mixed in with the text.
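If you do build the bigrams yourself, one way to keep memory bounded (at the cost of undercounting rare pairs) is to prune low-count entries periodically while counting. A rough sketch; count_bigrams_bounded is just an illustrative helper and the thresholds are not tuned:

from collections import defaultdict

def count_bigrams_bounded(documents, prune_every=100000, min_keep=2):
    # Count word bigrams over an iterable of document strings, pruning rare
    # pairs every prune_every documents so the dict does not grow unbounded.
    d2s = defaultdict(int)
    for doc_no, content in enumerate(documents, start=1):
        words = content.split()   # naive split; the answer below uses NLTK tokenizers
        for w1, w2 in zip(words, words[1:]):
            d2s[(w1, w2)] += 1
        if doc_no % prune_every == 0:
            d2s = defaultdict(int, {k: v for k, v in d2s.items() if v >= min_keep})
    return d2s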
Here's my Python code:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from datetime import datetime
from collections import deque
from collections import defaultdict
from collections import OrderedDict
import operator
import os

# Loop through all the English Wikipedia article files and store their paths in a list. 4 minutes.
dir = r'D:\Downloads\Wikipedia\articles'
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

t1 = datetime.now()
# For each article (file) loop through all the words and generate unigrams. 1175MB memory use spotted.
# 12 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d1s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        # Fall back to Latin-1 for files that are not valid UTF-8.
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
    words = wordpunct_tokenize(content)
    # Take all the words from the article and count them.
    for word in words:
        d1s[word] = d1s[word] + 1
    c = c + 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))
t2 = datetime.now()
print('After unigram: ' + str(t2 - t1))

t1 = datetime.now()
# Sort the defaultdict in descending order and write the unigrams to a file.
# 0:00:27.740082 was output. 3285Mb memory. 165Mb output file.
d1ss = OrderedDict(sorted(d1s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d1ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After unigram write: ' + str(t2 - t1))
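# Optional: the en_ngram1.txt file written above can later be read back into a
# dict by splitting on the same '┼' separator, e.g.:
# d1s = {}
# with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", encoding="utf-8") as f_in:
#     for line in f_in:
#         word, _, count = line.rstrip("\n").partition('┼')
#         d1s[word] = int(count)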
# Determine the lowest 1gram count we are interested in.
low_count = 20 - 1
d1s = {}
# Keep only the 1grams whose count is above the threshold, as a plain dict.
for word, count in d1ss.items():
    # The counts are sorted in descending order, so stop once we reach the threshold.
    if count <= low_count:
        break
    # Add the count to the dict.
    d1s[word] = count

t1 = datetime.now()
# For each article (file) loop through all the sentences and generate 2grams. 13GB memory use spotted.
# 17 minutes to first output. 4200000: 4:37:24.586706 was last output.
c = 1
d2s = defaultdict(int)
for file in l:
    try:
        with open(file, encoding="utf8") as f_in:
            content = f_in.read()
    except UnicodeDecodeError:
        # Fall back to Latin-1 for files that are not valid UTF-8.
        with open(file, encoding="latin-1") as f_in:
            content = f_in.read()
    # Extract the sentences in the file content.
    sentences = deque()
    sentences.extend(sent_tokenize(content))
    # Get all the words for one sentence at a time.
    for sentence in sentences:
        # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
        words = wordpunct_tokenize(sentence)
        # Count adjacent pairs of words that both have a high 1gram count.
        for i, word in enumerate(words):
            if word in d1s and i + 1 < len(words):
                word2 = words[i + 1]
                if word2 in d1s:
                    gram2 = word + ' ' + word2
                    d2s[gram2] = d2s[gram2] + 1
    c = c + 1
    if c % 200000 == 0:
        t2 = datetime.now()
        print(str(c) + ': ' + str(t2 - t1))
t2 = datetime.now()
print('After bigram: ' + str(t2 - t1))
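The script stops after counting the bigrams; writing them out can follow the same sort-and-write pattern (and '┼' separator) as the unigram step. A minimal sketch, where the en_ngram2.txt filename is only an assumption:

t1 = datetime.now()
# Sort the bigrams in descending order and write them out, mirroring the unigram step.
# The en_ngram2.txt filename is an assumption; the original post does not name one.
d2ss = OrderedDict(sorted(d2s.items(), key=operator.itemgetter(1), reverse=True))
with open("D:\\Downloads\\Wikipedia\\en_ngram2.txt", mode="w", encoding="utf-8") as f_out:
    for k, v in d2ss.items():
        f_out.write(k + '┼' + str(v) + "\n")
t2 = datetime.now()
print('After bigram write: ' + str(t2 - t1))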