Find Most Common Words from a Website in Python 3 [closed]

Question

I need to find and copy the words that appear more than 5 times on a given website using Python 3, and I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far:

   from urllib.request import urlopen
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

Does anyone have any advice on what to do? I have NLTK installed and I've looked into Beautiful Soup, but for the life of me I have no idea how to install it correctly (I'm very Python-green)! As I am learning, any explanation would also be very much appreciated. Thank you :)

Padraic Cunningham · Accepted Answer

This is not perfect, but it should give you an idea of how to get started using requests, BeautifulSoup and collections.Counter:

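import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(r.content, "html.parser")

# One string per <p> element, made by joining all of its text nodes.
text = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))

# Split each paragraph on whitespace, strip trailing punctuation, lowercase, then count.
c = Counter(x.rstrip(punctuation).lower() for y in text for x in y.split())
print(c.most_common())  # prints (word, count) pairs, starting at the most common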

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print([x for x in c if c[x] > 5])  # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
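
The list above still includes tokens with punctuation attached, such as 'salzburg,' and 'needed]'. One possible refinement, offered as a sketch and not as part of the original answer, is to pull words out with a regular expression before counting. The helper name words_over, the ASCII-only pattern, and the sample string are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: runs of ASCII letters/digits, optionally joined by
# apostrophes so that "mozart's" stays a single token.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def words_over(text, threshold=5):
    """Return {word: count} for words appearing more than `threshold` times."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return {word: n for word, n in counts.items() if n > threshold}


# Made-up sample text, not taken from the Wikipedia page.
sample = "Salzburg, Salzburg. [needed] Mozart's music, music; music!"
print(words_over(sample, threshold=1))
```

With the paragraph text from the answer joined into one string, words_over(page_text) would return only the words seen more than five times, without the trailing commas and brackets. Accented letters are split by this pattern, so it would need widening for non-English pages.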

Kyle Me · Answer

So, this is coming from a newbie, but if you just need a quick answer, I think this might work. Please note that with this method you cannot just give the program a URL; you have to manually paste the page's text into the code (sorry).

text = '''INSERT TEXT HERE'''.split()  # paste the text where you see "INSERT TEXT HERE"
# .split() turns the text into a list of words, splitting on whitespace.
# For example, "red dog food".split() gives ['red', 'dog', 'food'].
overusedwords = []  # holds the words that are used 5 or more times
for i in text:  # iterate through every word of the text
    if text.count(i) >= 5 and overusedwords.count(i) == 0:  # (1. Read below)
        overusedwords.append(i)  # add the word to the list of overused words
if len(overusedwords) > 0:  # only print the heading when there is something to list
    print('The overused words are:')
    for i in overusedwords:
        print(i)
else:
    print('No words used 5 or more times.')  # no word reached the threshold

To explain the "text.count(i) >= 5" part: on each pass through the for loop, it checks whether that specific word is used five or more times in the text. The second part, "and overusedwords.count(i) == 0", just makes sure that the same word isn't added twice to the list of overused words. Hope I helped. I'm thinking that you might have wanted a method where you could get this information straight from typing in the URL, but this might help other beginners that have a similar question.
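
The two checks described above can also be done in a single pass with collections.Counter, and the text can come from a URL instead of being pasted in. What follows is only a sketch under assumptions: it relies on the standard library (urllib.request and html.parser), the names ParagraphText, paragraph_text, overused and fetch_words are hypothetical, and it keeps only text found inside <p> tags.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # True while the parser is inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
            self.chunks.append("\n")  # keep neighbouring paragraphs apart

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def paragraph_text(html):
    """Return the text of all <p> elements in an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def overused(words, minimum=5):
    """Words used `minimum` or more times, each listed once, in first-seen order."""
    counts = Counter(words)
    return [word for word, n in counts.items() if n >= minimum]


def fetch_words(url):
    """Download a page and split its paragraph text on whitespace."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return paragraph_text(html).split()


# Offline demonstration on a made-up snippet of HTML.
html = "<p>red dog food</p><div>menu</div><p>red <b>dog</b> red</p>"
print(overused(paragraph_text(html).split(), minimum=2))
```

Calling overused(fetch_words(url)) with a page address would then replace the manual paste. Some sites reject the default urllib User-Agent, which this sketch does not set, so a real page may need a custom request header.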

Tags:

python

beautifulsoup

nltk

web-crawler

user3682157
