I am somewhat of a coding novice, and I have been trying to scrape Andre 3000's lyrics off Rap Genius, http://genius.com/artists/Andre-3000, using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen

artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
    print song_links

get_song_links(artist_url)

for link in soup.find_all('a'):
    print(link.get('href'))
So I need help with the rest of the code. How do I get his lyrics into string format? And then how do I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?
Here's an example of how to grab all of the song links on the page, follow them, and get the song lyrics:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

response = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])

    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk
Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 - Forbidden without it.
First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on that page that separates the lyrics from other page content. I found <a data-editorial-state="accepted" data-classification="accepted" data-group="0"> to be helpful. Then call .find_all() on the lyrics page content to get all the lyric lines, and call .get_text() on each line to extract its text.
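The filtering step above can be sketched with only the standard library's html.parser, to show how a distinguishing attribute separates lyric lines from the rest of the markup (BeautifulSoup's .find_all() with the same attributes would do this more concisely; the sample HTML here is invented for illustration):

```python
from html.parser import HTMLParser

class LyricLineParser(HTMLParser):
    """Collect text from <a> tags whose data-editorial-state is 'accepted'."""
    def __init__(self):
        super().__init__()
        self.in_lyric = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Only enter "lyric mode" for tags carrying the distinguishing attribute.
        if tag == "a" and dict(attrs).get("data-editorial-state") == "accepted":
            self.in_lyric = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_lyric = False

    def handle_data(self, data):
        if self.in_lyric and data.strip():
            self.lines.append(data.strip())

# Hypothetical snippet of a lyrics page:
sample = (
    '<div class="lyrics">'
    '<a data-editorial-state="accepted" data-group="0">First lyric line</a>'
    '<a data-editorial-state="pending" data-group="1">Not accepted yet</a>'
    '<a data-editorial-state="accepted" data-group="2">Second lyric line</a>'
    '</div>'
)
parser = LyricLineParser()
parser.feed(sample)
print(parser.lines)  # ['First lyric line', 'Second lyric line']
```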
As for NLTK, once it is installed you can import it and parse sentences like so:
from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]
This will give you a list of all words in each sentence.
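If you just want to see the nested-list shape that code produces, here is a rough standard-library approximation of the same sentence-then-word split; note this is not NLTK's actual algorithm (Punkt and the Treebank tokenizer are far more robust), only a sketch of the output structure:

```python
import re

def naive_sent_tokenize(text):
    # Split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    # Words (with optional apostrophes) plus standalone punctuation tokens.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

lyric_text = "Hey ya! Shake it like a Polaroid picture."  # sample lyric
words = [naive_word_tokenize(s) for s in naive_sent_tokenize(lyric_text)]
print(words)
# [['Hey', 'ya', '!'], ['Shake', 'it', 'like', 'a', 'Polaroid', 'picture', '.']]
```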
Even if you can scrape the site, that doesn't mean you should. Instead, you can use the API from Genius; just create an access token on the Genius API site.
import lyricsgenius as genius  # client library for the Genius API

api = genius.Genius('youraccesstokenhere12345678901234567890isreallylongiknow')
artist = api.search_artist('The artist name here')

# You can change the parameters according to your needs. I don't recommend
# using this file directly, because it saves a lot of data that you might
# not need and it will take more time to clean it.
aux = artist.save_lyrics(format='json', filename='artist.txt',
                         overwrite=True, skip_duplicates=True, verbose=True)

# In this case, for example, I just want the title and lyrics.
titles = [song['title'] for song in aux['songs']]
lyrics = [song['lyrics'] for song in aux['songs']]

thingstosave = []
for i in range(len(titles)):
    thingstosave.append(titles[i])
    thingstosave.append(lyrics[i])

with open("C:/whateverfolder/alllyrics.txt", "w") as output:
    output.write(str(thingstosave))
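Rather than writing str() of a flat list, you could keep each title paired with its lyrics and write real JSON, which is much easier to load back later. This is a sketch assuming the songs structure shown above; the aux dict here is a stub with invented sample data:

```python
import json

# Stub with the same shape as the save_lyrics() output assumed above.
aux = {"songs": [
    {"title": "Hey Ya!", "lyrics": "Shake it like a Polaroid picture"},
    {"title": "Roses", "lyrics": "Caroline, Caroline"},
]}

# Keep title and lyrics together instead of interleaving them in one flat list.
pairs = [{"title": s["title"], "lyrics": s["lyrics"]} for s in aux["songs"]]

with open("alllyrics.json", "w", encoding="utf-8") as output:
    json.dump(pairs, output, ensure_ascii=False, indent=2)
```

Loading it back is then a single json.load() call.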