
Web Scraping Rap lyrics on Rap Genius w/ Python

I am somewhat of a coding novice, and I have been trying to scrape Andre 3000's lyrics off Rap Genius, http://genius.com/artists/Andre-3000, using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://rapgenius.com"
artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
    print song_links
    return song_links

get_song_links(artist_url)

So I need help with the rest of the code. How do I get his lyrics into string format? And then how do I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?

asked Jul 21 '14 by Ibrewster

3 Answers

Here's an example of how to grab all of the song links on the page, follow them, and get the song lyrics:

from urlparse import urljoin
from bs4 import BeautifulSoup
import requests


BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
response = requests.get(artist_url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk

Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 Forbidden without it.
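As an aside, urljoin is what makes the link building robust: it joins a relative href onto the base URL but leaves an absolute href untouched. A quick illustration (in Python 3 the same function lives in urllib.parse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE_URL = "http://genius.com"

# A relative href is joined onto the base URL
print(urljoin(BASE_URL, "/songs/123"))  # http://genius.com/songs/123

# An absolute href replaces the base entirely
print(urljoin(BASE_URL, "http://example.com/x"))  # http://example.com/x
```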

answered Nov 10 '22 by alecxe

First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on that page that separates the lyrics from other page content. I found <a data-editorial-state="accepted" data-classification="accepted" data-group="0"> to be helpful. Then run a .find_all() on the lyrics page content to get all lyric lines. For each line you can call .get_text() to get the text of that lyric line.
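A minimal sketch of the steps above, using a small static HTML snippet as a stand-in for a downloaded song page (the attribute names are the ones noted above and may have changed on the live site):

```python
from bs4 import BeautifulSoup

# Stand-in for a downloaded song page; in practice this string would come
# from fetching the song link (e.g. requests.get(song_url).text)
html = '''
<div>
  <a data-editorial-state="accepted" data-classification="accepted" data-group="0">Hello line one</a>
  <a data-editorial-state="accepted" data-classification="accepted" data-group="0">Hello line two</a>
  <a href="/songs">navigation link</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# Keep only the elements carrying the distinguishing lyric attribute
lines = soup.find_all('a', attrs={'data-editorial-state': 'accepted'})
lyric_text = '\n'.join(line.get_text() for line in lines)
print(lyric_text)
```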

As for NLTK, once it is installed you can import it and parse sentences like so:

from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all words in each sentence.
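Once you have that nested list, plain Python is enough for simple follow-up analysis. A small sketch, where the hand-written nested list stands in for real word_tokenize output, counting word frequencies with collections.Counter:

```python
from collections import Counter

# Stand-in for the output of the NLTK snippet above:
# one sub-list of word tokens per sentence.
words = [['So', 'fresh', ',', 'so', 'clean'],
         ['So', 'fresh', 'and', 'so', 'clean', 'clean']]

# Flatten the sentences into one token stream
tokens = [w for sentence in words for w in sentence]

# Count case-folded word frequencies, skipping punctuation tokens
counts = Counter(w.lower() for w in tokens if w.isalpha())
print(counts.most_common(3))  # [('so', 4), ('clean', 3), ('fresh', 2)]
```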

answered Nov 10 '22 by Andrew Johnson

Even if you can scrape the site, that doesn't mean you should. Instead, you can use the Genius API; just create an access token on the Genius API site:

import lyricsgenius as genius  # client library for the Genius API

api = genius.Genius('youraccesstokenhere12345678901234567890isreallylongiknow')
artist = api.search_artist('The artist name here')
# You can change the parameters according to your needs. I don't recommend
# using the saved file directly, because it stores a lot of data that you
# might not need and will take more time to clean.
aux = artist.save_lyrics(format='json', filename='artist.txt',
                         overwrite=True, skip_duplicates=True, verbose=True)

# In this case, for example, I just want the titles and lyrics
titles = [song['title'] for song in aux['songs']]
lyrics = [song['lyrics'] for song in aux['songs']]

things_to_save = []
for title, lyric in zip(titles, lyrics):
    things_to_save.append(title)
    things_to_save.append(lyric)

with open("C:/whateverfolder/alllyrics.txt", "w") as output:
    output.write(str(things_to_save))
answered Nov 10 '22 by Julian Abril