I'm trying to do the following-
I can do steps 1 and 2. What I don't understand is how to go to each of those URLs and get data from them (the data is similar across the URLs, but not identical).
EDIT: More information: I read the search terms from a CSV file and get a few IDs (with URLs) from each results page. I'd like to go to each of these URLs to get more IDs from the page it leads to, and write all of it into a CSV file. Basically, I want my output to look something like this:
Level1 ID1    Level2 ID1    Level3 ID
              Level2 ID2    Level3 ID
              .
              .
              .
              Level2 IDN    Level3 ID
Level1 ID2    Level2 ID1    Level3 ID
              Level2 ID2    Level3 ID
              .
              .
              .
              Level2 IDN    Level3 ID
There can be multiple Level2 IDs for each Level1 ID. But there will be only one corresponding Level3 ID for each Level2 ID.
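One way to write rows in that shape is the csv module. This is only a sketch, under the assumption that the scraped IDs have already been collected into a dict mapping each Level1 ID to its (Level2 ID, Level3 ID) pairs; the names and ID values below are made-up placeholders (see the fetching sketch further down for how such a dict might be built).

import csv

# Hypothetical structure: each Level1 ID maps to a list of (Level2 ID, Level3 ID)
# pairs -- the values below are made-up placeholders, not real IDs.
results = {
    'GSE1': [('GSM11', 'SRX111'), ('GSM12', 'SRX112')],
    'GSE2': [('GSM21', 'SRX211')],
}

with open('output.csv', 'wb') as out:       # on Python 3, use open('output.csv', 'w', newline='')
    writer = csv.writer(out)
    for level1_id, pairs in results.items():
        for level2_id, level3_id in pairs:
            # one row per Level2 ID, repeating the Level1 ID it belongs to
            writer.writerow([level1_id, level2_id, level3_id])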
CODE that I've written so far:
import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A', 'B', 'C', 'D']
data = pd.read_csv('file.csv', names=colnames)
listofdata = list(data.A)
id = '\n'.join(listofdata[1:])  # skip the header row; this joins every ID into one string

def download_gsm_number(gse_id):
    url = "http://www.example.com" + id  # appends all the IDs at once, not one per request
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data = readurl.read()  # unused; the handle was already consumed by BeautifulSoup
    #url_file_handle.close()
    pattern = re.compile(r'''some(.*?)pattern''')
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence
But this takes all the IDs at once into a single URL. As I mentioned before, I need to get Level2 IDs from the Level1 IDs given as input, and then Level3 IDs from those Level2 IDs. If I can get just one part of this working (either the Level2 or the Level3 step), I can figure out the rest.
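Here is a rough sketch of the missing piece, assuming there is one URL prefix per level and one regex per page type; the URL prefixes and patterns below are placeholders, not the real ones. The point is to make one request per ID instead of joining them all into a single URL, and to nest the Level2 loop inside the Level1 loop.

import re
from urllib import urlopen

LEVEL1_URL = "http://www.example.com/"   # placeholder: prefix for a Level1 ID's page
LEVEL2_URL = "http://www.example.com/"   # placeholder: prefix for a Level2 ID's page
level2_pattern = re.compile(r'''some(.*?)pattern''')    # placeholder: matches Level2 IDs on a Level1 page
level3_pattern = re.compile(r'''other(.*?)pattern''')   # placeholder: matches the Level3 ID on a Level2 page

def fetch(url):
    # Download one page and return its HTML as a string.
    handle = urlopen(url)
    try:
        return handle.read()
    finally:
        handle.close()

results = {}
for level1_id in listofdata[1:]:             # one request per Level1 ID, not all IDs joined together
    pairs = []
    for level2_id in level2_pattern.findall(fetch(LEVEL1_URL + level1_id)):
        matches = level3_pattern.findall(fetch(LEVEL2_URL + level2_id))
        pairs.append((level2_id, matches[0] if matches else ''))  # only one Level3 ID per Level2 ID
    results[level1_id] = pairs

From there, writing results out to a file is the CSV loop sketched above.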
I believe your answer is urllib.
It is actually as easy as going:
web_page = urllib.urlopen(url_string)
And then with that you can do normal file operations such as:
read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()
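A quick standalone example (example.com is just a stand-in URL):

import urllib

page = urllib.urlopen("http://www.example.com")
print page.getcode()   # HTTP status code, e.g. 200
print page.geturl()    # final URL after any redirects
print page.info()      # response headers
html = page.read()     # the page body as a string
page.close()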
From there, I would suggest using BeautifulSoup for the parsing, which is as easy as:
soup = BeautifulSoup(web_page.read())
And then you can do all the wonderful BeautifulSoup operations on it.
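For instance, listing the links on a page could look like this (the 'a' tag and href attribute are just illustrative; use whatever tags hold your IDs):

for link in soup.find_all('a'):   # find_all is the bs4 name; old BeautifulSoup 3 spells it findAll
    print link.get('href')        # the link target, if present
    print link.get_text()         # the visible link text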
I would imagine Scrapy is overkill here, with a lot more overhead involved. BeautifulSoup has great documentation and examples, and is just plain easy to use.