 

Get data from multiple URLs using Python

I'm trying to do the following:

  1. Go to a web page, enter a search term.
  2. Get some data from it.
  3. The page in turn contains multiple URLs. I need to parse each one of them to get some data out.

I can do 1 and 2. What I don't understand is how to go to all those URLs and get data (which is similar across the URLs, but not identical) from them.

EDIT: More information: I read the search terms from a CSV file and get a few IDs (with URLs) from each result page. I'd like to go to each of these URLs and get more IDs from the page it leads to, then write everything to a CSV file. Basically, I want my output to look like this:

Level1 ID1   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID
Level1 ID2   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID

There can be multiple Level2 IDs for each Level1 ID. But there will be only one corresponding Level3 ID for each Level2 ID.

Code I've written so far:

import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2

colnames = ['A','B','C','D']
data = pd.read_csv('file.csv', names=colnames)
listofdata = list(data.A)
id = '\n'.join(listofdata[1:])  # skip the header row


def download_gsm_number(gse_id):
    # note: this builds the URL from the module-level `id` (all IDs
    # joined together), not from the gse_id argument -- which is the
    # problem described below
    url = "http://www.example.com" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    pattern = re.compile(r'''some(.*?)pattern''')
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        print "".join(row.ljust(col_width))

But this is taking all the ids at once into the URL. As I mentioned before, I need to get level2 ids from the level1 ids given as input. Further, from level2 ids, I need level3 ids. Basically, if I get just one part (getting either level2 or level3 ids) from it, I can figure out the rest.
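Since the question is really about the looping structure, here is a minimal sketch of the level1 → level2 → level3 crawl in Python 3. The URL scheme, the ID regex, and the `fetch` callable are all placeholders (the real site's markup is unknown); the point is the shape of the nested loop and the CSV output.

```python
import csv
import re

# Placeholder pattern: swap in whatever actually matches the IDs
# on the real pages.
ID_PATTERN = re.compile(r'id=(\w+)')

def parse_ids(html):
    """Pull every ID out of a page's HTML (placeholder regex)."""
    return ID_PATTERN.findall(html)

def crawl(level1_ids, fetch, out_path):
    """For each level-1 ID, fetch its page to collect level-2 IDs,
    then fetch each level-2 page for its single level-3 ID, and
    write one CSV row per (level1, level2, level3) triple."""
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for l1 in level1_ids:
            for l2 in parse_ids(fetch('http://www.example.com/' + l1)):
                l3_ids = parse_ids(fetch('http://www.example.com/' + l2))
                l3 = l3_ids[0] if l3_ids else ''
                writer.writerow([l1, l2, l3])
```

Passing `fetch` in as a parameter (e.g. a thin wrapper around `urllib.request.urlopen`) keeps the loop testable without a network connection.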

asked Oct 28 '25 17:10 by user3783999

1 Answer

I believe your answer is urllib.

It is actually as easy as going:

web_page = urllib.urlopen(url_string)

The object it returns supports the usual file-like operations, such as:

read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()

From there I would suggest using BeautifulSoup to parse the page, which is as easy as:

soup = BeautifulSoup(web_page.read())

And then you can do all the wonderful BeautifulSoup operations on it.

I would imagine Scrapy is overkill and there is a lot more overhead involved. BeautifulSoup has some great documentation, examples, and is just plain easy to use.
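To follow every link on a page, BeautifulSoup's `find_all('a')` is the robust route; here is a dependency-free sketch of the same link-extraction idea using only the stdlib `html.parser` (Python 3), in case BeautifulSoup isn't installed:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href value from the <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return all hrefs found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Each returned href can then be fetched in turn with `urlopen` and parsed the same way, which is exactly the level-by-level loop the question asks about.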

answered Oct 31 '25 07:10 by clifgray


