Extracting data from a URL result with special formatting

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

where modifying the inputs, limit and query, generates the data I want. limit is the maximum number of terms returned and query is the seed term.

The URL provides text result formatted in this way:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});

I'd like to write a Python script that takes two arguments (the limit number and the query seed), fetches the data online, parses the result, and returns a list with the new terms, ['newterm1','newterm2'] in this case.

I'd love some help, especially with the URL fetching since I have never done this before.

asked by datayoda on Dec 12 '22

2 Answers

It sounds like you can break this problem up into several subproblems.

Subproblems

There are a handful of problems that need to be solved before composing the completed script:

  1. Forming the request URL: Creating a configured request URL from a template
  2. Retrieving data: Actually making the request
  3. Unwrapping JSONP: The returned data appears to be JSON wrapped in a JavaScript function call
  4. Traversing the object graph: Navigating through the result to find the desired bits of information

Forming the request URL

This is just simple string formatting.

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
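
If the seed term can contain spaces or other characters that are not URL-safe, you may also want to encode it first, for example with urllib.parse.quote_plus. A small sketch (the multi-word seed here is only an illustration):

from urllib.parse import quote_plus

seed = 'multi word seed'  # illustrative seed term
url = url_template.format(limit=2, seedterm=quote_plus(seed))  # spaces become '+'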

Python 2 Note

You will need to use the string formatting operator (%) here.

url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')

Retrieving data

You can use the built-in urllib.request module for this.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

This returns a file-like object called data. You can also use a with-statement here:

with urllib.request.urlopen(url) as data:
    # do processing here
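
If you also want to handle network failures, and account for the fact that read() returns bytes in Python 3, a sketch like this works:

import sys
import urllib.request
import urllib.error

try:
    with urllib.request.urlopen(url) as data:  # url from the previous section
        result = data.read().decode('utf-8')   # read() returns bytes; decode to str
except urllib.error.URLError as exc:
    print('Could not request data from server:', exc, file=sys.stderr)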

Python 2 Note

Import urllib2 instead of urllib.request.

Unwrapping JSONP

The result you pasted looks like JSONP. Given that the wrapping function that is called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this method call out.

result = data.read().decode('utf-8')  # decode bytes to str so the string methods below work

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
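
If you would rather not hard-code the wrapper name, a regular expression can strip any JSONP-style callback. A small sketch using the standard re module:

import re

# capture everything between the outermost 'name(' and the trailing ');'
match = re.match(r'^[\w.]+\((.*)\);?\s*$', result, re.DOTALL)
if match:
    result = match.group(1)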

Parsing JSON

The stripped result string is just JSON data. Parse it with the built-in json module.

import json

result_object = json.loads(result)
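
As a quick sanity check you can look at the status field before traversing further; your sample response contains status:'ok', so anything else likely indicates an error:

if result_object.get('status') != 'ok':
    raise ValueError('unexpected response status: %r' % result_object.get('status'))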

Traversing the object graph

Now, you have a result_object that represents the JSON response. The object itself will be a dict with keys like version, reqId, and so on. Based on your question, here is what you would need to do to create your list.

# Get the rows in the table, then get the second column's value for
# each row
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
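
For the sample response in your question, each row has the shape shown below, which is why index 1 (the second cell) holds the term:

rows = result_object['table']['rows']
# each row: {'c': [{'v': <score>, 'f': '0.99'}, {'v': '<term>'}]}
# so row['c'][1]['v'] is the term itself
terms = [row['c'][1]['v'] for row in rows]  # ['newterm1', 'newterm2'] for the sample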

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import urllib.error
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes; decode to str before parsing
            result = data.read().decode('utf-8')
    except urllib.error.URLError:
        print('Could not request data from server', file=sys.stderr)
        exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        exit(E_INVALID_PARAMS)

    exit(main(limit, seedterm))
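
Assuming you save the script as relatedqueries.py (an arbitrary name for this example), you would run it like this, and each related term is printed on its own line:

python3 relatedqueries.py 2 seedterm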

Python 2.7 version

#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

    try:
        # urllib2's response object is not a context manager in Python 2,
        # so read and close it explicitly
        data = urllib2.urlopen(url)
        result = data.read()
        data.close()
    except urllib2.URLError:
        sys.stderr.write('Could not request data from server\n')
        exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        exit(E_INVALID_PARAMS)

    exit(main(limit, seedterm))
answered by Wesley on Apr 27 '23


I didn't fully understand your problem, because from the output you posted it seems you are using the Visualization API (it's the first time I've heard of it, by the way).

That said, if you are just looking for a way to fetch data from a web page, you can use urllib2; it only retrieves the data, so if you want to parse what you retrieve you will have to use a more appropriate library like BeautifulSoup.

If you are dealing with another kind of web service (RSS, Atom, RPC) rather than web pages, you can find plenty of Python libraries that handle each of those services well.

import urllib2
from BeautifulSoup import BeautifulSoup

result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()

soup = BeautifulSoup(htmltext, convertEntities="html")
# you can parse your data now; check the BeautifulSoup API.
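
Once the page is parsed you can navigate the soup however you like; for example, listing all link targets would look roughly like this (which tags you look for depends on the page you fetched):

for link in soup.findAll('a'):
    print link.get('href')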
answered by mouad on Apr 27 '23