
Creating pandas dataframe from API call

I'm building an API to retrieve Census data, but I'm having trouble formatting the output. My question is really one of two:

1) How can I improve my API call so that the output is prettier (ideally a dataframe)?

or

2) How can I manipulate the list that I currently get so that it is in a pandas dataframe?

Here is what I have so far:

import requests
import pandas as pd
import numpy as np

mytoken = "numbersandletters"
# this is my API key, so unfortunately I can't provide it

def state_data(token, variables, year = 2010, state = "*", survey = "sf1"):
    # make sure the input for state (integers) are strings
    state = [str(i) for i in state]
    variables = ",".join(variables) # squish all the variables into one string
    year = str(year)
    # make a list of all the components to construct a URL
    combine = ["http://api.census.gov/data/", year, "/", survey, "?key=", token, "&get=", variables, "&for=state:"]
    incomplete_url = "".join(combine) # the URL without the state tacked on to the end
    # now the state is tacked on to the end; one URL per state or for "*"
    complete_url = map(lambda i: incomplete_url + i, state)
    # make an API call to each complete_url
    r = map(lambda i: requests.get(i), complete_url)
    data = map(lambda i: i.json(), r)
    print r
    print data
    print type(data)
    df = pd.DataFrame(data)
    print df

An example of calling the function is this, with the output below.

state_data(token = mytoken, state = [47, 48, 49, 50], variables = ["P0010001", "P0010001"])

resulting in:

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]


[[[u'P0010001', u'P0010001', u'state'], [u'6346105', u'6346105', u'47']], 
[[u'P0010001', u'P0010001', u'state'], [u'25145561', u'25145561', u'48']], 
[[u'P0010001', u'P0010001', u'state'], [u'2763885', u'2763885', u'49']], 
[[u'P0010001', u'P0010001', u'state'], [u'625741', u'625741', u'50']]]

<type 'list'>
                         0                         1
0  [P0010001, P0010001, state]    [6346105, 6346105, 47]
1  [P0010001, P0010001, state]  [25145561, 25145561, 48]
2  [P0010001, P0010001, state]    [2763885, 2763885, 49]
3  [P0010001, P0010001, state]      [625741, 625741, 50]

Whereas the desired outcome would be:

  P0010001  P0010001  state
0 6346105   6346105   47
1 25145561  25145561  48
2 2763885   2763885   49
3 625741    625741    50

Fwiw, the analogous code in R is below. I'm translating a library I've written in R to Python:

state.data = function(token, state = "*", variables, year = 2010, survey = "sf1"){
  state = as.character(state)
  variables = paste(variables, collapse = ",")
  year = as.character(year)
  my.url = matrix(paste("http://api.census.gov/data/", year, "/", survey, "?key=", token,
                    "&get=",variables, "&for=state:", state, sep = ""), ncol = 1)

  process.url = apply(my.url, 1, function(x)   process.api.data(fromJSON(file=url(x))))
  rbind.dat = data.frame(rbindlist(process.url))
  rbind.dat = rbind.dat[, c(tail(seq_len(ncol(rbind.dat)), 1), seq_len(ncol(rbind.dat) - 1))] 
  rbind.dat
}
Nancy asked Oct 31 '22 10:10



1 Answer

You request the same field (P0010001) twice, which is redundant; because dict keys are unique, the dict-based approach below keeps only one of the duplicated columns.

All you need to do is pass a list (or any iterable) of dict objects to the pd.DataFrame constructor, and you'll have your results:

vals = [[[...]]]  # the data you provided in your example
df = pd.DataFrame(dict(zip(*v)) for v in vals)
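
To make the one-liner above concrete, here is a minimal sketch that fills in vals with the sample output from the question (unicode prefixes dropped). It is written to run on Python 3, and the intermediate name records is only for illustration:

```python
import pandas as pd

# Sample responses copied from the question: one [header, row] pair per state.
vals = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

# zip(*v) pairs each header with the value in the same position;
# dict() then keeps a single entry per unique header name.
records = [dict(zip(*v)) for v in vals]

df = pd.DataFrame(records)
print(df)
```

Because the header P0010001 appears twice, each dict ends up with only two keys, so the frame has a P0010001 column and a state column rather than three columns.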

Alternatively, assuming this is your data (a header row followed by one row per state):

data = [["P0010001","PCO0020019","state"], ["4779736","1204","01"], ["710231","53","02"], ["6392017","799","04"], ["2915918","924","05"], ["37253956","6244","06"], ["5029196","955","08"], ["3574097","1266","09"], ["897934","266","10"], ["601723","170","11"], ["18801310","4372","12"], ["9687653","1629","13"], ["1360301","251","15"], ["1567582","320","16"], ["12830632","3713","17"]]

then this works:

df = pd.DataFrame(data[1:], columns=data[0])

So you'll need to get your data into that form. All I'm doing is passing a list of rows (data[1:]) as the data and a list of column names (data[0]) as the columns.
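
One possible way to get the per-state responses from the question into that form, sketched under the assumption that every response carries the same header row: take the header from the first response and collect the remaining rows from all of them. The responses list is copied from the question's output, and responses_to_frame is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

# Parsed JSON from the question: each response is [header_row, data_row, ...].
responses = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]


def responses_to_frame(responses):
    """Stack the data rows of every response under the first response's header.

    Assumes all responses share the same header row (hypothetical helper).
    """
    header = responses[0][0]
    # Skip the header (index 0) of each response and keep every data row.
    rows = [row for resp in responses for row in resp[1:]]
    return pd.DataFrame(rows, columns=header)


df = responses_to_frame(responses)
print(df)
```

Unlike the dict-based approach, this keeps both P0010001 columns. The values are still strings, as they are in the API output.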

acushner answered Nov 04 '22 09:11
