Retrieving data from a yaml file based on a Python list

Tags: python, pandas

I'm working in IPython. I have a YAML file and a list of [thomas] ids corresponding to entries in that file ([thomas:] is on the third row of the snippet below). The snippet is only a small part of the file; the complete file can be found here (https://github.com/108michael/congress-legislators/blob/master/legislators-historical.yaml)

- id:
    bioguide: C000858
    thomas: '00246'
    lis: S215
    govtrack: 300029
    opensecrets: N00002091
    votesmart: 53288
    icpsr: 14809
    fec:
    - S0ID00057
    wikipedia: Larry Craig
    house_history: 11530
  name:
    first: Larry
    middle: E.
    last: Craig
  bio:
    birthday: '1945-07-20'
    gender: M
    religion: Methodist
  terms:
  - type: rep
    start: '1981-01-05'
    end: '1983-01-03'
    state: ID
    district: 1
    party: Republican
  - type: rep
    start: '1983-01-03'
    end: '1985-01-03'
    state: ID
    district: 1
    party: Republican

I want to parse the file and, for every id in my list that matches a [thomas:] value, retrieve the following: [fec:] (there may be more than one, and I need all of them); [name:] [first:] [middle:] [last:]; [bio:] [birthday:]; and [terms:] [type:] [start:] [state:] [party:] (there is likely more than one term, and I need all of them). Note that in some records the fec data is not available.

1) How should I store the data? I am still relatively new to Python (my first programming language) and am not sure how to store it. Intuitively I would say a dictionary, but what matters most is ease of access and retrieval. Previously I have stored similarly nested data as CSV, which seems a bit bulky. Ideally I could just build a list of dictionaries (the data I am retrieving), one per thomas id in my list.

2) I'm not sure how to set up the for/while statements so that I only retrieve data corresponding to my list of thomas ids.

I started by writing what I expected would be the code for writing the info to CSV:

import pandas as pd
import yaml
import glob
import csv
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))

outputfile = open('sponsor_details', 'w', newline='')
outputwriter = csv.writer(outputfile)

df = df.drop_duplicates('sponsor_id')
sponsor_list = df['sponsor_id'].tolist()

with open('legislators-historical.yaml', 'r') as f:
    data = yaml.load(f)

    for sponsor in sponsor_list:
        if sponsor == data[0]['thomas']:
            x = data[0]['thomas']
            a = data[0]['name']['first']
            b = data[0]['name']['middle']
            c = data[0]['name']['last']
            d = data[0]['bio']['gender']
            e = data[0]['bio']['religion']

            for fec in data[0]['id']:
                c = fec.get('fec')    

                for terms in data[0]['id']:
                    t = terms.get('type')  
                    s = terms.get('start')  
                    state = terms.get('state')
                    p = terms.get('party')

    outputwriter.writerow([x, a, b, c, d, e, c, t, s, state, p])
    outputfile.flush()

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-48-057d25de7e11> in <module>()
     15 
     16     for sponsor in sponsor_list:
---> 17         if sponsor == data[0]['thomas']:
     18             x = data[0]['thomas']
     19             a = data[0]['name']['first']

KeyError: 'thomas'

asked Mar 13 '16 08:03

Collective Action

1 Answer

I think you could parse the YAML and load it into a data frame, normalizing the nested fields:

import pandas as pd
from yaml import safe_load

with open('legislators-historical.yaml', 'r') as f:
    df = pd.json_normalize(safe_load(f))

print(df.head())

Output:

  bio.birthday bio.gender bio.religion id.bioguide       id.fec  id.govtrack  \
0   1943-12-02          M   Protestant     A000109  [S6CO00168]       300003
1   1745-04-02          M          NaN     B000226          NaN       401222
2   1742-03-21          M          NaN     B000546          NaN       401521
3   1743-06-16          M          NaN     B001086          NaN       402032
4   1730-07-22          M          NaN     C000187          NaN       402334

   id.house_history  id.icpsr id.lis id.opensecrets id.thomas  id.votesmart  \
0              8410     29108   S250      N00009082     00011         26783
1               NaN       507    NaN            NaN       NaN           NaN
2              9479       786    NaN            NaN       NaN           NaN
3             10177      1260    NaN            NaN       NaN           NaN
4             10687      1538    NaN            NaN       NaN           NaN

     id.wikipedia  name.first name.last name.middle  \
0    Wayne Allard       Wayne    Allard          A.
1             NaN     Richard   Bassett         NaN
2             NaN  Theodorick     Bland         NaN
3   Aedanus Burke     Aedanus     Burke         NaN
4  Daniel Carroll      Daniel   Carroll         NaN

                                               terms
0  [{'party': 'Republican', 'type': 'rep', 'state...
1  [{'party': 'Anti-Administration', 'type': 'sen...
2  [{'end': '1791-03-03', 'district': 9, 'type': ...
3  [{'end': '1791-03-03', 'district': 2, 'type': ...
4  [{'end': '1791-03-03', 'district': 6, 'type': ...
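
The column name id.thomas in the output above reflects that thomas is nested under id in each record, which is presumably why data[0]['thomas'] raises the KeyError in the question (the lookup would be data[0]['id']['thomas']). If a plain dictionary is preferred over a data frame, here is a minimal sketch, not a definitive implementation. It assumes the parsed YAML is a list of dicts shaped like the snippet in the question; the inline sample data, the sponsor_list values and the helper name collect are illustrative assumptions.

```python
# Hypothetical sample mimicking what yaml.safe_load() returns for this file:
# a list of dicts, one per legislator. Only two trimmed records are shown.
data = [
    {
        'id': {'bioguide': 'C000858', 'thomas': '00246', 'fec': ['S0ID00057']},
        'name': {'first': 'Larry', 'middle': 'E.', 'last': 'Craig'},
        'bio': {'birthday': '1945-07-20', 'gender': 'M'},
        'terms': [
            {'type': 'rep', 'start': '1981-01-05', 'end': '1983-01-03',
             'state': 'ID', 'district': 1, 'party': 'Republican'},
            {'type': 'rep', 'start': '1983-01-03', 'end': '1985-01-03',
             'state': 'ID', 'district': 1, 'party': 'Republican'},
        ],
    },
    {
        # A record with no 'thomas' or 'fec' keys at all (trimmed).
        'id': {'bioguide': 'B000226'},
        'name': {'first': 'Richard', 'last': 'Bassett'},
        'bio': {'birthday': '1745-04-02', 'gender': 'M'},
        'terms': [{'type': 'sen', 'party': 'Anti-Administration'}],
    },
]

sponsor_list = ['00246', '99999']  # hypothetical list of thomas ids


def collect(records, wanted_ids):
    """Return {thomas_id: details} for records whose thomas id is wanted."""
    wanted = set(wanted_ids)  # set membership tests are fast
    result = {}
    for rec in records:
        thomas = rec.get('id', {}).get('thomas')  # 'thomas' lives under 'id'
        if thomas not in wanted:
            continue  # also skips records that have no thomas id
        name = rec.get('name', {})
        result[thomas] = {
            'fec': rec['id'].get('fec', []),  # missing fec becomes an empty list
            'first': name.get('first'),
            'middle': name.get('middle'),
            'last': name.get('last'),
            'birthday': rec.get('bio', {}).get('birthday'),
            # keep only the requested fields of every term
            'terms': [
                {k: term.get(k) for k in ('type', 'start', 'state', 'party')}
                for term in rec.get('terms', [])
            ],
        }
    return result


details = collect(data, sponsor_list)
print(details['00246']['last'], len(details['00246']['terms']))
```

Keying the result by thomas id makes lookups direct (details['00246']), and ids from the list that do not occur in the file are simply absent from the result.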

UPDATE:

The following version filters the input data so that only records containing both "thomas" and "fec" are processed:

import pandas as pd
from yaml import safe_load

def read_yaml(fn):
    with open(fn, 'r') as fi:
        return safe_load(fi)

def filter_data(data):
    # Keep only records whose 'id' mapping has both 'fec' and 'thomas' keys.
    result_data = []
    for x in data:
        ids = x.get('id', {})
        if 'fec' in ids and 'thomas' in ids:
            result_data.append(x)
    return result_data


fn = 'aaa.yaml'


df = pd.json_normalize(filter_data(read_yaml(fn)), 'terms', [['id', 'fec'], ['id', 'thomas']])
print(df.head())

df.to_csv('out.csv')

Output:

   class  district         end       party       start state type  \
0    NaN         4  1993-01-03  Republican  1991-01-03    CO  rep
1    NaN         4  1995-01-03  Republican  1993-01-05    CO  rep
2    NaN         4  1997-01-03  Republican  1995-01-04    CO  rep
3      2       NaN  2003-01-03  Republican  1997-01-07    CO  sen
4      2       NaN  2009-01-03  Republican  2003-01-07    CO  sen

                        url id.thomas     id.fec
0                       NaN     00011  S6CO00168
1                       NaN     00011  S6CO00168
2                       NaN     00011  S6CO00168
3                       NaN     00011  S6CO00168
4  http://allard.senate.gov     00011  S6CO00168

PS: as you can see, id.thomas and id.fec are repeated on every term row, so that each term can be shown as one row of the data frame.
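
To keep only the rows whose thomas id is in your own list, one option is Series.isin. The sketch below uses a small hypothetical stand-in for the frame built above, and hypothetical sponsor ids. It also assumes that the ids read from your CSV files may have been parsed as integers, in which case they would need zero-padding to match the 5-character strings in the YAML.

```python
import pandas as pd

# Hypothetical stand-in for the frame built above: one row per term,
# with the legislator's thomas id repeated on each row.
df = pd.DataFrame({
    'type': ['rep', 'rep', 'sen', 'rep'],
    'start': ['1991-01-03', '1993-01-05', '1997-01-07', '1981-01-05'],
    'state': ['CO', 'CO', 'CO', 'ID'],
    'party': ['Republican'] * 4,
    'id.thomas': ['00011', '00011', '00011', '00246'],
})

# Hypothetical sponsor ids. If yours are integers, zero-pad them so they
# match the 5-character strings stored in the YAML (an assumption here).
sponsor_ids = [246, 12345]
wanted = [str(i).zfill(5) for i in sponsor_ids]

# Boolean mask: True where the row's thomas id is one of the wanted ids.
selected = df[df['id.thomas'].isin(wanted)]
print(selected)
```

Ids in the list that have no match in the file (12345 here) are ignored without raising an error.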

UPDATE 2:

You may also want to convert the lists in 'id.fec' into columns, but I would do that in an additional data frame:

df_fec = df['id.fec'].apply(pd.Series)

print(df_fec.head())

Output:

           0          1
0  S8AR00112  H2AR01022
1  S8AR00112  H2AR01022
2  S8AR00112  H2AR01022
3  S8AR00112  H2AR01022
4  S6CO00168        NaN
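
If one row per FEC id is more convenient than one column per FEC id, DataFrame.explode (available in pandas 0.25 and later) is an alternative. This is only a sketch on a small hypothetical frame; the second thomas id is a placeholder, not a real value from the file.

```python
import pandas as pd

# Hypothetical frame in which 'id.fec' holds a list per legislator.
df = pd.DataFrame({
    'id.thomas': ['00011', '99999'],  # second id is a placeholder
    'id.fec': [['S6CO00168'], ['S8AR00112', 'H2AR01022']],
})

# explode() repeats the other columns once for every element of the list.
long_fec = df.explode('id.fec').reset_index(drop=True)
print(long_fec)
```

The long layout avoids the NaN padding that appears in the wide layout when legislators have different numbers of FEC ids.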

answered Sep 30 '22 10:09

MaxU - stop WAR against UA
