
how to scrape all files in a catalog series from the national archives (archives.gov) with R

i'm looking for a programmatic way to scrape all available files for a data file series on archives.gov with R. archives.gov appears to use javascript. my goal is to capture the url of each file available, as well as the file's name.

the home mortgage disclosure act data file series has 153 entries

in a browser, i can click the "export" button and get a csv file with this structure:

first_exported_record <-    
    structure(list(resultType = structure(1L, .Label = "fileUnit", class = "factor"), 
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"), 
    date = structure(1L, .Label = "1981 - 2013", class = "factor"), 
    documentIndex = 1L, from.0 = structure(1L, .Label = "Series: Home Mortgage Disclosure Data Files, 1981 - 2013", class = "factor"), 
    from.1 = structure(1L, .Label = "Record Group 82: Records of the Federal Reserve System, 1913 - 2003", class = "factor"), 
    location.locationFacility1.0 = structure(1L, .Label = "National Archives at College Park - Electronic Records(RDE)", class = "factor"), 
    location.locationFacility1.1 = structure(1L, .Label = "National Archives at College Park", class = "factor"), 
    location.locationFacility1.2 = structure(1L, .Label = "8601 Adelphi Road", class = "factor"), 
    location.locationFacility1.3 = structure(1L, .Label = "College Park, MD, 20740-6001", class = "factor"), 
    location.locationFacility1.4 = structure(1L, .Label = "Phone: 301-837-0470", class = "factor"), 
    location.locationFacility1.5 = structure(1L, .Label = "Fax: 301-837-3681", class = "factor"), 
    location.locationFacility1.6 = structure(1L, .Label = "Email: [email protected]", class = "factor"), 
    naId = 18491490L, title = structure(1L, .Label = "Non-restricted Ultimate Loan Application Register (LAR) Data, 2012", class = "factor"), 
    url = structure(1L, .Label = "https://catalog.archives.gov/id/18491490", class = "factor")), .Names = c("resultType", 
    "creators.0", "date", "documentIndex", "from.0", "from.1", "location.locationFacility1.0", 
    "location.locationFacility1.1", "location.locationFacility1.2", 
    "location.locationFacility1.3", "location.locationFacility1.4", 
    "location.locationFacility1.5", "location.locationFacility1.6", 
    "naId", "title", "url"), class = "data.frame", row.names = c(NA, 
    -1L))

and then behind each of those 153 entries, there are file unit pages with multiple files available for download. for example, that first exported record points to:

https://catalog.archives.gov/id/18491490

but both of these pages appear to be javascript, so i'm not sure if i need something like phantomjs or selenium, or if there's some trick to export the catalog with simpler tools like rvest?

at the point that i know each file's url, i can download them without issue:

tf <- tempfile()
download.file( "https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false" , tf , mode = 'wb' )

and this file name would be

"Technical Specifications Summary, 2012 Ultimate LAR."

thanks!

update:

the specific question is how do i programmatically get from the first link (the series ID) to the titles and urls of all files available for download within the series. i tried rvest and httr commands with nothing useful to show for it.. :/ thanks

Anthony Damico asked Jan 26 '18 14:01


3 Answers

There's no need to load and parse the page here since the records are loaded via a simple Ajax request.

To see the requests, simply monitor them via devtools and select the first one returning some JSON. Then use the jsonlite library to request the same URL with R. It will automatically parse the result.

To list all the files (description + URL) for the 153 entries:

library(jsonlite)
options(timeout = 600)  # download timeout in seconds (R's default is 60)

# search request: every file unit whose parent series is NAID 2456161
json <- fromJSON("https://catalog.archives.gov/OpaAPI/iapi/v1?action=search&f.level=fileUnit&f.parentNaId=2456161&q=*:*&offset=0&rows=10000&tabType=all")
ids <- json$opaResponse$results$result$naId

for (id in ids) {  # each file unit
    json <- fromJSON(sprintf("https://catalog.archives.gov/OpaAPI/iapi/v1/id/%s", id))
    records <- json$opaResponse$content$objects$objects$object

    # `[[` works whether jsonlite returned a data frame or a single-object list
    descriptions <- records[["description"]]
    urls <- records[["@renditionBaseUrl"]]

    for (r in seq_along(urls)) {  # each file; skips file units with no objects
        # prints the file description and URL
        print(descriptions[r])
        print(urls[r])
    }
}
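
If you would rather keep the results than print them, the same loop can fill a data frame. This is only a sketch: extract_files() and list_series_files() are hypothetical helper names, and it assumes the JSON layout used in the loop above (a "description" and an "@renditionBaseUrl" field per object), which I have not checked for every file unit.

```r
# Sketch: gather description + URL pairs into one data frame instead of printing.
# extract_files() and list_series_files() are hypothetical helper names.

# Pull the two fields out of one file unit's "object" element.
# `[[` works whether jsonlite returned a data frame (several objects)
# or a plain named list (a single object).
extract_files <- function(records, id) {
    urls <- records[["@renditionBaseUrl"]]
    if (length(urls) == 0) {
        return(NULL)  # file unit without downloadable objects
    }
    descriptions <- records[["description"]]
    if (is.null(descriptions)) {
        descriptions <- rep(NA_character_, length(urls))
    }
    data.frame(
        naId = id,
        description = as.character(descriptions),
        url = as.character(urls),
        stringsAsFactors = FALSE
    )
}

# Request each file unit (ids comes from the search request) and stack the rows.
list_series_files <- function(ids) {
    per_unit <- lapply(ids, function(id) {
        json <- jsonlite::fromJSON(
            sprintf("https://catalog.archives.gov/OpaAPI/iapi/v1/id/%s", id)
        )
        extract_files(json$opaResponse$content$objects$objects$object, id)
    })
    do.call(rbind, per_unit)  # NULL entries are dropped by rbind
}

# Usage (needs network access and the jsonlite package):
# all_files <- list_series_files(ids)
```

The resulting url column can then be passed to download.file() one row at a time, with description serving as the file's name.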
Florent B. answered Nov 15 '22 02:11


If you are familiar with using httr, you might consider using the National Archives Catalog API to interact with their server. As I read that web site there is a way to query and request data directly. This way you would not have to scrape the web page.

I played around in the sandbox without an api key and got this far translating your webpage query to the api query:

https://catalog.archives.gov/api/v1?&q=*:*&resultTypes=fileUnit&parentNaId=2456161

Unfortunately, that doesn't recognize the parentNaId field name...perhaps that's a result of not having permission without an api key. In any case, I don't know R myself, so you'll have to work out how to use all of this in httr.

I hope this helps a bit.

Breaks Software answered Nov 15 '22 01:11


from the folks who wrote the API over at National Archives and Records Administration..

Hi Anthony,

There's no need to scrape; NARA's catalog has an open API. If I understand right, you want to download all of the media files (what our catalog calls "objects") in all the file units in the series "Home Mortgage Disclosure Data Files" (NAID 2456161).

The API allows fielded search on any field in the data, so rather than use a search parameter like "parentNaId", the best way to do that query is to search on that specific field, i.e., bring back all records where the parent series NAID is 2456161. If you open up one of those file units to look at the data by using the identifier (e.g. https://catalog.archives.gov/api/v1?naIds=2580657), you can see that the field containing the parent series is called "description.fileUnit.parentSeries". So, all of your file units and their objects will be in https://catalog.archives.gov/api/v1?description.fileUnit.parentSeries=2456161. If you want back just the objects without the file unit records, you can add the "&type=object" parameter. Or, if you want the file unit metadata, you can instead restrict the results with "type=description", since every file unit record also contains all the data for its child objects. It looks like there are over 1000 results, so you will also need to use the "rows" parameter to ask for all the results in one query, or paginate with the "offset" parameter and smaller "rows" values, since the default response is only the first 10 results.
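
For what it's worth, here is a rough R sketch of the pagination described above. It only builds the query URLs from the parameters named in this message ("description.fileUnit.parentSeries", "type", "rows", "offset"). build_page_urls() is a hypothetical helper, the page size of 100 is arbitrary, and the total of 1200 in the usage line is a placeholder, since the exact result count isn't given here.

```r
# Sketch: build one query URL per page of results.
# build_page_urls() is a hypothetical helper, not part of any package.
build_page_urls <- function(parent_naid, total, rows = 100, type = "object") {
    # offsets 0, rows, 2 * rows, ... up to the last page that still holds results
    offsets <- as.integer(seq(0, total - 1, by = rows))
    sprintf(
        "https://catalog.archives.gov/api/v1?description.fileUnit.parentSeries=%s&type=%s&rows=%d&offset=%d",
        as.character(parent_naid), type, as.integer(rows), offsets
    )
}

# Usage: 1200 is a placeholder for the real result count.
page_urls <- build_page_urls("2456161", total = 1200)

# Each URL could then be requested in turn (needs network access), e.g.
# pages <- lapply(page_urls, jsonlite::fromJSON)
```

The layout of the JSON that comes back isn't shown in this thread, so parsing each page is left out of the sketch.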

Within the object metadata, you will find the fields with the URLs you can use to download the media, as well as other metadata that may be of interest. For example, note that some of these objects are considered electronic records, as in the original archival records from agencies, while others are NARA-created technical documentation. This is noted in the "designator" field.

Let me know if you still have any questions.

Thanks! Dominic

Anthony Damico answered Nov 15 '22 02:11