Recursively download resources from RESTful web service

I'd like to recursively download JSON resources from a RESTful HTTP endpoint and store these in a local directory structure, following links to related resources in the form of JSON strings containing HTTP URLs. Wget would seem to be a likely tool for the job, though its recursive download is apparently limited to HTML hyperlinks and CSS url() references.

The resources in question are Swagger documentation files similar to this one, though in my case all of the URLs are absolute. The Swagger schema is fairly complicated, but it would be sufficient to follow any string that looks like an absolute HTTP(S) URL. Even better would be to follow absolute or relative paths specified in 'path' properties.

Can anyone suggest a general purpose recursive crawler that would do what I want here, or a lightweight way of scripting wget or similar to achieve it?

Joe Lee-Moyet asked Oct 20 '14 10:10


1 Answer

I ended up writing a shell script to solve the problem:

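#!/usr/bin/env bash
# Download the Swagger 1.x index, then every API declaration it lists.
API_ROOT_URL="http://petstore.swagger.wordnik.com/api/api-docs"
OUT_DIR=$(pwd)

# download_json URL PATH: fetch URL and pretty-print it to $OUT_DIR/PATH.json
download_json() {
    echo "Downloading $1 to $OUT_DIR$2.json"
    # Create parent directories in case PATH contains more than one segment.
    mkdir -p "$(dirname "$OUT_DIR$2.json")"
    curl -sS "$1" | jq . > "$OUT_DIR$2.json"
}

download_json "$API_ROOT_URL" /api-index

# Strip the root URL from absolute paths so every entry becomes a relative path.
jq -r '.apis[].path' "$OUT_DIR/api-index.json" | while read -r API_PATH; do
    API_PATH=${API_PATH#"$API_ROOT_URL"}
    download_json "$API_ROOT_URL$API_PATH" "$API_PATH"
done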

This uses jq to extract the API paths from the index file, and also to pretty-print the JSON as it is downloaded. As webron mentions, this will probably only be of interest to people still using the 1.x Swagger schema, though I can see myself adapting this script for other problems in the future.
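To make the extraction step concrete without touching the network, here is a minimal sketch run against an inline sample. The sample index, the example.com root URL, and the helper names list_api_paths and list_http_urls are all assumptions made for illustration, not part of the script above. The second helper is one possible adaptation for the broader case in the question, where any string that looks like an absolute HTTP(S) URL should be followed.

```shell
#!/usr/bin/env bash
# Hypothetical, trimmed-down Swagger 1.x index; a real one has more fields.
SAMPLE_INDEX='{"apiVersion":"1.0.0","apis":[{"path":"http://example.com/api-docs/pet"},{"path":"/user"}]}'
API_ROOT_URL="http://example.com/api-docs"   # assumed root URL for the sample

# Same extraction as the script: one path per line, with the root URL prefix
# stripped so absolute and relative entries both end up as relative paths.
list_api_paths() {
    jq -r '.apis[].path' | while read -r api_path; do
        printf '%s\n' "${api_path#"$API_ROOT_URL"}"
    done
}

# More general variant: every string value anywhere in the document that
# starts with http:// or https://, regardless of which property holds it.
list_http_urls() {
    jq -r '.. | strings | select(startswith("http://") or startswith("https://"))'
}

echo "$SAMPLE_INDEX" | list_api_paths
echo "$SAMPLE_INDEX" | list_http_urls
```

Feeding the output of list_http_urls back into a download loop would give a crude crawler, but it would need a record of URLs already visited to avoid fetching the same resource twice.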

One problem I've found with this for Swagger is that the order of entries in our API docs is apparently not stable. Running the script several times in a row against our API docs (generated by swagger-springmvc) results in minor changes to property order. This can be partly fixed by sorting the JSON objects' property keys with jq's --sort-keys option, but this doesn't cover all cases, e.g. a model schema's 'required' property, which is a plain array of string property names.
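One possible way to cover the 'required' case as well is to sort those arrays explicitly while sorting keys. This is a sketch rather than a tested fix for swagger-springmvc output: normalize_json is a hypothetical helper, the two sample documents are made up, and the walk function needs jq 1.5 or later. It only touches 'required' properties whose value is an array, so a boolean 'required' flag is left alone.

```shell
#!/usr/bin/env bash
# normalize_json is a hypothetical helper, not part of the original script.
# Reads JSON on stdin and writes it with object keys sorted and every
# array-valued "required" property sorted too. walk needs jq 1.5 or later.
normalize_json() {
    jq --sort-keys 'walk(
        if type == "object" then
            (if (.required | type) == "array" then .required |= sort else . end)
        else . end)'
}

# Made-up example: the same model as two runs might emit it, with different
# key order and a different order inside "required".
RUN_A='{"id":"Pet","required":["name","id"]}'
RUN_B='{"required":["id","name"],"id":"Pet"}'

echo "$RUN_A" | normalize_json
echo "$RUN_B" | normalize_json
```

In the download script this filter could replace the plain `jq .` call. Other arrays are deliberately left unsorted, since their order may be meaningful.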

Joe Lee-Moyet answered Oct 22 '22 18:10
