
GitHub API: How to improve very inefficient polling on activity events?

The GitHub API provides activity events for users, orgs and repos. The APIs support pagination up to 10 pages, for a total of 300 events with 30 events per page. ETag headers are used to keep polling within the rate limit. I am trying to poll this API to get the latest activity. However, this scheme is very inefficient due to the design supported by GitHub, as mentioned. Let's say I make a request for page 1 with

https://api.github.com/users/me/events/orgs/my-org?page=1

and I will get an ETag for this page. Now I move on to page 2 and request

https://api.github.com/users/me/events/orgs/my-org?page=2

and will get the ETag for this second page. Similarly, I can pull events from all 10 supported pages.
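To make the request pattern concrete, here is a minimal Python sketch of such a conditional fetch using only the standard library. The names `build_request` and `fetch_page` are made up for illustration, authentication is omitted, and it assumes the server answers a matching `If-None-Match` header with a 304, which `urllib` surfaces as an `HTTPError`.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# URL taken from the question; "me" and "my-org" are placeholders.
EVENTS_URL = "https://api.github.com/users/me/events/orgs/my-org"


def build_request(page: int, etag: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for one page; conditional if an ETag is known."""
    headers = {}
    if etag is not None:
        # Ask the server to send the body only if the page has changed.
        headers["If-None-Match"] = etag
    return urllib.request.Request(f"{EVENTS_URL}?page={page}", headers=headers)


def fetch_page(page: int, etag: Optional[str] = None):
    """Return (events, etag); events is None when the page is unchanged (304)."""
    try:
        with urllib.request.urlopen(build_request(page, etag)) as resp:
            return json.load(resp), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag  # not modified: keep using the old ETag
        raise
```

Calling `fetch_page(1)` once and then `fetch_page(1, etag)` on later rounds would correspond to the page-1 polling described here.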

Now let's say that some activity took place on my org's GitHub account, and assume that only one new event occurred. When I poll page 1 with its ETag, the API returns the changed page with the new event included. Polling page 2 with its previous ETag also returns a changed page, but the only change there is that the event that was previously last on page 1 has now moved to the top of page 2. This "shift-to-next" happens for every page. There is NO way to find out the number of NEW events that took place. The only solution is to keep polling page 1 to get the latest events. However, this approach has a serious flaw, explained below:

The situation gets worse when the number of new events between my poll rounds is greater than 30 (the maximum number of items on one page). In this case, the events older than the latest 30 new events slip directly to page 2. If I only poll page 1, I will lose the events that slipped to page 2. The only solution that comes to mind is to keep a cache of all events and then sweep all pages. This is, however, a very inefficient and undesirable way to do it, and it defeats the purpose of an event notification API.

I hope some GitHub dev can answer this.

auny asked Jun 25 '13 12:06



1 Answer

Since each event has an ID and events are ordered in the response, you only need to remember the ID of the first event in the previous response (not all of the events).

So, the way I would do it is:

Initial fetch:

  1. fetch all event pages (pages from 1 to 10)
  2. store the ETag of the first page
  3. store the ID of the first event in the first page

Subsequent fetches:

  1. conditionally fetch the first page of events with the stored ETag
  2. if a 304 Not Modified response is received, then there are no new events, so terminate
  3. if a 200 OK response is received, then there are new events. Fetch pages from 1 to 10 sequentially until you reach the page that contains the event whose ID equals the stored ID. All events fetched before that event are new and should be processed. The number of new events is thus discovered incrementally, and you fetch only the pages you need, no more.
  4. store the ETag of the first page
  5. store the ID of the first event in the first page
  6. wait for some time and then go to step 1
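
The page walk in step 3 could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `collect_new_events` and the `fetch_page` callback are hypothetical names, each event is assumed to be a dict with an `"id"` key, and pages are assumed to list events newest first. The callback is injected so the logic can be tried without touching the network.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Dict[str, object]


def collect_new_events(
    fetch_page: Callable[[int], List[Event]],
    last_seen_id: Optional[str],
    max_pages: int = 10,
) -> Tuple[List[Event], bool]:
    """Walk pages newest-first until the stored event ID shows up.

    Returns (new_events, found_marker). found_marker is False when the
    stored ID was on none of the pages, so older events may have been missed.
    """
    new_events: List[Event] = []
    seen_ids = set()
    for page in range(1, max_pages + 1):
        events = fetch_page(page)
        if not events:
            break  # ran out of events before reaching the stored ID
        for event in events:
            if event["id"] == last_seen_id:
                return new_events, True
            if event["id"] not in seen_ids:  # skip repeats caused by page shifts
                seen_ids.add(event["id"])
                new_events.append(event)
    return new_events, False


# Usage with fake in-memory pages (3 events per page instead of 30).
fake_pages = {
    1: [{"id": "107"}, {"id": "106"}, {"id": "105"}],
    2: [{"id": "104"}, {"id": "103"}, {"id": "102"}],
    3: [{"id": "101"}, {"id": "100"}],
}
requested = []


def fake_fetch(page: int) -> List[Event]:
    requested.append(page)  # record which pages were actually requested
    return fake_pages.get(page, [])


new_events, found = collect_new_events(fake_fetch, last_seen_id="103")
new_ids = [e["id"] for e in new_events]
```

With the stored ID `"103"`, the walk stops on page 2, so page 3 is never requested. The `found` flag separates the normal case from the one where more new events arrived than the 10 pages can hold.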
Ivan Zuzak answered Sep 30 '22 14:09
