 

Trying to work in Sheets with a large dataset

I have an automated dashboard in a sheet "my dashboard" in Google Drive.

The data for the dashboard comes from Google Analytics (GA) via the API. Most of the data I have been able to pull in using the Google Sheets GA add-on.

My source data for one of the tables in the dashboard is pretty large - too large to fit within the sheet itself.

So, with some limited scripting skills and the help of this forum and some online tutorials, I created a script in Google Apps Script that queries the GA API, returns all of the data that I need, and then puts it into a csv file in the same directory as the main dashboard.

So now, in my "dashboard" folder in Drive I have 2 files: "my dashboard", a Sheet, and "my data", a csv file. I could, if I wanted, output the results of the API call to a sheet instead; I just assumed a csv file would be more efficient.

I opened the "my data" csv file in a Google Sheet and called it "combined". Here is a sample of what the data look like:

ga:year  ga:month  ga:medium  ga:source      ga:campaign                       ga:goal1Completions
2013     5         (none)     (direct)       (not set)                         116
2013     5         (not set)  Adperio        silvercontact?dp                  0
2013     5         (not set)  Conde*it       _medium=email                     0
2013     5         (not set)  Hearst         (not set)                         0
2013     5         (not set)  stackersocial  stackersocial                     0
2013     5         12111      9591           201fHN000xRGGszT3aEAA11uIsaT000.  0
2013     5         12111      9591           201fHN00BrT.K.AY0Hvf3q1uIqgl000.  0
2013     5         12111      9591           201fHN00CK619oBE3OsBZp1uIqGX000.  0
2013     5         12111      9591           201fHN00DFXJxnUu1jx25M1uIzkC000.  0

There are ~ 65k rows of data.

Now, in my dashboard I need a table that groups and aggregates the data in the "combined" sheet (or I could use the csv somehow?). My preferred go-to formula is typically something like:

=sum(filter(ga:goal1Completions, ga:year="2015"... ))

Pulling in and querying this data is proving difficult and I'm hoping for some advice.

I cannot simply import the data into my dashboard, since I get the alert about exceeding the sheet maximum size of 200k cells. So I must rely on formulas to import the data and run the calculations each time. Here is an example of what I'm using just now:

=SUM(filter(IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!F2:F"),
IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!A2:A")=year(G$17),
IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!B2:B")=month(G$17),
IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!C2:C")="(direct)",
IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!D2:D")="(none)"))

So for each filter criterion in the formula, I must import the whole column of data and subset it.

I'm sure there must be a better way. This does actually work right now, but it's exceptionally slow.

Avenues I was thinking about:

  1. Someone on SO mentioned using the Cache Service. But how would that work, and how would I integrate it with my IMPORTRANGE() functions above?
  2. Is there any benefit to having a csv file, or should I just output the results into a Google Sheet from the outset?
  3. In the script I run to get the data, I have the data as an array before converting it into a csv file (or a Google Sheet, if that's better?). Is there anything fancy I can do here, such as querying this array directly from within the sheet? Bear in mind that my scripting skills are pretty basic.
  4. Presumably I could create a function that calls the GA API individually for each cell it sits in, returning the result in that individual cell (so, in the example above, a function that queries the GA API where year=year(G17)&month=month(G17)&medium=[some_other_cell_reference]). This option would mean more API calls but smaller data to work on. Not sure if that's a good idea or going in the wrong direction entirely.
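Option 3 above could be sketched as a plain JavaScript helper that filters and sums the GA rows inside the script, so only the small aggregated result ever has to be written to a sheet. The function name and column order below are my assumptions, taken from the sample data; this is a sketch, not a definitive implementation:

```javascript
// Hypothetical helper: aggregate the raw GA rows in the script itself
// instead of importing ~65k rows into the dashboard.
// Assumed column order (from the sample):
// [year, month, medium, source, campaign, goal1Completions]
function sumGoalCompletions(rows, year, month, medium, source) {
  return rows
    .filter(function (r) {
      return Number(r[0]) === Number(year) &&
             Number(r[1]) === Number(month) &&
             r[2] === medium &&
             r[3] === source;
    })
    .reduce(function (total, r) {
      // Sum the ga:goal1Completions column for the matching rows
      return total + Number(r[5]);
    }, 0);
}
```

In Google Apps Script this could be exposed as a custom function so a dashboard cell can call it directly, though each custom-function call still counts against Apps Script quotas, so aggregating once per refresh and writing the results in bulk would likely be faster.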

I hope that I've communicated my problem sufficiently. I need to find a more efficient way of querying my external data from the dashboard.

Doug Fir asked Jun 10 '15

2 Answers

Instead of importing one column at a time with multiple IMPORTRANGE and FILTER calls, use a single IMPORTRANGE to pull all the columns that you need in one step, and use QUERY to do the operations that you need.

In the same way that you include IMPORTRANGE inside FILTER, you can include it inside QUERY.

E.g. (for simplicity I will add only one filtering criterion):

`=QUERY(IMPORTRANGE("key","range"),"select sum(Col6) where Col1 = '"&G$17&"'")`
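Applied to all four criteria from the question, the formula might look like the sketch below. The URL, range, and column positions are taken from the question's sample data, so treat them as assumptions; note that string values get single quotes inside the query string, while numeric values (year, month) are concatenated in unquoted:

```
=QUERY(
  IMPORTRANGE("https://docs.google.com/spreadsheets/d/xxxxxxxx/edit?usp=sharing","combined!A2:F"),
  "select sum(Col6)
   where Col1 = "&YEAR(G$17)&"
     and Col2 = "&MONTH(G$17)&"
     and Col3 = '(none)'
     and Col4 = '(direct)'",
  0)
```

This replaces four separate IMPORTRANGE calls with one, so the external sheet is fetched a single time per formula instead of once per column.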
Rubén answered Sep 30 '22

Since the source data is too large for Google Sheets, you could use an alternative data store that would allow you to perform interesting queries - Fusion Tables, for instance.

You can access Fusion Tables from Google Apps Script via the FusionTables advanced service, which must be enabled first.

In Database abstraction with google apps script, Bruce McPherson introduces an abstraction layer that can have a variety of services as back-end warehouses. That takes care of the details of dealing with FusionTables for you. Better yet, you can refactor your code to first use the cDbAbstraction library with your existing Google Sheet back-end, then add DriverFusion for the full data set later.

Mogsdad answered Sep 30 '22