In BigQuery, access for users/roles (and authorized views) is granted at the dataset level, not at the view/table level. The challenge I want to address is how to manage access control in BigQuery when I have hundreds of tables and views and many different roles/departments that need access both to views shared across all departments and to views restricted to a particular role/department.
Example: let's say I have a source dataset with source tables A–D, and for each table three views (1–3) exposing different fields based on the sensitivity of the data. I also have three roles (Blue, Green, Red). If I could manage access at the table level, it would look like this:
View: roles
A1: Blue, Red
A2: Red
A3: Red
B1: Blue, Green, Red
B2: Green, Red
B3: Red
C1: Green, Red
C2: Green, Red
C3: Red
D1: Red
D2: Red
D3: Red
Given these requirements, I can't create datasets based only on sensitivity (1–3) or source (A–D) and manage access on that basis. The only solution I can see that meets the requirements is generating one dataset per role. This could be done manually if the number of roles and views were small, but with 10+ roles and 50+ views it becomes challenging.
The only solution I can come up with is a CI/CD setup (Cloud Build) with one or more files defining datasets (i.e. roles), dependencies, and DDL statements, plus a script/program that iterates through the files, generates the views, and grants access (authorized views) to the source dataset. Example file:
{
  "roles": ["crm_analyst", "admin", "customer_service_agent"],
  "ddl": "CREATE VIEW `myproject.{role}.newview` AS SELECT column_1, column_2, column_3 FROM myproject.mydataset.myview",
  "dependencies": "myproject.mydataset.myview"
}
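The per-role expansion step such a script would perform can be sketched with plain string templating; a minimal sketch (the `render_views` helper is illustrative, not part of any existing tool):

```python
import json

def render_views(definition: dict) -> dict:
    """Expand one definition file into a DDL statement per role.

    The '{role}' placeholder in the DDL is replaced with each role name,
    so every role gets its own copy of the view in its own dataset.
    """
    return {
        role: definition["ddl"].replace("{role}", role)
        for role in definition["roles"]
    }

definition = json.loads("""
{"roles": ["crm_analyst", "admin"],
 "ddl": "CREATE VIEW `myproject.{role}.newview` AS SELECT column_1 FROM myproject.mydataset.myview",
 "dependencies": "myproject.mydataset.myview"}
""")

for role, ddl in render_views(definition).items():
    print(role, "->", ddl)
```

Each rendered statement could then be executed with the BigQuery client, and the new view added to the source dataset's authorized views.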
How do other companies solve this? Large banks have migrated to BigQuery, and they must have many departments and datasets with different sensitivity levels.
I ended up writing a Python script that reads view definitions from JSON files, then generates datasets and views and grants the correct access rights. The solution is a bit rough: it could use proper dependency mapping (for when a view queries another view) instead of the current approach, which iterates over the views until either all views are generated or the script can't generate any more (broken dependencies). The script generates two datasets per group, one with READER access (suffix '_ro') and one with WRITER access (suffix '_rw'), so that views generated by the data team can't be modified while the group still gets a sandbox. The group should be an e-mail group, and the dataset names are derived from the local part of the e-mail address. The script is executed by Google Cloud Build, triggered by a push to our GitHub repo.
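The iterate-until-no-progress generation described above can be sketched as repeated passes over the pending views; this is an illustrative reimplementation of the idea, not the script itself (a real run would issue CREATE VIEW statements instead of just ordering names):

```python
def generate_in_order(views: dict, existing: set) -> list:
    """Return a creation order for views, given each view's dependencies.

    views: mapping of view name -> set of view/table names it queries.
    existing: source tables/views that already exist.
    Each pass creates every view whose dependencies are all satisfied;
    if a pass makes no progress, the remaining views have broken
    dependencies and an error is raised.
    """
    created = set(existing)
    order = []
    pending = dict(views)
    while pending:
        ready = [name for name, deps in pending.items() if deps <= created]
        if not ready:  # no view became creatable this pass
            raise ValueError(f"broken dependencies: {sorted(pending)}")
        for name in ready:
            order.append(name)
            created.add(name)
            del pending[name]
    return order

views = {
    "analysts_ro.view_test": {"shared_views.test_view"},
    "shared_views.test_view": {"mydataset.base_table"},
}
print(generate_in_order(views, {"mydataset.base_table"}))
# → ['shared_views.test_view', 'analysts_ro.view_test']
```

This is effectively a topological sort done by fixed-point iteration, which is simple and good enough for a few hundred views.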
Example view definition (path: views/view_test.json):
{
  "groups": ["[email protected]", "[email protected]"],
  "sql": "SELECT * FROM `{project}.shared_views.test_view`"
}
Generates the following datasets (access) and views:
analysts_ro ([email protected]:READER):
- view_test
analysts_rw ([email protected]:WRITER):
(empty)
developers_ro ([email protected]:READER):
- view_test
developers_rw ([email protected]:WRITER):
(empty)
shared_views (analysts_ro.view_test:None, developers_ro.view_test:None):
- test_view
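The naming scheme in the listing above (local part of the group address, plus the '_ro'/'_rw' suffixes) can be sketched as follows; the group address used here is a made-up example:

```python
def dataset_names(group_email: str) -> tuple:
    """Derive the reader/writer dataset names for an e-mail group.

    Uses the local part of the address: '_ro' for the READER dataset
    (views managed by the data team) and '_rw' for the WRITER dataset
    (the group's sandbox).
    """
    local_part = group_email.split("@", 1)[0]
    return (f"{local_part}_ro", f"{local_part}_rw")

print(dataset_names("analysts@example.com"))
# → ('analysts_ro', 'analysts_rw')
```

The script would then create both datasets and attach the group as a READER access entry on the '_ro' dataset and a WRITER entry on the '_rw' dataset.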
I made the Python script available on GitHub as open source, as part of datahem; feel free to clone, improve, and use it for your own purposes.