I'm still not very clear about the difference between a column-based relational database and a column-based NoSQL database.
Google BigQuery supports SQL-like queries, so how can it be NoSQL?
The column-based relational databases I know of are InfoBright, Vertica and Sybase IQ.
The column-based NoSQL databases I know of are Cassandra and HBase.
The following article about Redshift starts by saying "NoSQL" but ends with PostgreSQL (which is relational) being used: http://nosqlguide.com/column-store/intro-to-amazon-redshift-a-columnar-nosql-database/
A few things to clarify here, mostly about Google BigQuery.
BigQuery is a hybrid system that stores data in columns, but it reaches into the NoSQL world with additional features such as the record type and the nested feature. You can also have a 2 MB STRING column in which you can store a raw document, such as a JSON document. See the other data formats and limits that apply. You can also write User Defined Functions in JavaScript; for example, you can paste in a JavaScript library that does NLP.
With all these storage capabilities you can use, for example, the JSON functions to query a document stored in one of the columns. This works as schemaless storage: you never defined the structure of the JSON document for that column, you just stored it as JSON. Got it?
A basic example: extract the reason key from the meta column, which holds a JSON document, and use the CONTAINS operator to count how many users have the word "unsubscribed" in that key:
SELECT
SUM(IF(JSON_EXTRACT_SCALAR(meta,'$.reason') contains 'unsubscribed',1,0))
FROM ...
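To make the logic of that query concrete, here is a minimal sketch of the same counting in plain Python. The rows and the count_reason_contains helper are hypothetical, made up for illustration; this is not BigQuery code, and it assumes a plain substring match on the reason key.

```python
import json

# Hypothetical rows: each one has a `meta` column holding a raw JSON string.
rows = [
    {"user_id": 1, "meta": '{"reason": "user unsubscribed from list"}'},
    {"user_id": 2, "meta": '{"reason": "bounced"}'},
    {"user_id": 3, "meta": '{"source": "import"}'},  # no "reason" key at all
    {"user_id": 4, "meta": '{"reason": "unsubscribed"}'},
]

def count_reason_contains(rows, needle):
    """Count rows whose meta JSON has a string `reason` value containing `needle`."""
    total = 0
    for row in rows:
        # No schema was declared for `meta`; the structure is discovered at query time.
        reason = json.loads(row["meta"]).get("reason")
        if isinstance(reason, str) and needle in reason:
            total += 1
    return total

unsubscribed_count = count_reason_contains(rows, "unsubscribed")
print(unsubscribed_count)
```

Rows without a reason key are skipped rather than raising an error, which is the practical meaning of "no schema" here.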
On the other hand, you have table-wildcard querying. This is needed if your rows are spread across many tables. Table wildcard functions are a cost-effective way to query data from a specific set of tables. When you use a table wildcard function, BigQuery only accesses and charges you for the tables that match the wildcard. So it is advisable to store data in tables with the same structure, split into separate tables per set time frame, e.g. daily or monthly tables.
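As a rough illustration of the daily-table layout, the sketch below lists the table names a date range would cover. The events_ prefix, the YYYYMMDD suffix, and the daily_table_names helper are assumptions chosen for this example; it only mimics the idea of a date-range wildcard and does not call BigQuery.

```python
from datetime import date, timedelta

def daily_table_names(prefix, start, end):
    """Return one table name per day in [start, end], e.g. events_20140330."""
    names = []
    day = start
    while day <= end:
        names.append(f"{prefix}{day:%Y%m%d}")
        day += timedelta(days=1)
    return names

# Only these four tables would need to be read for this range;
# tables for other days are never touched.
tables = daily_table_names("events_", date(2014, 3, 30), date(2014, 4, 2))
print(tables)
```

The narrower the range, the fewer tables are matched, which is why splitting by time frame keeps the cost of a query down.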
We should not forget that BigQuery is append-only by design, so you cannot update old records; there is no UPDATE language construct (Update: there are now DML language constructs for some update/delete operations). Instead you need to append a new record, and your queries must be written so that they always work with the latest version of your data. If your system is event-driven, then this is very simple, because each event is just appended to BQ. But if a user updates their profile, you need to store the profile again; you cannot update the old row. You need a version/date column that tells you which version is the most recent, and your queries must first obtain the most recent version of each row and then apply the logic.
You can use something like OVER/PARTITION BY on that field and keep the most recent value, seqnum=1.
This returns, from profile, the latest email for each user_id, as determined by the most recent entry in the timestamp column.
SELECT email
FROM
  (SELECT email,
          row_number() over (partition BY user_id
                             ORDER BY TIMESTAMP DESC) seqnum
   FROM [profile]
  )
WHERE seqnum=1
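The "keep only the newest row per key" idea behind that query can also be sketched in plain Python. This is a dictionary-based equivalent, not the window function itself; the sample rows, the integer timestamps, and the latest_per_user helper are hypothetical.

```python
# Hypothetical append-only profile rows; a larger timestamp means a newer version.
profile = [
    {"user_id": 1, "email": "a@old.example", "timestamp": 100},
    {"user_id": 2, "email": "b@example.com", "timestamp": 110},
    {"user_id": 1, "email": "a@new.example", "timestamp": 200},  # user 1 updated the profile
]

def latest_per_user(rows):
    """Keep, for each user_id, the row with the greatest timestamp (the seqnum=1 row)."""
    latest = {}
    for row in rows:
        current = latest.get(row["user_id"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["user_id"]] = row
    return latest

latest_emails = {uid: row["email"] for uid, row in latest_per_user(profile).items()}
print(latest_emails)
```

The old row for user 1 stays in storage untouched; it is just never selected, which is how an append-only table can still answer "current state" questions.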
First, remember that NoSQL is commonly read as an abbreviation of "Not Only SQL", so there is no contradiction in a system having both a SQL interface and some NoSQL features. Having said that, both Redshift and BigQuery have their foundations in column-based databases. Redshift is based on ParAccel, a classic column-based RDBMS targeted at data warehousing, and BigQuery is based on Google's internal column-based data processing technology called Dremel.