
What exactly is a wide column store?

Tags: rdbms, wide-column-store

Googling for a definition either returns results for a column-oriented DB or gives very vague definitions.

My understanding is that wide column stores consist of column families, which consist of rows and columns. Each row within a family is stored together on disk. This sounds like how row-oriented databases store their data. Which brings me to my first question:

How are wide column stores different from a regular relational DB table? This is the way I see it:

* column family        -> table
* column family column -> table column
* column family row    -> table row

This image from Database Internals simply looks like two regular tables:

[Image: Two column families, contents, and anchors (https://i.stack.imgur.com/HzMbH.png)]

My guess as to what is different comes from the fact that "multi-dimensional map" is mentioned alongside wide column stores. So here is my second question:

Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?


asked May 25 '20 20:05

Moo

2 Answers

Let's start with the definition of a wide column database.

Its architecture uses (a) persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale).
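One way to read that definition is as a nested map from (row key, column, timestamp) to a value, where cells that were never written simply do not exist. Below is a minimal Python sketch under that assumption. The class name `SparseTable`, the row keys, and the column names are all made up for illustration and do not come from any particular database.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse (row key, column, timestamp) -> value map."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        # Cells that were never written take up no space at all.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, timestamp, value):
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if omitted."""
        versions = self._rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def columns(self, row_key):
        """Columns actually present for this row, in sorted order."""
        return sorted(self._rows.get(row_key, {}))


# Hypothetical data: two rows that do not share the same set of columns.
table = SparseTable()
table.put("row-1", "contents:html", 5, "v5")
table.put("row-1", "contents:html", 6, "v6")
table.put("row-1", "anchor:home", 9, "Home")
table.put("row-2", "contents:html", 7, "v7")

print(table.get("row-1", "contents:html"))     # newest version
print(table.get("row-1", "contents:html", 5))  # a specific version
print(table.columns("row-2"))                  # row-2 has no anchor column
```

The point of the sketch is only that each row can carry its own set of columns and several timestamped versions per cell, which a fixed relational schema does not give you directly.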

A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. Every customer has the same set of columns.

A wide column database is one type of NoSQL database.

Maybe this is a better image of four wide column data models.

[Image: Wide column databases (https://i.stack.imgur.com/rDWwy.png)]

My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).

For Customer information, the first wide column database example might look like this.

Customer ID    Attribute    Value
-----------    ---------    ---------------
     100001    name         John Smith
     100001    address 1    10 Victory Lane
     100001    address 3    Pittsburgh, PA  15120

Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.

Customer ID    Attribute    Value
-----------    ---------    ---------------
     100001    fav color    blue
     100001    fav shirt    golf shirt

Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.

The Super Column model keeps the same information in a different format.

Customer ID: 100001
Attribute    Value
---------    --------------
fav color    blue
fav shirt    golf shirt

You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
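To make the relationship between the two layouts concrete, here is a small Python sketch that turns flat attribute/value rows into the grouped per-entity form. Customer 100002 and its attributes are hypothetical, added only to show that different entities can carry different attributes. The function name `group_by_entity` is also just an illustration.

```python
from collections import defaultdict

# Flat attribute/value rows: (customer_id, attribute, value)
rows = [
    (100001, "name", "John Smith"),
    (100001, "fav color", "blue"),
    (100001, "fav shirt", "golf shirt"),
    (100002, "name", "Jane Doe"),      # hypothetical second customer
    (100002, "newsletter", "yes"),     # attribute the first customer lacks
]


def group_by_entity(rows):
    """Collect each entity's attributes into its own attribute -> value map."""
    grouped = defaultdict(dict)
    for entity_id, attribute, value in rows:
        grouped[entity_id][attribute] = value
    return dict(grouped)


customers = group_by_entity(rows)
for customer_id, attributes in customers.items():
    print(customer_id, attributes)
```

Both forms hold the same information. The grouped form just makes "everything about one entity" a single lookup.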

The Column Family and Super Column Family simply give a row id to the first two models in the picture for quicker retrieval of information.


answered Sep 22 '22 09:09

Gilbert Le Blanc

Most (if not all) wide-column stores are indeed row-oriented stores, in that all parts of a record are stored together. You can think of one as a two-dimensional key-value store: the first part of the key is used to distribute the data across servers, and the second part lets you quickly find the data on the target server.

Features and behaviors differ from one wide-column store to another. Apache Cassandra, for example, lets you define how the data is sorted. Take this table for example:
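As a rough illustration of that two-part key, here is a toy Python sketch. It assumes a fixed number of servers and picks one by hashing the first key part modulo the server count. Real systems typically use more elaborate placement schemes, so treat this purely as a simplification. The class names `Node` and `Cluster` are invented for the example.

```python
import bisect
import hashlib


class Node:
    """One server: each partition keeps its second-level keys sorted."""

    def __init__(self):
        # partition key -> (sorted list of clustering keys, parallel list of values)
        self.partitions = {}

    def put(self, partition_key, clustering_key, value):
        keys, values = self.partitions.setdefault(partition_key, ([], []))
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(keys) and keys[i] == clustering_key:
            values[i] = value              # overwrite an existing entry
        else:
            keys.insert(i, clustering_key)
            values.insert(i, value)

    def get(self, partition_key, clustering_key):
        keys, values = self.partitions.get(partition_key, ([], []))
        i = bisect.bisect_left(keys, clustering_key)  # binary search on the server
        if i < len(keys) and keys[i] == clustering_key:
            return values[i]
        return None


class Cluster:
    """Routes each request to a node based on the first part of the key."""

    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def _node_for(self, partition_key):
        # Simplified placement: hash the partition key, take it modulo node count.
        digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, partition_key, clustering_key, value):
        self._node_for(partition_key).put(partition_key, clustering_key, value)

    def get(self, partition_key, clustering_key):
        return self._node_for(partition_key).get(partition_key, clustering_key)


cluster = Cluster(node_count=3)
cluster.put(1, ("US", "2020-10-01"), "a...")
cluster.put(1, ("JP", "2020-11-01"), "b...")
cluster.put(2, ("CA", "2020-10-01"), "d...")
print(cluster.get(1, ("JP", "2020-11-01")))
```

Every entry with the same first key part lands on the same node, and the sorted second part is what makes lookups within that node cheap.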

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | US      | 2020-10-01 | "a..."  |
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
|----+---------+------------+---------|

If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:

[Key 1] 1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key 2] 2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3] 3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."

Or in table form:

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 1  | US      | 2020-10-01 | "a..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
|----+---------+------------+---------|

If you change the primary key (the composite of the partitioning and clustering keys) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp DESC), so that id is the partitioning key and timestamp is the clustering key in descending order, the result would be:

[Key 1] 1:JP,2020-11-01,"b..." | 1:US,2020-10-01,"a..." | 1:US,2020-09-01,"c..."
[Key 2] 2:CA,2020-11-01,"f..." | 2:CA,2020-10-01,"d..." | 2:CA,2019-10-01,"e..."
[Key 3] 3:GB,2020-09-02,"h..." | 3:GB,2020-09-01,"g..."

Or in table form:

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-10-01 | "a..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
|----+---------+------------+---------|
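The two key choices can be mimicked in plain Python to check the resulting order. This is only a sketch of the grouping and sorting idea, not of how Cassandra stores data. The function name `layout` is made up, and its single `descending` flag applies to all clustering keys at once, which is a simplification. Partitions are listed in id order purely for readability.

```python
from operator import itemgetter

ROWS = [
    {"id": 1, "country": "US", "timestamp": "2020-10-01", "message": "a..."},
    {"id": 1, "country": "JP", "timestamp": "2020-11-01", "message": "b..."},
    {"id": 1, "country": "US", "timestamp": "2020-09-01", "message": "c..."},
    {"id": 2, "country": "CA", "timestamp": "2020-10-01", "message": "d..."},
    {"id": 2, "country": "CA", "timestamp": "2019-10-01", "message": "e..."},
    {"id": 2, "country": "CA", "timestamp": "2020-11-01", "message": "f..."},
    {"id": 3, "country": "GB", "timestamp": "2020-09-01", "message": "g..."},
    {"id": 3, "country": "GB", "timestamp": "2020-09-02", "message": "h..."},
]


def layout(rows, partition_key, clustering_keys, descending=False):
    """Group rows by partition key, then sort each group by the clustering keys."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[partition_key], []).append(row)
    sort_key = itemgetter(*clustering_keys)  # ISO dates sort correctly as strings
    return {
        pk: sorted(group, key=sort_key, reverse=descending)
        for pk, group in sorted(partitions.items())
    }


def messages(partition):
    return [row["message"] for row in partition]


# Partition key (id), clustering key (country, timestamp), ascending.
by_country_time = layout(ROWS, "id", ["country", "timestamp"])
# Partition key (id), clustering key (timestamp), descending.
by_time_desc = layout(ROWS, "id", ["timestamp"], descending=True)

for pk in by_country_time:
    print(pk, messages(by_country_time[pk]), messages(by_time_desc[pk]))
```

With the descending timestamp key, the newest message in each partition comes first, which is the usual reason for choosing that order.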

answered Sep 24 '22 09:09

Lewis Diamond
