
Duplicate rows/columns for the same primary key in Cassandra

Tags: cassandra, cql

I have a table/column family in Cassandra 3.7 with sensor data.

CREATE TABLE test.sensor_data (
    house_id int,
    sensor_id int,
    time_bucket int,
    sensor_time timestamp,
    sensor_reading map<int, float>,
    PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
);

Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.

cqlsh:test> select * from sensor_data;

 house_id | sensor_id | time_bucket | sensor_time                     | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}

I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.

Regardless, this shouldn't be possible. I have no way of connecting to this cluster with the legacy cassandra-cli; perhaps that would have told me something I can't see using cqlsh.

So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.

Thanks.

Update:

This is what I've tried, from comments, answers and other sources:
* selecting using blobAsBigint gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling Thrift, is possible, but reading the table isn't; it's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
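For context on what that big integer is: a Cassandra `timestamp` is stored as a signed 64-bit big-endian count of milliseconds since the Unix epoch, which is the raw value that `blobAsBigint(timestampAsBlob(sensor_time))` exposes. A minimal Python sketch of that encoding (function name is mine, not from any driver):

```python
import struct
from datetime import datetime, timezone

def decode_timestamp_blob(blob: bytes) -> datetime:
    """Decode an 8-byte big-endian Cassandra timestamp blob to a UTC datetime."""
    (millis,) = struct.unpack(">q", blob)  # signed 64-bit, milliseconds since epoch
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)

# The raw bigint for 2016-01-02 03:04:05 UTC, as blobAsBigint would show it
blob = struct.pack(">q", 1451703845000)
print(decode_timestamp_blob(blob))  # 2016-01-02 03:04:05+00:00
```

So if blobAsBigint returns the same number for both rows, the stored clustering values really are byte-identical down to the millisecond.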

Andreas Wederbrand asked Oct 01 '16




1 Answer

I don't expect to see nanoseconds in a timestamp field, and additionally I'm of the impression they're not fully supported. Try this:

SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;

I WAS able to replicate it by inserting the rows via an integer:

INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);

This makes sense, because I would suspect one of your writers is inserting the timestamp as a bigint, and the other is likely actually using a datetime.
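This would also explain why the rows look identical in the output: two timestamp values one millisecond apart render as the same string once formatted at second precision. A small Python illustration of that collapse (the formatting just mimics the table below; it is not cqlsh's actual code):

```python
from datetime import datetime, timezone

def display_seconds(millis: int) -> str:
    """Format epoch milliseconds at second precision, as the table output does."""
    dt = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")

# The two bigint inserts above differ by one millisecond...
a, b = 1451692800000, 1451692800001
# ...but render identically once the milliseconds are truncated
print(display_seconds(a))  # 2016-01-02 00:00:00+0000
print(display_seconds(b))  # 2016-01-02 00:00:00+0000
```

The rows are distinct to Cassandra (different clustering key bytes) but indistinguishable in the displayed output.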

Tried playing with both timezones and bigints to reproduce this... it seems like only the bigint route is reproducible:

 house_id | sensor_id | time_bucket | sensor_time              | sensor_reading
----------+-----------+-------------+--------------------------+----------------
        1 |         2 |           3 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-01 23:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 01:01:00+0000 |           null

edit: Tried some shenanigans using a bigint in place of the datetime insert, managed to reproduce...

Highstead answered Oct 17 '22