nodetool cfstats/tablestats shows the "Compacted partition maximum bytes".
How can I find the key of this partition, or of other huge partitions?
The purpose is to analyse why these partitions are growing so large and to correct the data model accordingly.
I have seen it's possible to see these partition keys in logs, but unfortunately my logs are periodically removed.
Try nodetool tablehistograms -- <keyspace> <table>. This command provides statistics about a table, including read/write latency, partition size, column count, and number of SSTables. For example, it can show that the 95th percentile partition size of the raw_data table is 107MB and the maximum is 3.44GB. Note that it reports the size distribution only, not the partition keys themselves.
A partition key can be defined with multiple table columns, and it determines which node stores the data. For a table with a composite partition key, Cassandra uses multiple columns as the partition key. These columns form logical sets inside a partition to facilitate retrieval.
Cassandra uses a partition key or a composite partition key to determine the placement of the data in a cluster. The clustering key provides the sort order of the data stored within a partition. All of these keys also uniquely identify the data.
The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. Please note that there will not be any error if you insert the same partition key again and again, as there is no constraint check.
You might look at the nodetool toppartitions command, which samples activity on a table and shows the most active partitions. The most active partitions are not necessarily the largest ones, but this sometimes helps to analyze and manage your data.
You can use the Instaclustr tools:
https://www.instaclustr.com/support/documentation/tools/ic-tools-for-cassandra-sstables/
The following command is useful for finding big partitions:
ic-pstats [-n <num>] [-t <snapshot>] [-f <filter>] <keyspace> <column-family>
-n <num> Number of partitions to display in leaders lists
-t <snapshot> Snapshot to analyse (snapshot name from nodetool listsnapshots). Snapshot is created if none is specified.
-f <filter> Comma-separated list of Data.db sstables to filter on
Another useful tool is sstable-tools:
https://github.com/tolbertam/sstable-tools
It has a describe command that shows the widest and largest partitions:
java -jar sstable-tools.jar describe ma-2-big-Data.db
The output is like this:
/Users/clohfink/git/sstable-tools/./src/test/resources/ma-2-big-Data.db
=======================================================================
Partitions: 1
Rows: 1
Tombstones: 0
Cells: 4
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104 (104 B)
Tombstone Leaders:
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Size: 50 (50 B)
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: -1.0
Minimum timestamp: 1455937221199050 (02/19/2016 21:00:21)
Maximum timestamp: 1455937221199050 (02/19/2016 21:00:21)
SSTable min local deletion time: 2147483647 (01/18/2038 21:14:07)
SSTable max local deletion time: 2147483647 (01/18/2038 21:14:07)
TTL min: 0 (0 milliseconds)
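If you run describe over many SSTables, it can help to pull the leader lists out of the text programmatically. The following is only a minimal sketch in Python, assuming the output keeps the layout shown above (a section header such as "Largest Partitions:" followed by lines of the form "[key] number"). The function name leaders and the SAMPLE string are made up for illustration and are not part of sstable-tools.

```python
import re

# One leader entry, e.g. "[frodo] 104 (104 B)": the key in brackets, then a number.
ENTRY = re.compile(r"^\s*\[(?P<key>.*)\]\s+(?P<value>\d+)")


def leaders(describe_output, header="Largest Partitions:"):
    """Return (partition_key, value) pairs listed under the given section header.

    Assumes the layout shown in the sample output: the header on its own line,
    followed by zero or more "[key] number" lines.
    """
    results = []
    in_section = False
    for line in describe_output.splitlines():
        if line.strip() == header:
            in_section = True
            continue
        if in_section:
            match = ENTRY.match(line)
            if match is None:
                break  # first non-entry line ends the section
            results.append((match.group("key"), int(match.group("value"))))
    return results


# Excerpt copied from the sample output above.
SAMPLE = """\
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104 (104 B)
Tombstone Leaders:
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
"""

if __name__ == "__main__":
    for key, size in leaders(SAMPLE):
        print(key, size)
```

Passing header="Widest Partitions:" reads the other leader list in the same way. If your version of the tool prints a different layout, the regular expression will need adjusting.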
Maybe you can use an external tool like Apache Drill or presto-db to run a query like:
SELECT key1, key2, COUNT(*) AS total
FROM yourTable
GROUP BY key1, key2
ORDER BY total DESC
LIMIT 10;
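Where key1 and key2 are the columns of your partition key (include every partition key column in the GROUP BY, otherwise a group can span several partitions).
This query will get the top 10 partitions by number of rows, which is a proxy for partition size rather than the size in bytes.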
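If the rows have already been exported or are being streamed by some client, the same GROUP BY / ORDER BY / LIMIT logic can be done in a few lines of Python. This is only a sketch under assumptions: rows are plain tuples, the partition key columns are given by position, and the function name top_partitions and the sample rows are hypothetical. Like the query, it ranks partitions by row count, not by bytes.

```python
from collections import Counter


def top_partitions(rows, key_indexes, limit=10):
    """Count rows per partition key and return the `limit` largest groups.

    rows        -- iterable of tuples, one per row
    key_indexes -- positions of ALL partition key columns in each row
    """
    counts = Counter(tuple(row[i] for i in key_indexes) for row in rows)
    return counts.most_common(limit)


# Hypothetical rows: (key1, key2, value)
SAMPLE_ROWS = [
    ("sensor-a", "2016-02", 1.0),
    ("sensor-a", "2016-02", 2.0),
    ("sensor-a", "2016-02", 3.0),
    ("sensor-b", "2016-02", 4.0),
    ("sensor-b", "2016-02", 5.0),
    ("sensor-c", "2016-03", 6.0),
]

if __name__ == "__main__":
    for key, total in top_partitions(SAMPLE_ROWS, key_indexes=(0, 1), limit=10):
        print(key, total)
```

Because rows is consumed as an iterator, only the counters are kept in memory, so this works on a stream as long as the number of distinct partition keys fits in memory.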
Hope this can help you.