Cassandra node is unable to start after a HDD failure

I have a 5-node Cassandra 2.0.7 cluster; each node has 4 HDDs. Recently one of the HDDs on node3 failed and was replaced with a new, empty drive. After the replacement, Cassandra on this node was unable to start and failed with this exception:

 INFO [main] 2014-06-02 12:45:17,232 ColumnFamilyStore.java (line 254) Initializing system.paxos
 INFO [main] 2014-06-02 12:45:17,236 ColumnFamilyStore.java (line 254) Initializing system.schema_columns
 INFO [SSTableBatchOpen:1] 2014-06-02 12:45:17,237 SSTableReader.java (line 223) Opening /mnt/disk2/cassandra/system/schema_columns/system-schema_columns-jb-310 (25418 bytes)
 INFO [main] 2014-06-02 12:45:17,241 ColumnFamilyStore.java (line 254) Initializing system.IndexInfo
 INFO [main] 2014-06-02 12:45:17,245 ColumnFamilyStore.java (line 254) Initializing system.peers
 INFO [SSTableBatchOpen:1] 2014-06-02 12:45:17,246 SSTableReader.java (line 223) Opening /mnt/disk3/cassandra/system/peers/system-peers-jb-25 (20411 bytes)
 INFO [main] 2014-06-02 12:45:17,253 ColumnFamilyStore.java (line 254) Initializing system.local
 INFO [SSTableBatchOpen:1] 2014-06-02 12:45:17,254 SSTableReader.java (line 223) Opening /mnt/disk3/cassandra/system/local/system-local-jb-35 (80 bytes)
 INFO [SSTableBatchOpen:2] 2014-06-02 12:45:17,254 SSTableReader.java (line 223) Opening /mnt/disk3/cassandra/system/local/system-local-jb-34 (80 bytes)
 ERROR [main] 2014-06-02 12:45:17,361 CassandraDaemon.java (line 237) Fatal exception during initialization
  org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!
    at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:532)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:233)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)

Because the Cassandra node cannot start, I cannot run nodetool repair on it.

The only way I see to recover the node is to remove all of its data and bootstrap it again from nearly bare metal. Is there a shorter way to recover in a typical HDD failure scenario?

asked Jun 02 '14 09:06 by shutty



1 Answer

I fixed the issue with these steps:

  • Physically removed the files belonging to the system keyspace. Cassandra was then able to start and recreated the system keyspace, but without any metadata about the other keyspaces.

  • Ran nodetool resetlocalschema, which synchronized the keyspace schema from the other nodes.
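
The two steps above could be scripted roughly as follows. This is only a sketch under assumptions, not the exact procedure used: it assumes each data directory keeps the system keyspace in a subdirectory named system (as the paths in the log, such as /mnt/disk2/cassandra/system, suggest), and it moves those directories aside rather than deleting them so the step can be undone. The function names and the backup location are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def find_system_keyspace_dirs(data_dirs):
    """Return the existing <data_dir>/system directories (assumed layout)."""
    found = []
    for data_dir in data_dirs:
        candidate = Path(data_dir) / "system"
        if candidate.is_dir():
            found.append(candidate)
    return found


def move_aside(dirs, backup_root, dry_run=True):
    """Move each directory under backup_root and return the (src, dst) pairs.

    With dry_run=True nothing is touched; the pairs only describe the plan.
    """
    backup_root = Path(backup_root)
    moves = []
    for index, src in enumerate(dirs):
        dst = backup_root / f"system-{index}"
        moves.append((src, dst))
        if not dry_run:
            backup_root.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
    return moves


def reset_local_schema():
    """Run `nodetool resetlocalschema`; call only after Cassandra is up again."""
    subprocess.run(["nodetool", "resetlocalschema"], check=True)
```

With Cassandra stopped, one might call move_aside(find_system_keyspace_dirs(["/mnt/disk2/cassandra", "/mnt/disk3/cassandra"]), "/root/system-backup", dry_run=False), where /root/system-backup is a made-up location, then start Cassandra and call reset_local_schema(). Running with dry_run=True first shows what would be moved.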

answered Oct 03 '22 15:10 by shutty