Hadoop NameNode recovery from metadata backup

Tags:

hadoop

I am testing NN (NameNode) metadata recovery. I have taken a backup of the NameNode and JournalNode metadata, which contains the edit logs and fsimages.

There are two NNs in my system. I back up the metadata on both NNs (HDFS metadata and QJM metadata) at regular intervals. I want to test the recovery procedure in a worst-case scenario: assume both NNs and the JournalNodes are down, with the metadata completely deleted.

I want to recover the NN metadata from backup and start the NN. I understand there could be some data loss, since the latest changes made after the backup will be missing.
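
The periodic backup described above can be sketched as a small script. The directory paths and backup location below are assumptions, not taken from the question; substitute the values of dfs.namenode.name.dir and dfs.journalnode.edits.dir from your hdfs-site.xml.

```shell
#!/bin/bash
# Hedged sketch of a periodic NN/QJM metadata backup.
# Produces one timestamped tarball containing both metadata trees
# (fsimage + edit logs), with member paths recorded relative to /.
backup_nn_metadata() {
  nn_dir="$1"    # dfs.namenode.name.dir (hypothetical: /data/hdfs/name)
  jn_dir="$2"    # dfs.journalnode.edits.dir (hypothetical: /data/hdfs/journal)
  out_root="$3"  # backup destination (hypothetical: /backup/nn-metadata)

  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$out_root"
  # -C / plus stripped leading slashes => archive members are relative to /,
  # so a later restore can unpack straight back into place.
  tar -czf "$out_root/nn-meta-$stamp.tar.gz" -C / "${nn_dir#/}" "${jn_dir#/}"
}
# Example (paths hypothetical):
# backup_nn_metadata /data/hdfs/name /data/hdfs/journal /backup/nn-metadata
```

A cron entry on each NN host would then call backup_nn_metadata with the local directories; it would run on both NNs, since the question backs up both.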

Questions:

  1. Is such a recovery scenario possible/feasible?
  2. I am facing issues with transaction-ID mismatches (txid gaps, and the committed txid moving backward). Is there a solution for these?

Steps tried:

  1. Take a metadata backup of the NN and QJM, then do some HDFS file operations (create new files).
  2. Stop the NN and JournalNode on both machines.
  3. Delete the metadata from the /data/hdfs and journal directories.
  4. Restore the fsimages from the backup (taken some time earlier).
  5. Start the NN. It fails with the exceptions below.

Alternative approach: restore all the edit logs and fsimages to both the HDFS and QJM directories and start the NN, but it still fails.
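
Conversely, a restore in the worst-case drill would wipe whatever is left of the metadata directories and unpack the most recent archive back over them, with the NN and JournalNodes stopped as in step 2 above. This is a sketch under the assumption that the backup tarball stores paths relative to /; all paths in the example are hypothetical.

```shell
#!/bin/bash
# Hedged sketch of restoring NN/QJM metadata from a backup tarball.
# Assumes the archive was created with member paths relative to /
# (e.g. `tar -czf backup.tar.gz -C / data/hdfs/name data/hdfs/journal`).
restore_nn_metadata() {
  archive="$1"   # the tarball to restore
  shift
  # Remove whatever remains of the old metadata first, so stale edit
  # segments cannot mix with the restored ones and produce txid gaps.
  for dir in "$@"; do
    rm -rf "$dir"
  done
  tar -xzf "$archive" -C /
}
# Example (paths hypothetical):
# restore_nn_metadata /backup/nn-meta-20140609.tar.gz /data/hdfs/name /data/hdfs/journal
```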

Both NNs are down and I can't bring them up. I don't want to format HDFS, as that would change the Cluster ID and make the backup unusable.

Exceptions:

  1. There appears to be a gap in the edit log. We expected txid 71453, but got txid 71466
  2. Client trying to move committed txid backward from 71599 to 71453
  3. recoverUnfinalizedSegments failed for required journal. Decided to synchronize log to startTxId: 71453 but logger 10.204.64.26:8485 had seen txid 71599 committed
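
The first exception means the restored edit segments do not cover a contiguous txid range. Since finalized segments are named edits_&lt;starttxid&gt;-&lt;endtxid&gt; (HDFS's naming convention), a gap can be spotted from the file names alone. The following is a sketch; the directory argument is whichever current/ directory holds the edits, and the example path is an assumption:

```shell
# Scan finalized edit segments (HDFS naming: edits_<starttxid>-<endtxid>)
# in a directory and report any gap between consecutive txid ranges.
find_txid_gaps() {
  dir="$1"
  prev_end=""
  for f in $(ls "$dir"/edits_[0-9]*-[0-9]* 2>/dev/null | sort); do
    range=$(basename "$f")
    range=${range#edits_}
    # Strip leading zeros so the txids can be compared numerically.
    start=$(printf '%s\n' "${range%-*}" | sed 's/^0*//')
    end=$(printf '%s\n' "${range#*-}" | sed 's/^0*//')
    if [ -n "$prev_end" ] && [ "$start" -ne $((prev_end + 1)) ]; then
      echo "gap: expected txid $((prev_end + 1)) but next segment starts at $start"
    fi
    prev_end=$end
  done
}
# Example (path is an assumption -- use your dfs.namenode.name.dir):
# find_txid_gaps /data/hdfs/name/current
```

Running it against both the NN and JournalNode edits directories shows which side is missing segments; a gap between a segment ending at 71452 and one starting at 71466 would match the "expected txid 71453, but got txid 71466" message above.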
asked Jun 09 '14 at 11:06 by Vikas Ranjan


2 Answers

  1. Because the latest fsimage and edit logs have been lost or corrupted, you should first try to recover the metadata:

    ./bin/hadoop namenode -recover

    Refer: NameNode Recovery Tools for the Hadoop Distributed File System

  2. Because the JournalNodes are no longer in sync with the NameNode, you should reinitialize the shared edits:

    ./bin/hdfs namenode -initializeSharedEdits

  3. Because the recovered fsimage is missing the data updated since the last backup, you should check for and delete the corrupted files:

    ./bin/hadoop fsck / -delete

    If you do not run fsck, the NameNode may get stuck in safe mode because too many blocks are missing.

answered Sep 21 '22 by secfree


You can start the NameNode with the recover flag enabled; the recovery process will take care of corrupt metadata.

./bin/hadoop namenode -recover
answered Sep 20 '22 by Karthik