Hadoop NameNode recovery from metadata backup

Question

I am testing NameNode (NN) metadata recovery. I have taken a backup of the NameNode and JournalNode metadata, which contains the edit logs and fsimages.

There are two NNs in my system. I back up the metadata on both NNs (HDFS metadata and QJM metadata) at regular intervals. I want to test the recovery procedure in a worst-case scenario: assume both NNs and the JournalNodes are down and their metadata has been completely deleted.

I want to recover the NN metadata from the backup and start the NN. I know there could be data loss, since any changes made after the backup would be missing.

Questions:

Is recovery in such a scenario feasible?
I am running into errors about a txid mismatch and the committed txid. Is there a solution for these?

Steps tried:

Take a metadata backup of the NN and QJM, then do some HDFS file operations (create new files).
Stop the NN and JournalNode on both machines.
Delete the metadata from /data/hdfs and the journal directories.
Restore the fsimages from the backup (taken some time earlier).
Start the NN. It fails with the exceptions below.

Alternative approach: restore all the edit logs and fsimages to both the HDFS and QJM directories and start the NN. It still fails.

Both NNs are down and I can't bring them up. I don't want to format HDFS, because that would change the cluster ID and make the backup unusable.
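Since the cluster ID is the concern here, it may help to confirm that the backup and any restored directories agree on it before starting anything. The sketch below assumes the usual HDFS storage layout, where each storage directory keeps a properties file at current/VERSION containing a clusterID= line. The function names and the example paths are hypothetical, not taken from the question.

```shell
#!/usr/bin/env bash
# Print the clusterID recorded in a storage directory's current/VERSION file.
cluster_id() {
  local dir="$1"
  # VERSION is a Java properties file; keep only the value after "clusterID=".
  sed -n 's/^clusterID=//p' "$dir/current/VERSION"
}

# Succeed only if every given directory reports the same, non-empty clusterID.
same_cluster_id() {
  local first="" id dir
  for dir in "$@"; do
    id="$(cluster_id "$dir")" || return 1
    [ -n "$id" ] || return 1
    if [ -z "$first" ]; then first="$id"; fi
    [ "$id" = "$first" ] || return 1
  done
}

# Example (hypothetical paths):
#   same_cluster_id /backup/nn1/name /data/hdfs/name && echo "cluster IDs match"
```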

Exceptions:

There appears to be a gap in the edit log. We expected txid 71453, but got txid 71466
Client trying to move committed txid backward from 71599 to 71453
recoverUnfinalizedSegments failed for required journal. Decided to synchronize log to startTxId: 71453 but logger 10.204.64.26:8485 had seen txid 71599 committed
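The first exception means the restored fsimage ends at one txid while the oldest usable edit segment starts at a later one. A quick way to see this before starting the NN is to scan the finalized segment file names, which follow the pattern edits_<start>-<end> with zero-padded txids. The following is a minimal sketch of such a check; the function name is hypothetical, and the second argument is the txid you expect next (the fsimage txid plus one).

```shell
#!/usr/bin/env bash
# Report gaps between finalized edit segments (edits_<start>-<end>) in a directory.
# Optional $2: the next txid expected, e.g. the fsimage txid + 1.
# Prints "gap: expected <txid>, found <txid>" per gap; returns 1 if any gap exists.
find_edit_gaps() {
  local dir="$1" expected="${2:-}" gaps=0 name range start end
  # Zero-padded names sort correctly as plain strings; in-progress segments are ignored.
  for name in $(ls "$dir" | grep -E '^edits_[0-9]+-[0-9]+$' | sort); do
    range="${name#edits_}"
    start=$((10#${range%-*}))   # force base 10; leading zeros would otherwise mean octal
    end=$((10#${range#*-}))
    if [ -n "$expected" ]; then
      # Segments that end before the expected txid are already covered by the fsimage.
      [ "$end" -lt "$expected" ] && continue
      if [ "$start" -gt "$expected" ]; then
        echo "gap: expected $expected, found $start"
        gaps=1
      fi
    fi
    expected=$((end + 1))
  done
  return "$gaps"
}
```

Run against a directory whose fsimage ends at 71452 and whose next segment starts at 71466, calling it with 71453 as the second argument reports the same numbers as the NN's error message.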

secfree · Accepted Answer

Because the latest FsImage and edit logs have been lost or corrupted, you should first try to recover the metadata:

./bin/hadoop namenode -recover

Reference: NameNode Recovery Tools for the Hadoop Distributed File System

Because the journal is out of sync with the NameNode, you should reinitialize it:

./bin/hdfs namenode -initializeSharedEdits

Because the recovered FsImage is missing the changes made since the last backup, you should check for and delete the corrupted data:

./bin/hadoop fsck -delete /

If you skip fsck, the NameNode may stay stuck in safe mode, because too many blocks that the restored metadata expects are never reported by the DataNodes.
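The three steps above can be sketched as one script. This is only an outline under stated assumptions: it uses the hdfs command for every step (hadoop namenode and hadoop fsck are the older, deprecated spellings of the same commands), it adds a report-only fsck pass before the destructive one, and it prints commands rather than running them unless DRY_RUN=0 is set. The run and recover_sequence helper names are hypothetical. How you start the NameNode between steps depends on your Hadoop version, so that part is left as a comment.

```shell
#!/usr/bin/env bash
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to execute against a real cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_sequence() {
  # 1. Let the NameNode repair or skip damaged edit log entries.
  run hdfs namenode -recover || return 1
  # 2. Re-seed the shared edits (JournalNodes) from the NameNode's local edits.
  run hdfs namenode -initializeSharedEdits || return 1
  # (Start the NameNode here; fsck needs a running NameNode.)
  # 3a. Report missing or corrupt files first, without changing anything.
  run hdfs fsck / || return 1
  # 3b. Then delete the files whose blocks are gone.
  run hdfs fsck / -delete || return 1
}

# Example: preview the commands without touching the cluster.
#   DRY_RUN=1 recover_sequence
```

Reviewing the dry-run output and the report-only fsck pass first is a deliberate choice, since -delete permanently removes the affected files.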

Karthik · Answer

You can start the NameNode with the recover flag enabled. NameNode recovery will take care of the corrupt metadata.

./bin/hadoop namenode -recover

Tags:

hadoop

Vikas Ranjan
