I have a simple Hadoop job that crawls websites and caches them to HDFS. The mapper checks whether a URL already exists in HDFS; if so, it uses the cached copy, otherwise it downloads the page and saves it to HDFS.
If a network or HTTP error (404, etc.) is encountered while downloading a page, the URL is skipped entirely and nothing is written to HDFS. Whenever I run the job on a small list of ~1000 websites, I always seem to hit the error below, which crashes the job repeatedly on my pseudo-distributed installation. What could be the problem?
I'm running Hadoop 0.20.2-cdh3u3.
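For context, the caching part of the mapper looks roughly like this (a simplified sketch, not the exact code; md5Hex() and download() stand in for the real helpers):

    // Inside the Mapper<LongWritable, Text, ...> subclass
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path cached = new Path("/user/raj/cache/" + md5Hex(url));

        if (!fs.exists(cached)) {
            byte[] page = download(url);   // returns null on 404 / network errors
            if (page == null) {
                return;                    // skip the URL entirely
            }
            FSDataOutputStream out = fs.create(cached);
            out.write(page);
            out.close();
        }

        FSDataInputStream in = fs.open(cached);
        // ... parse the cached page and emit records ...
    }

This is the exception that keeps crashing the job: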
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/raj/cache/9b4edc6adab6f81d5bbb84fdabb82ac0 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
When a file is written to HDFS, its blocks are replicated to multiple DataNodes. When you see this error, it means the NameNode does not have any available DataNode to write the block to; in other words, block replication is not taking place at all.
Some background on replication: every file stored in HDFS is split into fixed-size blocks, and all blocks of a file except the last one are the same size. Each block is replicated for fault tolerance. The default replication factor is 3, so every block gets two additional copies, each stored on a separate DataNode; if one copy becomes inaccessible or corrupted, the data can still be read from another. The factor is configurable per cluster or per file and can be lowered or raised as needed.
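On a pseudo-distributed installation there is only one DataNode, so the effective replication factor has to be 1. A quick way to see what the client is requesting, sketched with the standard dfs.replication property (illustrative, not part of the original job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Replication the client will request for new files (default is 3).
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));

            // On a single-DataNode (pseudo-distributed) setup this is normally 1,
            // set in hdfs-site.xml or overridden programmatically before writing:
            conf.setInt("dfs.replication", 1);

            FileSystem fs = FileSystem.get(conf);
            // Replication of an already-written file can also be changed, e.g.:
            // fs.setReplication(new Path("/user/raj/cache/<file>"), (short) 1);
            fs.close();
        }
    }

That said, the replication factor itself is usually not what triggers "could only be replicated to 0 nodes"; as noted above, the message means no live DataNode was able to accept the block at that moment.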
The problem turned out to be an unclosed FileSystem InputStream in the mapper, used when caching the input to the file system.
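In practice that means every stream opened against HDFS in the mapper has to be closed (in a finally block, or with Hadoop's IOUtils); otherwise each map call leaks a client-side handle, which appears to be what eventually left the single DataNode unable to accept new blocks. A minimal sketch of the corrected pattern (assumed names, not the original code):

    FSDataInputStream in = fs.open(cached);
    try {
        // ... consume the cached page ...
    } finally {
        // org.apache.hadoop.io.IOUtils: closes the stream, safe to call in finally
        IOUtils.closeStream(in);
    }

The same applies to the FSDataOutputStream used when writing a freshly downloaded page to the cache.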