ConnectionLoss for /hbase + Connection reset by peer?

I'm running a Hadoop MapReduce job on my local machine (pseudo-distributed) that reads from and writes to HBase. I'm intermittently getting an error that disrupts the job, even when the machine is left alone with no other significant processes running -- see the log below. The output of a ZooKeeper dump after the job has died looks like this, with the number of clients growing after each failed run:

HBase is rooted at /hbase
Master address: SS-WS-M102:60000
Region server holding ROOT: SS-WS-M102:60020
Region servers:
 SS-WS-M102:60020
Quorum Server Statistics:
 ss-ws-m102:2181
  Zookeeper version: 3.3.3-cdh3u0--1, built on 03/26/2011 00:20 GMT
  Clients:
   /192.168.40.120:58484[1](queued=0,recved=39199,sent=39203)
   /192.168.40.120:37129[1](queued=0,recved=162,sent=162)
   /192.168.40.120:58485[1](queued=0,recved=39282,sent=39316)
   /192.168.40.120:58488[1](queued=0,recved=39224,sent=39226)
   /192.168.40.120:58030[0](queued=0,recved=1,sent=0)
   /192.168.40.120:58486[1](queued=0,recved=39248,sent=39267)

My development team is currently using the CDH3u0 distribution, so HBase 0.90.1 -- is this an issue that has been solved in a more recent release? Or is there something I can do with the current setup? Should I just expect to restart ZK and kill off clients periodically? I'm open to any reasonable option that will allow my jobs to complete consistently.

2012-06-27 13:01:07,289 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server SS-WS-M102/192.168.40.120:2181
2012-06-27 13:01:07,289 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to SS-WS-M102/192.168.40.120:2181, initiating session
2012-06-27 13:01:07,290 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server SS-WS-M102/192.168.40.120:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
    at sun.nio.ch.IOUtil.read(IOUtil.java:169)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:858)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)
[lines above repeat 6 more times]
2012-06-27 13:01:17,890 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:991)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:302)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:293)
    at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:156)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:989)
    ... 15 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
    ... 16 more
Asked Jun 27 '12 by Cyranix


3 Answers

It turns out that I was hitting ZooKeeper's low default connection limit, maxClientCnxns (which I believe has been raised in more recent releases). I had tried setting a higher limit in hbase-site.xml:

<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>35</value>
</property>

But it didn't seem to work unless it was (also?) specified in zoo.cfg:

# can put this number much higher if desired
maxClientCnxns=35

The job can now run for hours and my ZK client list peaks at 12 entries.

Answered by Cyranix


Check the following parameters:

ZooKeeper session timeout (zookeeper.session.timeout): try increasing it and check.

ZooKeeper tick time (tickTime): increase it and test.
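As a rough sketch of the first knob (the value is a placeholder): zookeeper.session.timeout can be bumped on the job's Configuration instead of, or in addition to, hbase-site.xml, while tickTime itself is a server-side setting in zoo.cfg.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TimeoutTuning {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Ask for a longer ZK session than you currently get. Note that the
        // ZK server caps sessions at 20 * tickTime by default, so tickTime in
        // zoo.cfg may also need raising for a large value to take effect.
        conf.setInt("zookeeper.session.timeout", 300000); // 5 minutes, placeholder
        // ... then build the Job / TableMapReduceUtil setup from this conf as usual
    }
}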

Check ulimit (a Linux command) for the user that Hadoop/HBase runs as.

For ulimit, the following limits should be set to fairly high values:

open files: around 32K or more (e.g. nofile 32768 in /etc/security/limits.conf)

max user processes: unlimited (or a very high value)

After making these changes, run the job again; most likely the error will be gone.

Answered by Infinity


I've had issues similar to this in the past. A lot of the time with HBase/Hadoop you'll see error messages that don't point to the true issue, so you have to get creative.

This is what I've found and it may or may not apply to you:

Are you opening a lot of connections to a table, and are you closing them when finished? This can happen in an MR job if you are performing Scans/Gets in the Mapper or Reducer (which I don't think you want to do if it can be avoided). See the sketch below for reusing a single connection per task.
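As a hedged sketch of that point, here is roughly what reusing one table handle per task looks like with the 0.90-era API (the side table name "lookup_table" and the key/value handling are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class LookupMapper extends TableMapper<ImmutableBytesWritable, Text> {

    private HTable lookupTable; // one table handle per task, not per record

    @Override
    protected void setup(Context context) throws IOException {
        // Open the side table once per map task so each record does not
        // create (and potentially leak) a fresh HBase/ZooKeeper connection.
        lookupTable = new HTable(context.getConfiguration(), "lookup_table");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        Result side = lookupTable.get(new Get(row.get()));
        if (!side.isEmpty()) {
            context.write(row, new Text(Bytes.toString(side.value())));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (lookupTable != null) {
            lookupTable.close(); // releases the underlying connection resources
        }
    }
}

In newer HBase releases you would use Connection/Table instead of HTable, but the idea is the same: create the handle once in setup() and close it in cleanup().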

Also, sometimes I get similar issues if my Mapper or Reducer is writing to the same row a LOT. Try to distribute your writes or minimize them to reduce this problem.

It would also help if you went into detail about the nature of your MR job. What does it do? Do you have example code?

Answered by Tucker