用jstack发现的一例hbase region server线程死锁

最近在升级hadoop集群的过程中,hbase偶尔出现无法读写的情况,看region server进程并没有挂掉,于是用jstack看下是什么情况:

Found one Java-level deadlock:
=============================
'IPC Server handler 127 on 60020':
  waiting to lock monitor 0x00007fa185def330 (object 0x000000057becf198, a java.lang.Object),
  which is held by 'IPC Server handler 21 on 60020'
'IPC Server handler 21 on 60020':
  waiting to lock monitor 0x00007fa1984780b0 (object 0x000000057b96e178, a java.lang.Object),
  which is held by 'regionserver60020.logRoller'
'regionserver60020.logRoller':
  waiting to lock monitor 0x00007fa185def330 (object 0x000000057becf198, a java.lang.Object),
  which is held by 'IPC Server handler 21 on 60020'

Java stack information for the threads listed above:
===================================================
'IPC Server handler 127 on 60020':
        at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1305)
        - waiting to lock <0x000000057becf198> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1460)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2231)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:1968)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3428)
        at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1389)
'IPC Server handler 21 on 60020':
 at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1314)
        - waiting to lock <0x000000057b96e178> (a java.lang.Object)
        - locked <0x000000057becf198> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1460)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2231)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:1968)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3428)
        at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1389)
'regionserver60020.logRoller':
        at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1305)
        - waiting to lock <0x000000057becf198> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:876)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:657)
        - locked <0x000000057b96e178> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
        at java.lang.Thread.run(Thread.java:662)

Found 1 deadlock.

显然,这是logRoller和HLog.syncer两个线程发生了死锁。

在Hbase中,采用WAL(Write Ahead Log)的方式来保证数据不丢失。一个Region Server对应一个HLog,任何数据都是先写HLog,然后才真正的写MemStore。HLog文件有个轮转的过程,当达到默认大小(一般是block size * 0.95)就会写一个新的HLog文件,logRoller线程就是做这个事情的。而buffer中的数据每隔一秒钟会写一次HLog文件,这个是HLog.syncer线程负责的。logRoller和HLog.syncer进程不能同时进行,不然会导致数据问题,于是就需要通过锁来进行互斥了。

查看了下代码,是在HLog.syncer往HDFS写文件发生IO Exception时会触发死锁。一番google后发现是我们集群使用的hbase 0.94.1版本的一个bug,目前在社区已经修复了:https://issues.apache.org/jira/browse/HBASE-7728

其实频繁触发Hbase的这个bug主要还是因为这次hadoop集群升级引起的。由于hadoop集群升级后导致集群的diskIO变高,加上有些大的任务在测试,HLog.syncer发生IO Exception的几率就更高了,所以RS假死就更频繁了。

发表评论

电子邮件地址不会被公开。 必填项已用*标注