Troubleshooting data loss caused by a Hive bug on a YARN cluster

In the early hours of May 1, a puzzling problem hit production: a Hive job reported its status as successful, but a large amount of data was missing. It should have written 200 files with more than 300 million rows, yet the output was a single part file with just over 1.8 million rows! Fortunately a downstream check caught this huge discrepancy, otherwise the consequences would have been serious.

PS: the cluster runs hadoop-2.2.0-cdh5.0.0-beta-1 with Hive 0.11, at a scale of roughly 250 nodes.

The Hive job's log looked perfectly normal:

2014-05-01 01:52:13,159 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 12415.64 sec
2014-05-01 01:52:14,194 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 12421.76 sec
2014-05-01 01:52:16,062 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12424.53 sec
MapReduce Total cumulative CPU time: 0 days 3 hours 27 minutes 4 seconds 530 msec
Ended Job = job_1397554303529_84372
Loading data to table default.thor_uic_b_key
... ...
371473512 Rows loaded to tic_b_key
MapReduce Jobs Launched:
Job 0: Map: 1188  Reduce: 200   Cumulative CPU: 12424.53 sec   HDFS Read: 25372111957 HDFS Write: 3756581870 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 27 minutes 4 seconds 530 msec
OK

The jobhistory page also showed the job as successful, and the counters reported 371473512 output records.

So why did the final output directory contain only one file?

[admin@s0001.cm6 ~]$ hadoop fs -ls /user/hive/warehouse/tic_b_key
Found 1 items
-rw-rw-r--   3 admin admin   18815170 2014-05-01 01:52 /user/hive/warehouse/tic_b_key/000078_1000

Checking hdfs-audit.log revealed that, apart from 000078_1000, the other 199 files all had delete operations recorded against them:

2014-05-01 01:52:16,292 INFO FSNamesystem.audit: allowed=true ugi=admin (auth:SIMPLE) ip=/10.246.14.72 cmd=delete src=/tmp/hive-admin/hive_2014-05-01_01-46-44_962_6776205249541775191/_tmp.-ext-10000.intermediate/000198_1000 dst=null perm=null
2014-05-01 01:52:16,292 INFO FSNamesystem.audit: allowed=true ugi=admin (auth:SIMPLE) ip=/10.246.14.72 cmd=delete src=/tmp/hive-admin/hive_2014-05-01_01-46-44_962_6776205249541775191/_tmp.-ext-10000.intermediate/000199_1000 dst=null perm=null
2014-05-01 01:52:16,295 INFO FSNamesystem.audit: allowed=true ugi=admin (auth:SIMPLE) ip=/10.246.14.72 cmd=delete src=/tmp/hive-admin/hive_2014-05-01_01-46-44_962_6776205249541775191/_task_tmp.-ext-10000 dst=null perm=null
2014-05-01 01:52:16,445 INFO FSNamesystem.audit: allowed=true ugi=admin (auth:SIMPLE) ip=/10.246.14.72 cmd=delete src=/tmp/hive-admin/hive_2014-05-01_01-46-44_962_6776205249541775191 dst=null perm=null

Analysis of hdfs-audit.log made it clear that the job really did produce 200 files, but 199 of them were deleted, leaving only the single file 000078_1000.

As for why they were deleted, my guess at the time was that some condition had been triggered and that these were explicit HDFS delete calls.

The jobhistory page also revealed another clue: this Hive job's AppMaster had failed once, and the job only succeeded after a retry.

Since our YARN cluster is configured with yarn.resourcemanager.am.max-attempts=10, the job is retried after an AppMaster failure.
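
For reference, YarnConfiguration exposes this key as RM_AM_MAX_ATTEMPTS. A minimal sketch (the class name is just for illustration) that prints the effective value on a machine with the cluster's yarn-site.xml on its classpath:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative sketch: print the effective AM max-attempts setting.
// Assumes the cluster's yarn-site.xml (which sets it to 10 here) is on the classpath;
// the fallback is YARN's built-in default, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS.
public class ShowAmMaxAttempts {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        System.out.println(YarnConfiguration.RM_AM_MAX_ATTEMPTS + " = "
            + conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
                          YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));
    }
}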


After some digging, this turned out to be a Hive bug. See https://issues.apache.org/jira/browse/HIVE-6309 for details.

Our current Hive version, 0.11, also has this problem. The specific cause:

Once the AppMaster has failed, YARN numbers the new TaskAttemptIDs starting from a multiple of 1000 ((appAttemptId - 1) * 1000), so that attempts from the new generation don't collide with those from the previous one. The code is in org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java:

    // All the new TaskAttemptIDs are generated based on MR
    // ApplicationAttemptID so that attempts from previous lives don't
    // over-step the current one. This assumes that a task won't have more
    // than 1000 attempts in its single generation, which is very reasonable.
    nextAttemptNumber = (appAttemptId - 1) * 1000;
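
To make the numbering concrete, a tiny standalone sketch (not Hadoop code): in the first AM generation appAttemptId is 1, so task attempt numbers start at 0 and output files look like 000078_0; after one AM failure appAttemptId becomes 2 and attempt numbers start at 1000, which is where file names like 000078_1000 come from.

public class AttemptNumbering {
    public static void main(String[] args) {
        for (int appAttemptId = 1; appAttemptId <= 2; appAttemptId++) {
            // Same formula as in TaskImpl above: generation 2 starts numbering at 1000.
            int nextAttemptNumber = (appAttemptId - 1) * 1000;
            System.out.println("AM attempt " + appAttemptId
                + " -> task attempt numbers start at " + nextAttemptNumber);
        }
    }
}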

Hive, on the other hand, cleans up the output after the job completes: because task retries can leave multiple output files for the same task, only the largest file per task is kept. See ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:

  @Override
  public void jobCloseOp(Configuration hconf, boolean success, JobCloseFeedBack feedBack)
      throws HiveException {
    try {
      if ((conf != null) && isNativeTable) {
        String specPath = conf.getDirName();
        DynamicPartitionCtx dpCtx = conf.getDynPartCtx();
        if (conf.isLinkedFileSink() && (dpCtx != null)) {
          specPath = conf.getParentDir();
        }
        Utilities.mvFileToFinalPath(specPath, hconf, success, LOG, dpCtx, conf,
          reporter);
      }

The key is this mvFileToFinalPath call, which uses a regular expression to decide whether files belong to the same task, defined in ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:

  private static final Pattern FILE_NAME_PREFIXED_TASK_ID_REGEX =
      Pattern.compile("^.*?((\\(.*\\))?[0-9]+)(_[0-9]{1,3})?(\\..*)?$");

And here is the problem: this regular expression cannot correctly extract the taskId when the attempt ID is greater than 999!

Test code:

    public static void main(String[] args) {
        getPrefixedTaskIdFromFilename("attempt_1397554303529_84372_m_000012_0");
        getPrefixedTaskIdFromFilename("attempt_1397554303529_84372_m_000012_234");
        getPrefixedTaskIdFromFilename("attempt_1397554303529_84372_m_000013_0");
        getPrefixedTaskIdFromFilename("attempt_1397554303529_84372_m_000012_1000");
        getPrefixedTaskIdFromFilename("attempt_1397554303529_84372_m_000013_1000");
    }

    // Applies FILE_NAME_PREFIXED_TASK_ID_REGEX (shown above) to the file name and prints
    // the task id that Hive extracts from it.
    private static void getPrefixedTaskIdFromFilename(String filename) {
        Matcher m = FILE_NAME_PREFIXED_TASK_ID_REGEX.matcher(filename);
        System.out.println("TaskId for " + filename + " = " + (m.matches() ? m.group(1) : filename));
    }

Output:

TaskId for attempt_1397554303529_84372_m_000012_0 = 000012
TaskId for attempt_1397554303529_84372_m_000012_234 = 000012
TaskId for attempt_1397554303529_84372_m_000013_0 = 000013
TaskId for attempt_1397554303529_84372_m_000012_1000 = 1000
TaskId for attempt_1397554303529_84372_m_000013_1000 = 1000
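
So every second-generation output file is parsed to the same "taskId" of 1000. During mvFileToFinalPath, Hive groups the output files by that extracted taskId and keeps only the largest file per group, deleting the rest as presumed duplicates from task retries; with all 200 files falling into one group, 199 of them are deleted, which is exactly what the audit log shows. Below is a simplified, self-contained sketch of that grouping behavior (not the actual Hive implementation; the file names and sizes are made up, except that 18815170 matches the size of the file that survived):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch, NOT the real Hive cleanup code: group files by the task id extracted
// with FILE_NAME_PREFIXED_TASK_ID_REGEX and keep only the largest file in each group.
// With second-generation attempt numbers, every file lands in the "1000" group.
public class DedupSketch {
    private static final Pattern FILE_NAME_PREFIXED_TASK_ID_REGEX =
        Pattern.compile("^.*?((\\(.*\\))?[0-9]+)(_[0-9]{1,3})?(\\..*)?$");

    public static void main(String[] args) {
        Map<String, Long> fileSizes = new HashMap<String, Long>();   // made-up sizes
        fileSizes.put("000012_1000", 10000000L);
        fileSizes.put("000013_1000", 12000000L);
        fileSizes.put("000078_1000", 18815170L);                     // the file that survived

        Map<String, String> keptPerTaskId = new HashMap<String, String>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            Matcher m = FILE_NAME_PREFIXED_TASK_ID_REGEX.matcher(e.getKey());
            String taskId = m.matches() ? m.group(1) : e.getKey();   // "1000" for all three names
            String kept = keptPerTaskId.get(taskId);
            if (kept == null || fileSizes.get(kept) < e.getValue()) {
                if (kept != null) {
                    System.out.println("delete " + kept);            // smaller "duplicate" removed
                }
                keptPerTaskId.put(taskId, e.getKey());
            } else {
                System.out.println("delete " + e.getKey());
            }
        }
        System.out.println("kept: " + keptPerTaskId.values());       // only 000078_1000 remains
    }
}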

From YARN's point of view, a single task will rarely be retried more than 1000 times within one generation, so starting attempt IDs at multiples of 1000 is reasonable. But Hive was designed against Hadoop 1.0, where AppMaster retries did not exist, so it only accounts for attempt IDs below 1000.
