Hadoop JobTracker startup failure analysis: org.apache.hadoop.mapred.JobTracker: java.lang.NullPointerException

Today I applied a patch, recompiled hadoop-core-0.20.2-CDH3B4.jar, pushed it out to all the nodes, and then restarted the whole Hadoop cluster. The NameNode and SecondaryNameNode came up fine, but the JobTracker failed to start. Its log showed the following:

2011-07-12 00:21:40,744 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2011-07-12 00:21:40,772 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 8000
2011-07-12 00:21:40,775 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 8000 webServer.getConnectors()[0].getLocalPort() returned 8000
2011-07-12 00:21:40,775 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 8000
2011-07-12 00:21:40,775 INFO org.mortbay.log: jetty-6.1.26
2011-07-12 00:21:41,058 INFO org.mortbay.log: Started SelectChannelConnector@hadoop001.sqa:8000
2011-07-12 00:21:41,060 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2011-07-12 00:21:41,101 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2011-07-12 00:21:41,101 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 8000
2011-07-12 00:21:42,196 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 0 time(s).
2011-07-12 00:21:43,202 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 1 time(s).
2011-07-12 00:21:44,204 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 2 time(s).
2011-07-12 00:21:45,206 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 3 time(s).
2011-07-12 00:21:46,209 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 4 time(s).
2011-07-12 00:21:47,211 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 5 time(s).
2011-07-12 00:21:48,213 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 6 time(s).
2011-07-12 00:21:49,215 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 7 time(s).
2011-07-12 00:21:50,218 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 8 time(s).
2011-07-12 00:21:51,221 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop001.sqa/10.1.1.1:9000. Already tried 9 time(s).
2011-07-12 00:21:51,221 FATAL org.apache.hadoop.mapred.JobTracker: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2290)
        at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2043)  
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:294)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:286)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4767)

2011-07-12 00:21:51,222 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:

Clearly a fatal NullPointerException is thrown while the JobTracker starts up. Since the message itself gives very little to go on, the only way to pin the problem down is to read the Hadoop source. Line 2290 of mapred/org/apache/hadoop/mapred/JobTracker.java in the CDH3B4 source tree is:

      } catch (IOException ie) {
        LOG.info("problem cleaning system directory: " +
                 systemDir.makeQualified(fs), ie);
      }

So the error is reported while cleaning up a system directory. Which directory, exactly?

From the JobTracker log, the last normal line printed is:

2011-07-12 00:21:41,101 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 8000

Comparing against the source, that message comes from the LOG.info call at line 2207. Working forward from there to line 2290, the problematic code can be located as follows:

        if (!fs.exists(systemDir)) {
          LOG.info("Creating the system directory");
          if (FileSystem.mkdirs(fs, systemDir, 
                                new FsPermission(SYSTEM_DIR_PERMISSION))) {
            // success
            break;
          } else {
            LOG.error("Mkdirs failed to create " + systemDir);
          }
        } else {
          LOG.info("Cleaning up the system directory");
          fs.setPermission(systemDir, new FsPermission(SYSTEM_DIR_PERMISSION));
          // It exists, just set permissions and clean the contents up
          deleteContents(fs, systemDir);
          break;
        }
      } catch (AccessControlException ace) {
        LOG.warn("Failed to operate on mapred.system.dir (" +
                 systemDir.makeQualified(fs) +
                 ") because of permissions.");
        LOG.warn("This directory should be owned by the user '" +
                 UserGroupInformation.getCurrentUser() + "'");
        LOG.warn("Bailing out ... ", ace);
        throw ace;
      } catch (IOException ie) {
        LOG.info("problem cleaning system directory: " +
                 systemDir.makeQualified(fs), ie);
      }

It looks like execution takes the else branch (the one that logs "Cleaning up the system directory") and fails while cleaning the system directory in the deleteContents(fs, systemDir) call.

So what exactly is this systemDir? Looking further into the code:

  public String getSystemDir() {
    Path sysDir = new Path(conf.get("mapred.system.dir", "/tmp/hadoop/mapred/system"));  
    return fs.makeQualified(sysDir).toString();
  }

So the directory itself is the problem. mapred.system.dir is defined in mapred-site.xml; it is the shared directory MapReduce uses at run time to store control files for running jobs. It cannot be a local directory, only an HDFS one, and it may be given as a relative path (relative to the current user's HDFS home directory).

The mapred-site.xml on this cluster defines it like this:

<property>
  <name>mapred.system.dir</name>
  <value>mapred/system</value>
  <description>The directory where MapReduce stores control files.</description>
</property>
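
With a relative value like mapred/system, the JobTracker qualifies the path against the default file system and the current user's HDFS home directory; that is what the fs.makeQualified() call in getSystemDir() above does. Here is a minimal sketch of how to print the resolved path yourself, assuming the cluster's configuration files are on the classpath (the class name PrintSystemDir is my own, not part of Hadoop):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class PrintSystemDir {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();           // loads mapred-default.xml and mapred-site.xml
    FileSystem fs = FileSystem.get(conf);   // the default (HDFS) file system
    Path sysDir = new Path(conf.get("mapred.system.dir",
                                    "/tmp/hadoop/mapred/system"));
    // A relative value such as "mapred/system" ends up under the user's HDFS
    // home directory, e.g. hdfs://hadoop001.sqa:9000/user/hadoop/mapred/system
    System.out.println(fs.makeQualified(sysDir));
  }
}

On this cluster, mapred/system therefore resolves to /user/hadoop/mapred/system, which matches the hadoop fs -ls output shown at the end of this post.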

The permissions on this directory get set to 0700, and its owner and group depend on which user runs the Hadoop cluster:

//system directory is completely owned by the JobTracker
final static FsPermission SYSTEM_DIR_PERMISSION =
  FsPermission.createImmutable((short) 0700); // rwx------

FileStatus systemDirStatus = fs.getFileStatus(systemDir);
if (!systemDirStatus.getOwner().equals(
    getMROwner().getShortUserName())) {
  throw new AccessControlException("The systemdir " + systemDir +
      " is not owned by " + getMROwner().getShortUserName());
}
if (!systemDirStatus.getPermission().equals(SYSTEM_DIR_PERMISSION)) {
  LOG.warn("Incorrect permissions on " + systemDir +
      ". Setting it to " + SYSTEM_DIR_PERMISSION);
  fs.setPermission(systemDir,new FsPermission(SYSTEM_DIR_PERMISSION));
}

Checking the cluster, the permissions on this directory were indeed wrong. From the code above we also know that if the directory does not exist, the JobTracker simply creates it at startup.
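
One way to confirm this from outside the JobTracker is to repeat the exact owner/permission test shown above against the directory. The following is my own small diagnostic sketch (CheckSystemDir is not part of Hadoop); it takes the HDFS path to check as its argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.security.UserGroupInformation;

public class CheckSystemDir {
  public static void main(String[] args) throws Exception {
    Path sysDir = new Path(args[0]);   // e.g. /user/hadoop/mapred/system
    FileSystem fs = sysDir.getFileSystem(new Configuration());
    FileStatus st = fs.getFileStatus(sysDir);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    System.out.println(sysDir + " owner=" + st.getOwner()
                       + " perm=" + st.getPermission());
    if (!st.getOwner().equals(user)) {
      System.out.println("not owned by the current user " + user);
    }
    if (!st.getPermission().equals(FsPermission.createImmutable((short) 0700))) {
      System.out.println("permission is not rwx------ (0700)");
    }
  }
}

A plain hadoop fs -ls of the directory tells you the same thing, of course; the sketch just mirrors the checks the JobTracker itself performs.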

So I simply deleted the directory:

hadoop fs -rmr /user/hadoop/mapred

Then I restarted the cluster. The NameNode and SecondaryNameNode came up again, but the JobTracker ran for a short while and then exited with the same exception.

A search turned up a related bug report: "NullPointerException in Jobtracker when it is started without Name Node".

The NameNode had only just started and was still in safe mode, so the JobTracker could not reach it and died with this NullPointerException. Presumably the FileSystem handle (and with it systemDir) never gets initialized while the NameNode is unreachable, so the systemDir.makeQualified(fs) call in the catch block at line 2290 itself ends up dereferencing null.
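
To avoid racing the NameNode like this, the JobTracker simply has to be started after HDFS has left safe mode. Here is a minimal sketch of how to poll for that, assuming the 0.20/CDH3B4 client API and that the cluster configuration is on the classpath (the class name WaitForSafeMode is my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants.SafeModeAction;

public class WaitForSafeMode {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    // SAFEMODE_GET only asks whether safe mode is on; it does not change it.
    while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {
      System.out.println("NameNode is still in safe mode, waiting ...");
      Thread.sleep(5000);
    }
    System.out.println("Safe mode is off; the JobTracker can be started now.");
  }
}

The stock hadoop dfsadmin -safemode wait command does essentially the same thing, so it can also be dropped into a start script before hadoop-daemon.sh start jobtracker.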

Once the NameNode had left safe mode, I started the JobTracker again by hand:

./hadoop-daemon.sh start jobtracker

This time the cluster started up normally. The JobTracker and DFS web UIs both displayed fine; the cluster was OK again.

BTW: after the restart, the healthy state looks like this:

[hadoop@hadoop001 hadoop-logs]$ ~/hadoop/bin/hadoop fs -ls /user/hadoop/
Found 2 items
drwxrwxrwx   - hadoop admin          0 2011-07-11 13:24 /user/hadoop/.Trash
drwxrwxrwx   - hadoop admin          0 2011-07-12 00:47 /user/hadoop/mapred
[hadoop@hadoop001 hadoop-logs]$ ~/hadoop/bin/hadoop fs -ls /user/hadoop/mapred
Found 1 items
drwx------   - hadoop admin          0 2011-07-12 00:50 /user/hadoop/mapred/system
[hadoop@hadoop001 hadoop-logs]$ ~/hadoop/bin/hadoop fs -ls /user/hadoop/mapred/system
Found 1 items
-rw-------   2 hadoop admin          4 2011-07-12 00:47 /user/hadoop/mapred/system/jobtracker.info
