Hadoop MapReduce 2.0 (YARN): Installation, Configuration, and Running in Detail

In the Hadoop 0.19/0.20 era, the NameNode and JobTracker were single points of failure that constrained Hadoop's growth: at around 2,000 nodes, the NN and JT were already overloaded. Starting with version 0.23 (also known as 2.0), Hadoop supports multiple (federated) NameNodes, scaling the NameNode horizontally so a cluster can grow to tens of thousands of nodes. NameNode HA is also within reach.

Several blog posts already cover the changes in Hadoop 2.0 quite well, so I won't repeat them here:

http://blog.sina.com.cn/s/blog_4a1f59bf01010i9r.html

http://yanbohappy.sinaapp.com/?p=32

http://dongxicheng.org/mapreduce-nextgen/nextgen-mapreduce-introduction/

This post focuses on installing and running Hadoop 2.0.

1. Machine roles

There are three machines: 10.10.41.71, 10.10.41.80, and 10.10.41.81.

10.10.41.71 and 10.10.41.80 each act as a NameNode and SecondaryNameNode.

10.10.41.81 acts as the JobHistoryServer and ResourceManager.

All three machines are DataNodes.

First, make sure the three machines can ssh to each other without a password.
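
The passwordless login can be set up with a standard key exchange (a minimal sketch; the host list is an assumption based on the machines above, and ssh-copy-id appends your public key to each remote authorized_keys):

```shell
# Generate a key pair once (no passphrase), then push it to every node.
HOSTS="10.10.41.71 10.10.41.80 10.10.41.81"   # hypothetical list; adjust to your cluster

[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa

for h in $HOSTS; do
    ssh-copy-id "$h"
done
```

After this, running ssh from any node to any other should not prompt for a password.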

2. Installing the package and the Java environment

I used version 0.23.1; just wget it and unpack with tar zxvf:

wget http://labs.renren.com/apache-mirror/hadoop/common/hadoop-0.23.1/hadoop-0.23.1.tar.gz

In Hadoop 0.23, every node uses the same configuration files; there is no need for separate NameNode and DataNode configs. Configure one machine and distribute the files to the rest.

Edit .bashrc and add the following:

export JAVA_HOME=$HOME/java

export HADOOP_DEV_HOME=$HOME/hadoop-0.23.1
export HADOOP_MAPRED_HOME=${HADOOP_DEV_HOME}
export HADOOP_COMMON_HOME=${HADOOP_DEV_HOME}
export HADOOP_HDFS_HOME=${HADOOP_DEV_HOME}
export YARN_HOME=${HADOOP_DEV_HOME}
export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export HADOOP_LOG_DIR=${HADOOP_DEV_HOME}/logs

export PATH=${HADOOP_DEV_HOME}/bin:$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib:$CLASSPATH

Before running any commands, remember to source the environment variables:

source ~/.bashrc

3. Editing the configuration files

cd /home/baoniu/hadoop-0.23.1/etc/hadoop

All three of my machines are DataNodes, so the slaves file becomes:
10.10.41.71
10.10.41.80
10.10.41.81
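
If you script the setup, the whole slaves file can be written in one step (a sketch; the path assumes the HADOOP_CONF_DIR export from the .bashrc above):

```shell
# Write the DataNode host list into the slaves file in one shot.
cat > "$HADOOP_CONF_DIR/slaves" <<'EOF'
10.10.41.71
10.10.41.80
10.10.41.81
EOF
```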

Since JAVA_HOME and the other variables are set in .bashrc, yarn-env.sh and hadoop-env.sh need no changes.

The files that do need editing are mapred-site.xml, hdfs-site.xml, core-site.xml, and yarn-site.xml.

Their contents are as follows:

(1) core-site.xml

<configuration>
  <!-- file system properties -->
  <property>
    <name>fs.trash.interval</name>
    <value>360</value>
    <description>Number of minutes between trash checkpoints.
      If zero, the trash feature is disabled.
    </description>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/baoniu/hadoop-0.23.1/temp</value>
  </property>

</configuration>

(2) hdfs-site.xml

<configuration>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/baoniu/hadoop-0.23.1/nn_dir</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
  </property>

  <property>
    <name>dfs.federation.nameservices</name>
    <value>ns1,ns2</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>10.10.41.71:9000</value>
  </property>

  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>10.10.41.71:23001</value>
  </property>

  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>10.10.41.71:23002</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>10.10.41.80:9000</value>
  </property>

  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>10.10.41.80:23001</value>
  </property>

  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>10.10.41.80:23002</value>
  </property>

</configuration>

(3) mapred-site.xml

<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>10.10.41.81:10020</value>
  </property>

  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>10.10.41.81:19888</value>
  </property>

</configuration>

Pay attention to the mapreduce.framework.name=yarn setting: without it, submitted jobs run in local mode rather than distributed mode on the cluster.

(4) yarn-site.xml

<configuration>

  <property>
    <description>The address of the resource tracker interface.</description>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>10.10.41.81:8025</value>
  </property>

  <property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>10.10.41.81:8040</value>
  </property>

  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>10.10.41.81:8030</value>
  </property>

  <property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>10.10.41.81:8141</value>
  </property>

  <property>
    <description>The address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>10.10.41.81:8088</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/baoniu/hadoop_data/local</value>
  </property>

  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/home/baoniu/hadoop_data/logs</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>

</configuration>

Once the configuration files are ready, copy the entire hadoop-0.23.1 directory to the other machines.

I wrote a small script to distribute the files:

#!/bin/sh
# Create the NodeManager local/log dirs on each host, then push the
# configured tree. Hosts are read one per line from ~/host.txt.

hosts=$(cat ~/host.txt)

for x in $hosts
do
        echo "===== $x ====="
        ssh "$x" "mkdir -p /home/baoniu/hadoop_data/local /home/baoniu/hadoop_data/logs"
        scp -r ~/hadoop-0.23.1/ "$x":~/
done

host.txt simply lists the machines to distribute to:

10.10.41.80
10.10.41.81

4. Starting the cluster

(1) Starting HDFS

On 10.10.41.71, run:

/home/baoniu/hadoop-0.23.1/sbin/start-dfs.sh

This starts the NameNode and SecondaryNameNode on 10.10.41.71 and 10.10.41.80, along with a DataNode on each of the three machines. jps shows the running daemons:

[baoniu@v010071 sbin]$ jps
2372 SecondaryNameNode
1968 NameNode
2181 DataNode
3655 Jps

On 10.10.41.81, only a DataNode was started:

[baoniu@v010081 hadoop-0.23.1]$ jps
19653 DataNode
19840 Jps

(2) Starting the ResourceManager and NodeManagers (the new daemons in YARN)

On 10.10.41.81, run:

/home/baoniu/hadoop-0.23.1/sbin/start-yarn.sh

This starts the ResourceManager on 10.10.41.81 and a NodeManager on each of the three machines:

[baoniu@v010081 hadoop-0.23.1]$ jps
19653 DataNode
19927 ResourceManager
20443 Jps
20050 NodeManager

[baoniu@v010071 sbin]$ jps
2372 SecondaryNameNode
1968 NameNode
3775 NodeManager
5326 Jps
2181 DataNode

(3) Starting the JobHistoryServer

[baoniu@v010081 hadoop-0.23.1]$ sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/baoniu/hadoop-0.23.1/logs/yarn-baoniu-historyserver-v010081.sqa.cm4.out
[baoniu@v010081 hadoop-0.23.1]$ jps
20517 JobHistoryServer
19653 DataNode
19927 ResourceManager
20572 Jps
20050 NodeManager

OK, at this point the whole cluster is up and running.
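
Rather than logging in to each box, the daemon check can be done in one pass from any node (a sketch reusing the passwordless ssh set up earlier; grep -v Jps drops the jps process itself from each listing):

```shell
# List the Hadoop daemons running on every node.
for h in 10.10.41.71 10.10.41.80 10.10.41.81; do
    echo "===== $h ====="
    ssh "$h" 'jps | grep -v Jps'
done
```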

Since we configured two NameNodes, access is slightly different from before:

[baoniu@v010071 ~]$ hadoop fs -ls hdfs://10.10.41.80:9000/
Found 1 items
drwxr-xr-x   - baoniu supergroup          0 2012-07-10 21:45 hdfs://10.10.41.80:9000/test2
[baoniu@v010071 ~]$ hadoop fs -ls hdfs://10.10.41.71:9000/
Found 7 items
drwxr-xr-x   - baoniu supergroup          0 2012-07-10 17:14 hdfs://10.10.41.71:9000/input

If you don't want to spell out the hdfs://host:port prefix on every access, add the following to core-site.xml to designate a default NameNode:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.10.41.71:9000</value>
  </property>

5. Running a job

Run the bundled wordcount example:

#!/bin/sh

source ~/.bashrc

hadoop fs -rm -r hdfs://10.10.41.71:9000/output
hadoop jar $HADOOP_DEV_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.1.jar wordcount hdfs://10.10.41.71:9000/input hdfs://10.10.41.71:9000/output
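
What wordcount produces can be mimicked locally with standard Unix tools, which is a handy sanity check against a small sample of the input (a sketch; this is a stand-in for, not part of, the MapReduce job):

```shell
# Count word frequencies the Unix way; the MapReduce output in
# /output/part-r-00000 contains the same word/count pairs (tab-separated).
printf 'hello world\nhello yarn\n' > /tmp/wc_sample.txt
tr -s ' ' '\n' < /tmp/wc_sample.txt | sort | uniq -c | sort -rn
```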

The cluster status can be viewed through the web UIs:

ResourceManager http://10.10.41.81:8088/cluster

NodeManager http://10.10.41.71:9999/node   http://10.10.41.80:9999/node   http://10.10.41.81:9999/node

NameNode http://10.10.41.71:23001/dfshealth.jsp

SecondaryNameNode http://10.10.41.71:23002/status.jsp

NameNode http://10.10.41.80:23001/dfshealth.jsp

SecondaryNameNode http://10.10.41.80:23002/status.jsp

JobHistory http://10.10.41.81:19888/jobhistory

The cluster page looks like this:

[Screenshot: ResourceManager cluster web UI]

6. Miscellaneous

References: http://blog.sina.com.cn/s/blog_4a1f59bf010116rh.html

Cloudera's post "Deploying MapReduce v2 (YARN) on a Cluster"

http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
