Intro

After joining the lab I gained access to more servers, so I followed the tutorial "Hadoop集群安装配置教程_Hadoop3.1.3_Ubuntu" to install and deploy a real Hadoop cluster. The basic steps follow that tutorial, but quite a few things were changed to fit the conditions in my lab.

First, to match the lab's existing Hadoop platform, I switched the target version from Hadoop 3.1.3 to Hadoop 2.7.7.

Overview

Excerpted from Prof. Lin Ziyu's tutorial:

When Hadoop is deployed and run in distributed mode, storage uses the distributed file system HDFS, and the HDFS name node and data nodes sit on different machines. Data can then be spread across multiple nodes, computation on different data nodes can run in parallel, and only then does the distributed computing power of MapReduce really come into play.
To keep the distributed deployment manageable, the tutorial simply uses two nodes (two physical machines) to build the cluster: one machine as the Master node with LAN IP x.x.x.x, and the other as the Slave node with LAN IP x.x.x.y. A cluster of three or more nodes can be installed and deployed in much the same way.
Installing and configuring a Hadoop cluster roughly involves the following steps:
(1) Pick one machine as the Master;
(2) On the Master node, create a hadoop user, install the SSH server, and install a Java environment;
(3) Install Hadoop on the Master node and finish its configuration;
(4) On every other Slave node, create a hadoop user, install the SSH server, and install a Java environment;
(5) Copy the “/usr/local/hadoop” directory from the Master node to the other Slave nodes;
(6) Start Hadoop on the Master node.

Lab server environment

The gateway of the lab servers is reachable over SSH from the campus network. The administrator created an account "lpj" for me, set a login password, and gave me the gateway's IP, so I can connect with ssh lpj@x.x.x.x. The gateway has a /home/lpj directory; it is essentially just another Ubuntu machine, except that I reach it remotely over SSH. On my PC I use MobaXterm on Windows 10 for the SSH connection, which works quite nicely. After connecting, the prompt becomes:

[lpj@gateway ~]$

The contents of /etc/hostname there are simply:

gateway

From the gateway I can then SSH directly into the server nodes, but the administrator also had to create my account on those nodes. Since a Hadoop cluster needs at least two server nodes (two different IP addresses), the administrator created the "lpj" account for me on two CPU servers. Effectively I now have two more Ubuntu machines that I log into with the lpj account, and each server has a /home/lpj directory. The lab's configuration is fairly mature, so the hostnames were presumably already set up. Checking /etc/hosts on each server, both contain the following:

127.0.0.1   localhost

192.168.232.100 cpu-node0
192.168.232.103 cpu-node3

And /etc/hostname on each:

cpu-node0

cpu-node3

Here cpu-node0 and cpu-node3 are the two servers the lab assigned to me. For this setup I designate cpu-node0 as the Master and cpu-node3 as the Slave (I will not rename them in the hosts file: there is no need to, and I have no root permission anyway). From the gateway, ssh cpu-node0 and ssh cpu-node3 connect me into the two servers, and my prompts become:

[lpj@cpu-node0 ~]$

[lpj@cpu-node3 ~]$

Then check that the two nodes can ping each other:

[lpj@cpu-node0 ~]$ ping cpu-node3 -c 3
PING cpu-node3 (192.168.232.103) 56(84) bytes of data.
64 bytes from cpu-node3 (192.168.232.103): icmp_seq=1 ttl=64 time=0.289 ms
64 bytes from cpu-node3 (192.168.232.103): icmp_seq=2 ttl=64 time=0.274 ms
64 bytes from cpu-node3 (192.168.232.103): icmp_seq=3 ttl=64 time=0.253 ms

--- cpu-node3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.253/0.272/0.289/0.014 ms

[lpj@cpu-node3 ~]$ ping cpu-node0 -c 3
PING cpu-node0 (192.168.232.100) 56(84) bytes of data.
64 bytes from cpu-node0 (192.168.232.100): icmp_seq=1 ttl=64 time=0.347 ms
64 bytes from cpu-node0 (192.168.232.100): icmp_seq=2 ttl=64 time=0.324 ms
64 bytes from cpu-node0 (192.168.232.100): icmp_seq=3 ttl=64 time=0.255 ms

--- cpu-node0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.255/0.308/0.347/0.044 ms

With that, the network configuration, which had already been set up in the lab environment, is verified.

Passwordless SSH login between nodes

The Master node must be able to SSH into every Slave node without a password. First, generate the Master node's public key. Since my account is newly created and has never generated a key pair, there is nothing old to delete, but an empty .ssh directory already exists:

[lpj@cpu-node0 .ssh]$ ls -a
. ..

So just enter the .ssh directory and generate the keys:

cd ~/.ssh
ssh-keygen -t rsa    # just press Enter at every prompt

When it finishes, two files have been generated:

[lpj@cpu-node0 .ssh]$ ls -a
. .. id_rsa id_rsa.pub

A single command copies cpu-node0's public key to cpu-node3 and activates it there:

ssh-copy-id -i /home/lpj/.ssh/id_rsa.pub lpj@cpu-node3

It asks for the password once, and then the key is added successfully:

[lpj@cpu-node0 .ssh]$ ssh-copy-id -i /home/lpj/.ssh/id_rsa.pub lpj@cpu-node3
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/lpj/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
lpj@cpu-node3's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'lpj@cpu-node3'"
and check to make sure that only the key(s) you wanted were added.

A quick test shows that passwordless login works:

[lpj@cpu-node0 .ssh]$ ssh cpu-node3
Last login: Tue Sep 8 13:51:48 2020 from cpu-node0
[lpj@cpu-node3 ~]$

Note: repeat the passwordless setup for cpu-node3 → cpu-node0 and for cpu-node0 → cpu-node0 (itself); otherwise you will get a "Permission denied" when starting the daemons later. See the sketch below.
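A minimal sketch of those two extra steps, assuming the same key locations as above (run the first part on cpu-node3, the second on cpu-node0):

# on cpu-node3: generate a key pair (if there is none yet) and allow cpu-node3 -> cpu-node0
ssh-keygen -t rsa
ssh-copy-id -i /home/lpj/.ssh/id_rsa.pub lpj@cpu-node0

# on cpu-node0: allow cpu-node0 -> cpu-node0 (login to itself)
ssh-copy-id -i /home/lpj/.ssh/id_rsa.pub lpj@cpu-node0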

Local Hadoop installation

Following the steps in "Hadoop3.1.3安装教程:单机&伪分布式配置", the downloaded hadoop-2.7.7.tar.gz first has to be uploaded to the servers. Step one is to use MobaXterm's file-transfer feature to copy hadoop-2.7.7.tar.gz from my PC to the gateway, then check that it arrived:

[lpj@gateway ~]$ ls -a
. .. .bash_history .bash_logout .bash_profile .bashrc .cache .config hadoop-2.7.7.tar.gz .local .mozilla .ssh .Xauthority

Next, copy it from the gateway to the Master server with scp hadoop-2.7.7.tar.gz lpj@cpu-node0:/home/lpj/ (it will be packed up and sent to the Slave only after the configuration is finished):

[lpj@cpu-node0 ~]$ ls -a
. .. .bash_history .bash_logout .bash_profile .bashrc .cache .config .local .mozilla .ssh
[lpj@cpu-node0 ~]$ exit
logout
Connection to cpu-node0 closed.
[lpj@gateway ~]$ scp hadoop-2.7.7.tar.gz lpj@cpu-node0:/home/lpj/
lpj@cpu-node0's password:
hadoop-2.7.7.tar.gz 100% 322MB 6.6MB/s 00:49
[lpj@gateway ~]$ ssh cpu-node0
lpj@cpu-node0's password:
Last login: Tue Sep 8 10:49:09 2020 from gateway
[lpj@cpu-node0 ~]$ ls -a
. .. .bash_history .bash_logout .bash_profile .bashrc .cache .config hadoop-2.7.7.tar.gz .local .mozilla .ssh

Then extract it in place and rename the directory (without root I won't extract into /usr/local):

tar -zxf hadoop-2.7.7.tar.gz -C /home/lpj
mv ./hadoop-2.7.7/ ./hadoop2.7

Also, because my account is new, the JAVA_HOME environment variable has to be configured; add the following to .bashrc:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Make it take effect and check the Java version:

[lpj@cpu-node0 ~]$ source ~/.bashrc
[lpj@cpu-node0 ~]$ java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

The same Java environment configuration also has to be done on the Slave; one possible way to do it from the Master is sketched below.
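A hedged sketch, run on cpu-node0 and assuming cpu-node3 has the same OpenJDK path (check /usr/lib/jvm on the Slave first):

# append the same Java settings to ~/.bashrc on the Slave
ssh cpu-node3 'cat >> ~/.bashrc' <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
EOF

Then run source ~/.bashrc on cpu-node3 and confirm with java -version as above.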

Then go into the hadoop directory and check the Hadoop version:

[lpj@cpu-node0 hadoop]$ ./bin/hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/lpj/hadoop2.7/share/hadoop/common/hadoop-common-2.7.7.jar

The default non-distributed (local) mode is now installed and runs without any further configuration. Note that both servers need to go through the steps above, each completing its own local-mode Hadoop installation; a sketch for the Slave follows.
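A hedged sketch of repeating the transfer and extraction for the Slave, run from the gateway (it will prompt for the password, since no key is set up from the gateway to cpu-node3):

scp hadoop-2.7.7.tar.gz lpj@cpu-node3:/home/lpj/
ssh lpj@cpu-node3 'tar -zxf /home/lpj/hadoop-2.7.7.tar.gz -C /home/lpj && mv /home/lpj/hadoop-2.7.7 /home/lpj/hadoop2.7'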

In addition, to be able to run hadoop, hdfs, and related commands from any directory, configure the PATH variable. On the Master node (cpu-node0), add the following line at the very top of the .bashrc file:

export PATH=$PATH:/home/lpj/hadoop2.7/bin:/home/lpj/hadoop2.7/sbin
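As with JAVA_HOME above, reload .bashrc so the new PATH takes effect in the current shell:

source ~/.bashrc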

Then test it with the hadoop command:

[lpj@cpu-node0 ~]$ hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/lpj/hadoop2.7/share/hadoop/common/hadoop-common-2.7.7.jar

Configuring the cluster / distributed environment

Configuring cluster/distributed mode means editing the configuration files under /hadoop2.7/etc/hadoop. Only the settings required for a normal startup are configured here, spread over 5 files: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Further settings can wait until later.

Edit the slaves file

This file must list the hostname of every data node (DataNode), one per line. The default is localhost (i.e., the local machine acts as a data node), which is exactly what the pseudo-distributed configuration relies on, making a single node serve as both name node (NameNode) and data node. In a distributed setup you can keep localhost so the Master acts as both name node and data node, or delete the localhost line so the Master serves only as the name node.

In this installation the Master node serves only as the name node, so delete the original localhost entry from the slaves file and add just this single line:

cpu-node3
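Equivalently, from the shell (a minimal sketch, assuming the install path used above):

echo cpu-node3 > /home/lpj/hadoop2.7/etc/hadoop/slaves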

Edit core-site.xml

Change core-site.xml to the following:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://cpu-node0:9000</value>
    </property>
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>
        <description>The number of seconds between two periodic checkpoints.
        </description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/lpj/hadoop2.7/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.checkpoint.size</name>
        <value>67108864</value>
        <description>The size of the current edit log (in bytes) that triggers
        a periodic checkpoint even if the fs.checkpoint.period hasn't expired.
        </description>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
</configuration>

Edit hdfs-site.xml

HDFS normally stores data redundantly, with a replication factor of 3, i.e., each block is kept in three copies. Here, however, only one Slave node acts as a data node, so the cluster has a single data node, data can only be stored once, and dfs.replication is therefore set to 1. hdfs-site.xml looks like this:

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>cpu-node0:50049</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/lpj/hadoop2.7/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/lpj/hadoop2.7/tmp/dfs/data</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

Edit mapred-site.xml

The “/hadoop2.7/etc/hadoop” directory only ships with mapred-site.xml.template, which has to be renamed to mapred-site.xml before editing its configuration.
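A minimal sketch of the rename, assuming the install path used above:

cd /home/lpj/hadoop2.7/etc/hadoop
mv mapred-site.xml.template mapred-site.xml

Then set the file's contents to: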

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>cpu-node0:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>cpu-node0:19888</value>
    </property>
</configuration>

Edit yarn-site.xml

Configure yarn-site.xml as follows:

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>cpu-node0</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
        <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

After all 5 files are configured, the "/home/lpj/hadoop2.7" folder on the Master node has to be copied to every other node. If pseudo-distributed mode has been run before, it is best to delete the temporary files it generated before switching to cluster mode: concretely, remove the tmp and logs folders inside the Hadoop root directory (if they don't exist, there is nothing to do), for example as sketched below.
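A minimal sketch of that cleanup on the Master (the paths assume the install location used above):

rm -rf /home/lpj/hadoop2.7/tmp /home/lpj/hadoop2.7/logs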

Startup

Pack it up and transfer it with scp:

tar -zcf ~/hadoop2.7.master.tar.gz -C /home/lpj hadoop2.7    # -C stores a relative path, so it unpacks to /home/lpj/hadoop2.7
scp ~/hadoop2.7.master.tar.gz cpu-node3:/home/lpj

Then run the following on the Slave node (cpu-node3):

rm -rf /home/lpj/hadoop2.7    # remove the old copy (if it exists)
tar -zxf ~/hadoop2.7.master.tar.gz -C /home/lpj

Likewise, if there were more Slave nodes, the same steps of transferring hadoop2.7.master.tar.gz to each Slave and unpacking it there would be needed; a sketch follows.
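A hedged sketch of scripting that for additional Slaves, run on the Master (the hostnames cpu-node4 and cpu-node5 here are purely hypothetical):

for node in cpu-node4 cpu-node5; do
  scp ~/hadoop2.7.master.tar.gz "$node":/home/lpj
  ssh "$node" 'rm -rf /home/lpj/hadoop2.7 && tar -zxf /home/lpj/hadoop2.7.master.tar.gz -C /home/lpj'
done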

The first time the Hadoop cluster is started, the name node has to be formatted on the Master node (do this only once; do not format the name node again on later startups). The command is:

hdfs namenode -format

If it succeeds, you will see a "successfully formatted" message; the output includes a line like:

2020-09-08 16:18:32,989 INFO common.Storage: Storage directory /home/lpj/hadoop/tmp/dfs/name has been successfully formatted.

Now Hadoop can be started. Startup is done on the Master node with the following commands:

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

The first two start scripts can also be replaced by a single command:

start-all.sh

This command runs all the start scripts, so it also prints a warning like "This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh". Unless you are just saving keystrokes during repeated debugging, it is better not to take this shortcut.

debug

When checking processes with jps, it is best to check script by script. After start-dfs.sh finished, I ran jps on both servers to see what had started:

[lpj@cpu-node0 ~]$ jps
34806 NameNode
35095 SecondaryNameNode
35294 Jps

[lpj@cpu-node3 ~]$ jps
7361 Jps

Great: the DataNode did not start on cpu-node3. Time to open the log file hadoop2.7/logs/hadoop-lpj-datanode-cpu-node3.hustlab.log on cpu-node3:

2020-09-16 15:38:10,678 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
......
Caused by: java.net.BindException: Address already in use

It says the address is already in use. Some digging revealed that the lab already runs another Hadoop deployment that uses cpu-node3 as a slave node, and that deployment basically uses the default address and port configuration, so my new Hadoop deployment has to use custom addresses and ports to stay out of its way (a port check is sketched after the snippet below). Edit hdfs-site.xml and add:

<property>
    <name>dfs.datanode.address</name>
    <value>cpu-node3:50011</value>
</property>
<property>
    <name>dfs.datanode.http.address</name>
    <value>cpu-node3:50076</value>
</property>
<property>
    <name>dfs.datanode.ipc.address</name>
    <value>cpu-node3:50021</value>
</property>
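Before picking replacement ports, it can help to confirm which of the default DataNode ports (50010, 50075, 50020) are actually taken on cpu-node3; a hedged check, assuming ss is available:

ss -lnt | grep -E ':50010|:50075|:50020'    # run on cpu-node3; any hit means that port is occupied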

Then stop the previously started processes with stop-dfs.sh and run start-dfs.sh again to restart:

[lpj@cpu-node0 ~]$ start-dfs.sh
Starting namenodes on [cpu-node0]
cpu-node0: starting namenode, logging to /home/lpj/hadoop2.7/logs/hadoop-lpj-namenode-cpu-node0.hustlab.out
cpu-node3: starting datanode, logging to /home/lpj/hadoop2.7/logs/hadoop-lpj-datanode-cpu-node3.hustlab.out
Starting secondary namenodes [cpu-node0]
cpu-node0: starting secondarynamenode, logging to /home/lpj/hadoop2.7/logs/hadoop-lpj-secondarynamenode-cpu-node0.hustlab.out

Now the DataNode process shows up on cpu-node3.

Then run start-yarn.sh:

[lpj@cpu-node0 ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/lpj/hadoop2.7/logs/yarn-lpj-resourcemanager-cpu-node0.hustlab.out
cpu-node3: starting nodemanager, logging to /home/lpj/hadoop2.7/logs/yarn-lpj-nodemanager-cpu-node3.hustlab.out

Check cpu-node0 and cpu-node3 again with jps:

[lpj@cpu-node0 ~]$ jps
39570 NameNode
39847 SecondaryNameNode
40696 Jps
40365 ResourceManager

[lpj@cpu-node3 ~]$ jps
7961 DataNode
8413 Jps
8190 NodeManager

Seeing ResourceManager and NodeManager means the startup succeeded.

Finally, start the JobHistoryServer with mr-jobhistory-daemon.sh start historyserver:

[lpj@cpu-node0 ~]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/lpj/hadoop2.7/logs/mapred-lpj-historyserver-cpu-node0.hustlab.out

Here it is enough to check cpu-node0:

[lpj@cpu-node0 ~]$ jps
45763 Jps
45704 JobHistoryServer
45002 SecondaryNameNode
45243 ResourceManager
44655 NameNode

check

Summary: if everything started correctly, jps on the Master node shows the NameNode, ResourceManager, SecondaryNameNode, and JobHistoryServer processes, and on the Slave node it shows DataNode and NodeManager. If any process is missing, something went wrong; go back and check that process's log file.
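A quick hedged one-liner from the Master to check both nodes in one go (assuming jps is on the PATH for non-interactive shells on both machines):

for h in cpu-node0 cpu-node3; do echo "== $h =="; ssh "$h" jps; done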

Also run "hdfs dfsadmin -report" on the Master node to check that the data nodes came up properly: if "Live datanodes" in the output is not 0, the cluster started successfully. Since this setup has only 1 Slave node acting as a data node, the output after a successful start looks like this:

[lpj@cpu-node0 ~]$ hdfs dfsadmin -report
Configured Capacity: 5366572122112 (4.88 TB)
Present Capacity: 5282481352704 (4.80 TB)
DFS Remaining: 5282481340416 (4.80 TB)
DFS Used: 12288 (12 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 192.168.232.103:50011 (cpu-node3)
Hostname: cpu-node3
Decommission Status : Normal
Configured Capacity: 5366572122112 (4.88 TB)
DFS Used: 12288 (12 KB)
Non DFS Used: 84090769408 (78.32 GB)
DFS Remaining: 5282481340416 (4.80 TB)
DFS Used%: 0.00%
DFS Remaining%: 98.43%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Sep 16 16:19:38 CST 2020

Running a distributed example

First create the user directory on HDFS:

hdfs dfs -mkdir -p /user/lpj    # -p also creates /user if it does not exist yet

Then create an input directory in HDFS and copy the configuration files from "/home/lpj/hadoop2.7/etc/hadoop" into it as the input files:

hdfs dfs -mkdir /user/lpj/input
hdfs dfs -put /home/lpj/hadoop2.7/etc/hadoop/*.xml /user/lpj/input

Use hdfs dfs -ls [path] to list the contents of a directory:

[lpj@cpu-node0 ~]$ hdfs dfs -ls /user/lpj
Found 1 items
drwxr-xr-x - lpj supergroup 0 2020-09-16 16:43 /user/lpj/input
[lpj@cpu-node0 ~]$ hdfs dfs -ls /user/lpj/input
Found 9 items
-rw-r--r-- 1 lpj supergroup 4436 2020-09-16 16:46 /user/lpj/input/capacity-scheduler.xml
-rw-r--r-- 1 lpj supergroup 1779 2020-09-16 16:46 /user/lpj/input/core-site.xml
-rw-r--r-- 1 lpj supergroup 9683 2020-09-16 16:46 /user/lpj/input/hadoop-policy.xml
-rw-r--r-- 1 lpj supergroup 1713 2020-09-16 16:46 /user/lpj/input/hdfs-site.xml
-rw-r--r-- 1 lpj supergroup 620 2020-09-16 16:46 /user/lpj/input/httpfs-site.xml
-rw-r--r-- 1 lpj supergroup 3518 2020-09-16 16:46 /user/lpj/input/kms-acls.xml
-rw-r--r-- 1 lpj supergroup 5540 2020-09-16 16:46 /user/lpj/input/kms-site.xml
-rw-r--r-- 1 lpj supergroup 1109 2020-09-16 16:46 /user/lpj/input/mapred-site.xml
-rw-r--r-- 1 lpj supergroup 1245 2020-09-16 16:46 /user/lpj/input/yarn-site.xml

Now the MapReduce job can be run:

hadoop jar /home/lpj/hadoop2.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep /user/lpj/input /user/lpj/output 'dfs[a-z.]+'

While running, it prints the progress of the MapReduce job:

[lpj@cpu-node0 ~]$ hadoop jar /home/lpj/hadoop2.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep /user/lpj/input /user/lpj/output 'dfs[a-z.]+'
20/09/16 16:47:54 INFO client.RMProxy: Connecting to ResourceManager at cpu-node0/192.168.232.100:8032
20/09/16 16:47:55 INFO input.FileInputFormat: Total input paths to process : 9
20/09/16 16:47:56 INFO mapreduce.JobSubmitter: number of splits:9
20/09/16 16:47:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1600243226377_0001
20/09/16 16:47:56 INFO impl.YarnClientImpl: Submitted application application_1600243226377_0001
20/09/16 16:47:56 INFO mapreduce.Job: The url to track the job: http://cpu-node0:8088/proxy/application_1600243226377_0001/
20/09/16 16:47:56 INFO mapreduce.Job: Running job: job_1600243226377_0001
20/09/16 16:48:05 INFO mapreduce.Job: Job job_1600243226377_0001 running in uber mode : false
20/09/16 16:48:05 INFO mapreduce.Job: map 0% reduce 0%
20/09/16 16:48:09 INFO mapreduce.Job: map 56% reduce 0%
20/09/16 16:48:10 INFO mapreduce.Job: map 67% reduce 0%
20/09/16 16:48:13 INFO mapreduce.Job: map 100% reduce 0%
20/09/16 16:48:14 INFO mapreduce.Job: map 100% reduce 100%
20/09/16 16:48:15 INFO mapreduce.Job: Job job_1600243226377_0001 completed successfully
20/09/16 16:48:15 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=285
FILE: Number of bytes written=1235413
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=30693
HDFS: Number of bytes written=419
HDFS: Number of read operations=30
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=9
Launched reduce tasks=1
Data-local map tasks=9
Total time spent by all maps in occupied slots (ms)=23940
Total time spent by all reduces in occupied slots (ms)=2649
Total time spent by all map tasks (ms)=23940
Total time spent by all reduce tasks (ms)=2649
Total vcore-milliseconds taken by all map tasks=23940
Total vcore-milliseconds taken by all reduce tasks=2649
Total megabyte-milliseconds taken by all map tasks=24514560
Total megabyte-milliseconds taken by all reduce tasks=2712576
Map-Reduce Framework
Map input records=854
Map output records=9
Map output bytes=261
Map output materialized bytes=333
Input split bytes=1050
Combine input records=9
Combine output records=9
Reduce input groups=9
Reduce shuffle bytes=333
Reduce input records=9
Reduce output records=9
Spilled Records=18
Shuffled Maps =9
Failed Shuffles=0
Merged Map outputs=9
GC time elapsed (ms)=791
CPU time spent (ms)=6020
Physical memory (bytes) snapshot=2622357504
Virtual memory (bytes) snapshot=21854842880
Total committed heap usage (bytes)=1958215680
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=29643
File Output Format Counters
Bytes Written=419
20/09/16 16:48:15 INFO client.RMProxy: Connecting to ResourceManager at cpu-node0/192.168.232.100:8032
20/09/16 16:48:15 INFO input.FileInputFormat: Total input paths to process : 1
20/09/16 16:48:15 INFO mapreduce.JobSubmitter: number of splits:1
20/09/16 16:48:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1600243226377_0002
20/09/16 16:48:15 INFO impl.YarnClientImpl: Submitted application application_1600243226377_0002
20/09/16 16:48:15 INFO mapreduce.Job: The url to track the job: http://cpu-node0:8088/proxy/application_1600243226377_0002/
20/09/16 16:48:15 INFO mapreduce.Job: Running job: job_1600243226377_0002
20/09/16 16:48:25 INFO mapreduce.Job: Job job_1600243226377_0002 running in uber mode : false
20/09/16 16:48:25 INFO mapreduce.Job: map 0% reduce 0%
20/09/16 16:48:30 INFO mapreduce.Job: map 100% reduce 0%
20/09/16 16:48:35 INFO mapreduce.Job: map 100% reduce 100%
20/09/16 16:48:36 INFO mapreduce.Job: Job job_1600243226377_0002 completed successfully
20/09/16 16:48:36 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=285
FILE: Number of bytes written=246463
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=547
HDFS: Number of bytes written=207
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2145
Total time spent by all reduces in occupied slots (ms)=2290
Total time spent by all map tasks (ms)=2145
Total time spent by all reduce tasks (ms)=2290
Total vcore-milliseconds taken by all map tasks=2145
Total vcore-milliseconds taken by all reduce tasks=2290
Total megabyte-milliseconds taken by all map tasks=2196480
Total megabyte-milliseconds taken by all reduce tasks=2344960
Map-Reduce Framework
Map input records=9
Map output records=9
Map output bytes=261
Map output materialized bytes=285
Input split bytes=128
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=285
Reduce input records=9
Reduce output records=9
Spilled Records=18
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=144
CPU time spent (ms)=1490
Physical memory (bytes) snapshot=444129280
Virtual memory (bytes) snapshot=4380962816
Total committed heap usage (bytes)=346030080
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=419
File Output Format Counters
Bytes Written=207

Use hdfs dfs -cat [path] to view the output once the job has finished:

[lpj@cpu-node0 ~]$ hdfs dfs -cat /user/lpj/output/*
1 dfsadmin
1 dfs.webhdfs.enabled
1 dfs.replication
1 dfs.namenode.secondary.http
1 dfs.namenode.name.dir
1 dfs.datanode.ipc.address
1 dfs.datanode.http.address
1 dfs.datanode.data.dir
1 dfs.datanode.address

The output directory must not exist when running a job

When running a Hadoop program, the output directory it specifies (e.g., output) must not already exist, to avoid overwriting results; otherwise an error is reported, so the output directory has to be deleted before each run. When developing your own application, you can add code like the following so the output directory is removed automatically on every run, saving the tedious command-line step:

Configuration conf = new Configuration();
Job job = new Job(conf);
/* delete the output directory */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);
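From the command line, the same effect comes from removing the directory before re-running the job:

hdfs dfs -rm -r /user/lpj/output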

Finish

With that, the Hadoop cluster setup is complete. To shut the cluster down, run the following on the Master node:

stop-yarn.sh
stop-dfs.sh
mr-jobhistory-daemon.sh stop historyserver