```
hadoop@cuper-Inspiron-7591:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:KdaSrcKwo3oMh/O692SC2cA0XUMScgb/mSXHb3iEI3g.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Host key verification failed.
```
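The "Host key verification failed." at the end just means the prompt was not answered with `yes`; typing `yes` (as in the retry below) records the host key. Alternatively, as a small sketch using the standard `ssh-keyscan` tool (not part of the original transcript), the key can be accepted non-interactively:

```
mkdir -p ~/.ssh                               # make sure the directory exists
ssh-keyscan localhost >> ~/.ssh/known_hosts   # record localhost's host key without prompting
```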
```
hadoop@cuper-Inspiron-7591:~$ ssh localhost
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hadoop@localhost's password:
Welcome to Ubuntu 20.04 LTS (GNU/Linux 5.4.0-40-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

1 device has a firmware upgrade available.
Run `fwupdmgr get-upgrades` for more information.

91 updates can be installed immediately.
0 of these updates are security updates.
To see these additional updates run: apt list --upgradable

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
```
```
cd ~/.ssh/
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
```
The terminal output looks roughly like this; for every prompt from the `ssh-keygen -t rsa` command, just press Enter to accept the default:
```
hadoop@cuper-Inspiron-7591:~$ cd ~/.ssh/
hadoop@cuper-Inspiron-7591:~/.ssh$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
xxxxxxxxxxxxxxxxxxxxxxxx hadoop@cuper-Inspiron-7591
The key's randomart image is:
+---[RSA 3072]----+
|        .o .o= o |
|            o++ .|
+----[SHA256]-----+
hadoop@cuper-Inspiron-7591:~/.ssh$ cat ./id_rsa.pub >> ./authorized_keys
hadoop@cuper-Inspiron-7591:~/.ssh$
```

Pretty neat; the randomart almost has QR-code vibes.
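If `ssh localhost` still asks for a password after this, the usual culprit is file permissions, since sshd refuses keys in group- or world-writable locations. A minimal sketch of the fix (these chmod steps are not in the original transcript):

```
chmod 700 ~/.ssh                   # sshd rejects keys if the directory is too open
chmod 600 ~/.ssh/authorized_keys   # same for the key file itself
ssh localhost                      # should now log in without a password prompt
```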
```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ./bin/hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.3.jar
```
```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
```
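Each of these example names takes its own arguments; running one with no arguments prints its usage. As a quick sketch (the usage line in the comment is what the examples jar typically prints, not copied from my run):

```
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount
# Usage: wordcount <in> [<in>...] <out>
```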
```
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # use the config files as the input files
```
After the copy, the input folder looks roughly like this:
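A stock Hadoop 3.1.3 etc/hadoop directory typically ships these nine XML files, so the listing should look something like the sketch below (filenames from a standard install, not from my screenshot):

```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ls ./input
capacity-scheduler.xml  hadoop-policy.xml  httpfs-site.xml  kms-site.xml     yarn-site.xml
core-site.xml           hdfs-site.xml      kms-acls.xml     mapred-site.xml
```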
Filter out the words matching the regular expression `dfs[a-z.]+` and count how many times each occurs; the run looks roughly like this:
```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
2020-07-21 11:49:52,083 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2020-07-21 11:49:52,141 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-07-21 11:49:52,141 INFO impl.MetricsSystemImpl: JobTracker metrics system started
```
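Once the job finishes, the matches land in ./output. Viewing them, plus the usual caveat that Hadoop refuses to overwrite an existing output directory, looks roughly like this (the `1 dfsadmin` result in the comment is what the stock config files usually yield, not copied from my run):

```
cat ./output/*    # with the stock configs this typically prints:  1  dfsadmin
rm -r ./output    # Hadoop will not overwrite ./output, so delete it before re-running
```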
```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ./bin/hdfs namenode -format
WARNING: /usr/local/hadoop/logs does not exist. Creating.
2020-07-21 13:01:34,335 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cuper-Inspiron-7591/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.3
****************** (a long stretch of log output omitted) *********************
2020-07-21 13:01:34,848 INFO common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
2020-07-21 13:01:34,871 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-07-21 13:01:34,924 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 393 bytes saved in 0 seconds .
2020-07-21 13:01:34,933 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-07-21 13:01:34,935 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2020-07-21 13:01:34,935 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cuper-Inspiron-7591/127.0.1.1
************************************************************/
```
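The "has been successfully formatted" line is the one that matters. As a quick sanity check (a sketch; the path comes from the log above, and the file list in the comment is the usual layout of a freshly formatted 3.x NameNode, quoted from memory):

```
ls /usr/local/hadoop/tmp/dfs/name/current
# fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION
```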
Next, start the NameNode and DataNode daemons.
```
cd /usr/local/hadoop
./sbin/start-dfs.sh   # start-dfs.sh is a single executable file; there are no spaces in its name
```
Here, of all places, I ran into the error the reference article had mentioned in its previous step: **ERROR: JAVA_HOME is not set and could not be found.** Following the tutorial, I went to the Hadoop install directory to edit the config file /usr/local/hadoop/etc/hadoop/hadoop-env.sh, looking for the line `export JAVA_HOME=${JAVA_HOME}`, but what I actually found was:
```
#export JAVA_HOME=
```
So I removed the comment marker and filled in my Java installation path:
```
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_191
```
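A quick way to confirm the path is right before re-running the script (a sketch; /usr/lib/jvm/jdk1.8.0_191 is my install location, adjust it to wherever your JDK actually lives):

```
ls /usr/lib/jvm/jdk1.8.0_191/bin/java         # the java binary should exist here
/usr/lib/jvm/jdk1.8.0_191/bin/java -version   # and should report a 1.8 JDK
```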
Then I ran the startup script again (without restarting Hadoop); the warnings can safely be ignored:
```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ ./sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting secondary namenodes [cuper-Inspiron-7591]
cuper-Inspiron-7591: Warning: Permanently added 'cuper-inspiron-7591' (ECDSA) to the list of known hosts.
```
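To confirm the daemons actually came up, the JDK's `jps` tool should list a NameNode, a DataNode, and a SecondaryNameNode, roughly like the sketch below (the PIDs are made up for illustration):

```
hadoop@cuper-Inspiron-7591:/usr/local/hadoop$ jps
12321 NameNode
12487 DataNode
12690 SecondaryNameNode
12855 Jps
```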