Hadoop

1 Installation and Configuration

1.1 Installing and Configuring the Raspberry Pis

(1). Hardware and software prerequisites

  1. Raspberry Pi boards

  2. An SD card formatting tool

  3. The official Raspberry Pi imager

  4. A Raspberry Pi operating system; the official OS is recommended

(2). Flash the image

image-20220613092648337

Configure SSH and Wi-Fi (set the hostnames to master, slave01, and slave02 respectively)

image-20220613092723152

image-20220613092747268

(3). Connect and configure the network

Open a hotspot app on your phone; it lets you see each Raspberry Pi's IP address.

image-20220613111801789

Then run ssh nudt@192.168.225.211 and enter the password to log in over SSH.

1.2 Installing and Configuring the JDK

(1). Transfer the downloaded jdk-8u241-linux-arm64-vfp-hflt.tar.gz to all three Raspberry Pis via Termius.

image-20220613113424523

(2). Extract

tar -zxvf jdk-8u241-linux-arm64-vfp-hflt.tar.gz
sudo mkdir /usr/lib/jvm/
sudo mv jdk1.8.0_241/ /usr/lib/jvm/

(3). Configure environment variables

The file to edit is /etc/profile:

# sudo vim /etc/profile
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_241
export CLASSPATH=".:$JAVA_HOME/lib:$CLASSPATH"
export PATH="$JAVA_HOME/bin:$PATH"

Apply the changes:

source /etc/profile

(4). Set the system default JDK

sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_241/bin/java 300 
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_241/bin/javac 300
sudo update-alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.8.0_241/bin/jar 300
sudo update-alternatives --install /usr/bin/javah javah /usr/lib/jvm/jdk1.8.0_241/bin/javah 300
sudo update-alternatives --install /usr/bin/javap javap /usr/lib/jvm/jdk1.8.0_241/bin/javap 300
sudo update-alternatives --config java

(5). Verify that Java is installed correctly

image-20220613115053096
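In case the screenshot is unclear, the following commands should confirm the setup, assuming the paths used above:

java -version      # should report version "1.8.0_241"
javac -version
echo $JAVA_HOME    # should print /usr/lib/jvm/jdk1.8.0_241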

1.3 Installing and Configuring Hadoop

(1). Download (only needed on master)

Download it directly:

wget --no-check-certificate https://repo.huaweicloud.com/apache/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz

Alternatively, you can use the tarball I provide.

(2). Extract (only needed on master)

tar -zxvf hadoop-3.3.2.tar.gz
mv hadoop-3.3.2 ~/hadoop

(3). Configure Hadoop's environment variables, then run source /etc/profile.

sudo vim /etc/profile
export HADOOP_HOME=/home/nudt/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH

(4). Verify that Hadoop is installed correctly (only needed on master)

$ hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class

OPTIONS is none or any of:

buildpaths attempt to add class files from build tree
--config dir Hadoop config directory
--debug turn on shell script debug mode
--help usage information
hostnames list[,of,host,names] hosts to use in slave mode
hosts filename list of hosts to use in slave mode
loglevel level set the log4j level for this command
workers turn on worker mode

SUBCOMMAND is one of:


Admin Commands:

daemonlog get/set the log level for each daemon

Client Commands:

archive create a Hadoop archive
checknative check native Hadoop and compression libraries availability
classpath prints the class path needed to get the Hadoop jar and the
required libraries
conftest validate configuration XML files
credential interact with credential providers
distch distributed metadata changer
distcp copy file or directories recursively
dtutil operations related to delegation tokens
envvars display computed Hadoop environment variables
fs run a generic filesystem user client
gridmix submit a mix of synthetic job, modeling a profiled from
production load
jar <jar> run a jar file. NOTE: please use "yarn jar" to launch YARN
applications, not this command.
jnipath prints the java.library.path
kdiag Diagnose Kerberos Problems
kerbname show auth_to_local principal conversion
key manage keys via the KeyProvider
rumenfolder scale a rumen input trace
rumentrace convert logs into a rumen trace
s3guard manage metadata on S3
trace view and modify Hadoop tracing settings
version print the version

Daemon Commands:

kms run KMS, the Key Management Server
registrydns run the registry DNS server

SUBCOMMAND may print help when invoked w/o parameters or with -h.

(5). :star: Set the hostnames and configure the network mapping

The hostnames were already set during the initial imaging step, so this is only a recap: sudo vim /etc/hostname

master, slave01, or slave02 respectively

Edit the hosts file with sudo vim /etc/hosts:

192.168.239.28 master
192.168.239.211 slave01
192.168.239.254 slave02

Edit the network-mapping template with sudo vim /etc/cloud/templates/hosts.debian.tmpl (Note!!! Keep only the entries below and delete all other IPv4 entries!!!)

## template:jinja
{#
This file (/etc/cloud/templates/hosts.debian.tmpl) is only utilized
if enabled in cloud-config. Specifically, in order to enable it
you need to add the following to config:
manage_etc_hosts: True
-#}
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.debian.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
{# The value '{{hostname}}' will be replaced with the local-hostname -#}
#127.0.1.1 {{fqdn}} {{hostname}}
#127.0.0.1 localhost
192.168.239.28 master
192.168.239.211 slave01
192.168.239.254 slave02
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Note that the subnet used here must match the network your SSH connection actually uses. Finally, distribute the files:

scp /etc/cloud/templates/hosts.debian.tmpl root@slave01:/etc/cloud/templates/hosts.debian.tmpl
scp /etc/cloud/templates/hosts.debian.tmpl root@slave02:/etc/cloud/templates/hosts.debian.tmpl
scp /etc/hosts root@slave01:/etc/hosts
scp /etc/hosts root@slave02:/etc/hosts
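As a quick sanity check that the name mapping works, each node should be able to resolve and reach the others; for example:

ping -c 1 slave01
ping -c 1 slave02
getent hosts master slave01 slave02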

(6). Configure passwordless SSH login

ssh-keygen -t rsa    # just press Enter at every prompt
cd /home/nudt/.ssh
cat id_rsa.pub >> authorized_keys
ssh-copy-id -i ./id_rsa.pub nudt@slave01    # each host copies its key to the other two
ssh-copy-id -i ./id_rsa.pub nudt@slave02

You should now see the following files:

image-20220613193638490

This needs to be done between every pair of hosts, so that each host can reach the other two without a password.

image-20220613194001823

Next, test passwordless remote login.

image-20220613194032730

The test succeeds.
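For reference, a command-line version of the same test, run from master (repeat analogously on the other hosts):

ssh nudt@slave01 hostname   # should print slave01 without asking for a password
ssh nudt@slave02 hostname   # should print slave02 without asking for a password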

(7). Configure Hadoop

a). Configure core-site.xml

This is Hadoop's global configuration file. Here we mainly configure the entry point of the distributed file system HDFS (i.e. the NameNode address) and the location where the data HDFS produces at runtime is stored.

cd /home/nudt/hadoop/etc/hadoop
vim core-site.xml
mkdir /home/nudt/hadoop_data/
mkdir /home/nudt/hadoop_data/tmp
mkdir /home/nudt/hadoop_data/dfs/
mkdir /home/nudt/hadoop_data/dfs/name
mkdir /home/nudt/hadoop_data/dfs/data
sudo mkdir -p /usr/container/logs

Adjust the values below as needed and paste them in:

<configuration>
    <!-- The address of the HDFS NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <!-- hdfs is the protocol, master is the NameNode's hostname, and the port is 9000 -->
        <value>hdfs://master:9000</value>
    </property>
    <!-- The directory where Hadoop stores files produced at runtime; it must be created beforehand -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/nudt/hadoop_data/tmp</value>
    </property>
</configuration>

Parameter notes

  • fs.defaultFS (the address of the NameNode in HDFS)
  • hadoop.tmp.dir (the directory where Hadoop stores files produced at runtime)

b). Configure hdfs-site.xml

<configuration>
    <!-- The HTTP address and port of the NameNode web UI -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.239.28:50070</value>
    </property>
    <!-- The HTTP address and port of the SecondaryNameNode web UI -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.239.28:50090</value>
    </property>
    <!-- The HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- Where the NameNode stores its data -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/nudt/hadoop_data/dfs/name</value>
    </property>
    <!-- Where each DataNode stores its data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/nudt/hadoop_data/dfs/data</value>
    </property>
</configuration>

Only one point needs explaining here: we have three machines in total, configured as one master and two slaves, so the SecondaryNameNode also runs on master.
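Once the file is in place, one way to double-check that these values are the ones Hadoop actually reads from its configuration directory:

hdfs getconf -confKey dfs.replication                      # expect 3
hdfs getconf -confKey dfs.namenode.secondary.http-address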

c). Configure yarn-site.xml

<configuration>
    <!-- The hostname of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <!-- The web address and port of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
    <!-- The auxiliary service the NodeManager loads at startup -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- The class that implements the mapreduce_shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <!-- Whether log aggregation is enabled -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- How long aggregated logs are kept on HDFS, in seconds -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>106800</value>
    </property>
    <!-- The directory for aggregated logs -->
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/usr/container/logs</value>
    </property>
</configuration>

d). Configure mapred-site.xml

<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- JobHistory web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>${hadoop.tmp.dir}/mr-history/tmp</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>${hadoop.tmp.dir}/mr-history/done</value>
    </property>
</configuration>

e). Configure the workers file ($HADOOP_HOME/etc/hadoop/workers)

master
slave01
slave02
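One way to write this file, assuming the HADOOP_HOME layout used in this guide:

cat > /home/nudt/hadoop/etc/hadoop/workers <<EOF
master
slave01
slave02
EOF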

f). Now distribute everything to the slave machines

scp -r  /home/nudt/hadoop nudt@slave01:/home/nudt/
scp -r /home/nudt/hadoop nudt@slave02:/home/nudt/

This takes quite a while; be patient…
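When the copies finish, a quick check that everything landed on the slaves (paths follow this guide's layout):

ssh nudt@slave01 'ls /home/nudt/hadoop/bin/hadoop'
ssh nudt@slave02 'ls /home/nudt/hadoop/bin/hadoop'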

1.4 Verifying the Installation

Before starting the Hadoop cluster, the NameNode must be formatted first:

hdfs namenode -format

  1. Start and stop HDFS

start-dfs.sh # start HDFS
stop-dfs.sh # stop HDFS

  2. Start and stop YARN

start-yarn.sh # start YARN
stop-yarn.sh # stop YARN

  3. Start or stop everything at once

start-all.sh # start HDFS and YARN
stop-all.sh # stop HDFS and YARN

  4. Start and stop the history (log) server

mr-jobhistory-daemon.sh start historyserver # start the historyserver
mr-jobhistory-daemon.sh stop historyserver # stop the historyserver

  5. Check the running Java processes

jps
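With the configuration used in this guide (master is also listed in workers and hosts the SecondaryNameNode), jps on master typically reports processes along these lines once HDFS, YARN, and the history server are running; the exact set depends on which start scripts you ran:

# NameNode
# SecondaryNameNode
# ResourceManager
# DataNode
# NodeManager
# JobHistoryServer   (only if the history server was started)
# Jps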

1.5 Switch Configuration (Optional)

(1). Configure a static IP on the Windows machine

Right-click the network icon, open the network settings, and choose "Change adapter options".

image-20220613204419805

Select the Ethernet adapter, right-click Properties, and select Internet Protocol Version 4 (TCP/IPv4).

image-20220613205450441

Configure it as follows:

image-20220613205533184

(2). Configure static IPs for the three Raspberry Pis (keep them consistent with the addresses used earlier)

  1. Plug the SD card into the computer and open the system-boot partition
    image-20220613205627741

  2. Edit cmdline.txt

    image-20220613205714223

Add the line shown in the figure and repeat this step for each Pi. The addresses must match the ones configured above, otherwise you will have to redo the whole configuration. After that everything comes back up normally.

1.6 Troubleshooting

a. JAVA_HOME is not set

vim /home/nudt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_241    # add this line

b. master: Permission denied

The cause is not insufficient permissions; it is that the public key of the user master runs as has not been added to the corresponding hosts.

  • First switch to the root user

    sudo passwd    # set the password to root
    su root
  • Generate a key pair

    ssh-keygen -t rsa -P ''
  • Add the key to the local root user's trusted keys

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • Send the public key

    First go through the same switch-to-root steps on the slave machines, then edit /etc/ssh/sshd_config.

    sudo vim /etc/ssh/sshd_config

    Find PermitRootLogin prohibit-password.

    image-20220613234132124

    Change it as follows:

    image-20220613234332312

    Restart the service with service sshd restart. From this point on, do everything as the root user!

  • Finally, add the configuration (as the root user)

    sudo vim /etc/profile

    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root

    Append these lines at the end. Use whichever user you actually log in and run the cluster as; it does not have to be root.

c. Fixing "could only be written to 0 of the 1 minReplication nodes"

Reference: https://blog.csdn.net/sinat_38737592/article/details/101628357

2. Using Hadoop

2.1 Basic Hadoop File Operations

Shell commands

Add this option when accessing the cluster remotely:

-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-fs hdfs://master:9000
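For example, listing the cluster's root directory against this guide's NameNode:

hadoop fs -fs hdfs://master:9000 -ls /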

Create a file

hadoop fs -touch /tmp/exp.tx

Write to a file

echo "<Text to append>" | hadoop fs -appendToFile - /aaa/aa.txt 
hadoop fs -appendToFile {src} {dst}

Delete a file

hadoop fs -rm README.txt

Download a file

[-get [-f] [-p] [-crc] [-ignoreCrc] [-t <thread count>] [-q <thread pool queue size>] <src> ... <localdst>]
hadoop fs -get <src> <localdst>
Most importantly, -t sets the number of download threads.
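For example (the paths are placeholders; adjust to your own files):

hadoop fs -get -t 4 /tmp/exp.txt ./exp.txt   # download with 4 threads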

Rename / move a file

-mv

Copy a file

-cp

View file contents

-cat
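A few example invocations (paths are placeholders):

hadoop fs -mv /tmp/exp.txt /tmp/exp-renamed.txt       # rename / move
hadoop fs -cp /tmp/exp-renamed.txt /tmp/exp-copy.txt  # copy
hadoop fs -cat /tmp/exp-copy.txt                      # print the file contents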

Set permissions (!! this must be done !!)

hadoop fs -chmod 777 /

For more commands, see hadoop fs -help

Java API operations

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class FileManager {
    public static String nameNode = "192.168.239.28:9000";
    public static URI hdfsHost;

    static {
        try {
            hdfsHost = new URI("hdfs://192.168.239.28:9000");
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
    }

    public static void createHelloWorld(Configuration cf, String filePath) throws IOException, URISyntaxException {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        byte[] buff = "Hello World".getBytes(StandardCharsets.UTF_8);
        if (!fs.exists(new Path(filePath))) {
            FSDataOutputStream fos = fs.create(new Path(filePath));
            fos.write(buff, 0, buff.length);
            System.out.println("Create a new File:" + filePath + " with HelloWorld");
            fos.close();
        } else {
            System.out.println("Will overwrite file:\t" + filePath);
            System.out.println("Add contents to :\t" + filePath + " with HelloWorld");
            FSDataOutputStream fos = fs.create(new Path(filePath), true);
            fos.write(buff, 0, buff.length);
            fos.close();
        }
        fs.close();
    }

    public static void fileExist(Configuration cf, String filePath) throws IOException {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        if (fs.exists(new Path(filePath))) {
            System.out.println(filePath + "\tExists!");
        } else {
            System.out.println(filePath + "\tNot Exists!");
        }
    }

    public static void readFile(Configuration cf, String filePath) throws IOException {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        FSDataInputStream open = fs.open(new Path(filePath));
        BufferedReader bfr = new BufferedReader(new InputStreamReader(open));
        System.out.println("Begin Read:" + filePath);
        String contentLine = bfr.readLine();
        while (contentLine != null) {
            System.out.println(contentLine);
            contentLine = bfr.readLine();
        }
    }

    /*
     * delete() takes the Path to remove and a boolean that controls whether to delete recursively.
     */
    public static void deleteFile(Configuration cf, String filePath) throws IOException {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        if (fs.delete(new Path(filePath), false)) {
            System.out.println(filePath + "\tdelete success!");
        } else {
            System.out.println(filePath + "\tdelete fail!");
        }
    }

    public static void showDir(Configuration cf, String filePath) throws IOException {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        // List the files (not directories) under this path; the boolean controls recursion.
        RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path(filePath), false);
        while (listFiles.hasNext()) {                  // while there are more entries
            LocatedFileStatus next = listFiles.next(); // take the next entry from the iterator
            System.out.println(next.getPath().getName());
        }
        // List all files and directories directly under this path.
        FileStatus[] listStatus = fs.listStatus(new Path(filePath));
        for (FileStatus list : listStatus) {
            System.out.println(list.getPath().getName());
        }
    }

    public static void uploadFile(Configuration cf, String localstr, String dst) throws Exception {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        InputStream in = new FileInputStream(localstr);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                // Called each time another buffer-sized chunk has been uploaded.
                System.out.println("Uploaded another buffer-sized chunk of the file!");
            }
        });
        IOUtils.copyBytes(in, out, cf);
        System.out.println("LocalFile:\t" + localstr + "\tupload to:" + dst);
    }

    public static void downloadFile(Configuration cf, String remoteStr, String localString) throws Exception {
        FileSystem fs = FileSystem.get(hdfsHost, cf);
        InputStream in = fs.open(new Path(remoteStr));
        OutputStream out = new FileOutputStream(localString);
        IOUtils.copyBytes(in, out, cf);
        System.out.println("downloadFile:\t" + remoteStr + " to " + localString);
    }

    public static void main(String[] args) throws Exception {
        Configuration cf = new Configuration();
        String path = "/tmp/dem0.txt";
        cf.set("fs.defaultFS", "hdfs://" + nameNode);
        System.out.println("[*]createHelloWorld:");
        createHelloWorld(cf, path);
        System.out.println("[*]showDir:");
        showDir(cf, "/tmp");
        System.out.println("[*]fileExist:");
        fileExist(cf, path);
        System.out.println("[*]readFile:");
        readFile(cf, path);
        System.out.println("[*]deleteFile:");
        deleteFile(cf, path);
        System.out.println("[*]showDir:");
        showDir(cf, "/");
        System.out.println("[*]uploadFile:");
        uploadFile(cf, "/etc/passwd", "/tmp/passwd");
    }
}
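One way to compile and run this class directly on a cluster node, without Maven, is to put Hadoop's jars on the classpath and use the "hadoop CLASSNAME" form shown in the usage output above (this assumes the class is saved as FileManager.java with no package declaration, as listed here):

javac -cp "$(hadoop classpath)" FileManager.java
HADOOP_CLASSPATH=. hadoop FileManager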

2.2 Setting Up a VSCode + Maven + Hadoop Development Environment

This is based on the iotdevelop environment.

  • Install VSCode

    wget https://vscode.cdn.azure.cn/stable/4af164ea3a06f701fe3e89a2bcbb421d2026b68f/code_1.68.0-1654690107_amd64.deb?1 -O code.deb
    sudo dpkg -i ./code.deb
  • Install Maven

    sudo apt-get install maven
    export M2_HOME=/usr/share/maven    # add this line to /etc/profile
  • Configure the Aliyun mirror for Maven

    See the reference material for details.

    sudo vim /usr/share/maven/conf/settings.xml

    image-20220614151033999

  • Configure VSCode

    Install the Java Extension Pack extension.

    image-20220614161346645

    Start configuring:

    image-20220614161426059

    image-20220614161441367

  • Create a new project

    Right-click in an empty area:

    image-20220614161523396

    image-20220614161540381

    image-20220614161556634

For the remaining two prompts, enter whatever you like, then click:

image-20220614161628479

Choose the directory where you want the project, then wait.

image-20220614161807103

Just press Enter here. Then import the dependencies. Note that the dependencies below reference ${hadoop.version}, so define a hadoop.version property in the POM's <properties> section (e.g. 3.3.2 to match the cluster) before adding them.

<!-- Hadoop dependencies -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-api</artifactId>
    <version>${hadoop.version}</version>
</dependency>

image-20220614162024223

Pay attention to where the block is placed in the POM! Then just wait; once everything finishes loading you can start writing code.
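A minimal way to build and run the FileManager example from the Maven project, assuming the exec-maven-plugin can be resolved from your configured mirror and that FileManager is the fully qualified main class (adjust the name if your archetype generated a package):

mvn -q compile
mvn -q exec:java -Dexec.mainClass=FileManager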

References

1. HDFS command-line operations: https://zhuanlan.zhihu.com/p/271098213

2. HDFS Java API operations: https://blog.csdn.net/little_sloth/article/details/107040607

3. VSCode + Maven + Hadoop development: https://www.cnblogs.com/orion-orion/p/15664772.html

4. Installing Maven on Ubuntu: https://cloud.tencent.com/developer/article/1649751

5. Java permissions: https://blog.csdn.net/qq_43541746/article/details/115422142

6. MapReduce introduction: https://www.runoob.com/w3cnote/mapreduce-coding.html