数仓运维:Hadoop 集群配置错误导致的报错及处理办法

Hadoop宕机

(1)如果MR造成系统宕机。此时要控制Yarn同时运行的任务数,和每个任务申请的最大内存。调整参数:yarn.scheduler.maximum-allocation-mb(单个任务可申请的最多物理内存量,默认是8192MB)

(2)如果写入文件过量造成NameNode宕机。那么调高Kafka的存储大小,控制从Kafka到HDFS的写入速度。高峰期的时候用Kafka进行缓存,高峰期过去数据同步会自动跟上。

报错:no YARN_RESOURCEMANAGER_USER defined

  • 报错内容: ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
  • 解决办法:

    cd /usr/local/hadoop-3.1.3/sbin/
    
      vim start-dfs.sh
        
      # 在文件顶部添加以下参数
      HDFS_NAMENODE_USER=root
      HDFS_DATANODE_USER=root
      HDFS_SECONDARYNAMENODE_USER=root
      YARN_RESOURCEMANAGER_USER=root
      YARN_NODEMANAGER_USER=root
    
      vim stop-dfs.sh
        
      # 在文件顶部添加以下参数
      HDFS_NAMENODE_USER=root
      HDFS_DATANODE_USER=root
      HDFS_SECONDARYNAMENODE_USER=root
      YARN_RESOURCEMANAGER_USER=root
      YARN_NODEMANAGER_USER=root
    
      vim start-yarn.sh
        
      # 在文件顶部添加以下参数
      YARN_RESOURCEMANAGER_USER=root
      HADOOP_SECURE_DN_USER=yarn
      YARN_NODEMANAGER_USER=root
    
      vim stop-yarn.sh
        
      # 在文件顶部添加以下参数
      YARN_RESOURCEMANAGER_USER=root
      HADOOP_SECURE_DN_USER=yarn
      YARN_NODEMANAGER_USER=root
    

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

  • 写入数据到 hive 时报错,提示 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask,报错内容如下方所示:

      INFO  : Compiling command(queryId=root_20220702214441_90f37f30-1f88-4147-bc45-832d0018fa04): insert into student_hdfs values (950122, '华少','男',30,'IS')
      INFO  : Concurrency mode is disabled, not creating a lock manager
      INFO  : Semantic Analysis Completed (retrial = false)
      INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:string, comment:null), FieldSchema(name:col3, type:string, comment:null), FieldSchema(name:col4, type:int, comment:null), FieldSchema(name:col5, type:string, comment:null)], properties:null)
      INFO  : Completed compiling command(queryId=root_20220702214441_90f37f30-1f88-4147-bc45-832d0018fa04); Time taken: 0.633 seconds
      INFO  : Concurrency mode is disabled, not creating a lock manager
      INFO  : Executing command(queryId=root_20220702214441_90f37f30-1f88-4147-bc45-832d0018fa04): insert into student_hdfs values (950122, '华少','男',30,'IS')
      WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      INFO  : Query ID = root_20220702214441_90f37f30-1f88-4147-bc45-832d0018fa04
      INFO  : Total jobs = 3
      INFO  : Launching Job 1 out of 3
      INFO  : Starting task [Stage-1:MAPRED] in serial mode
      INFO  : Number of reduce tasks determined at compile time: 1
      INFO  : In order to change the average load for a reducer (in bytes):
      INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
      INFO  : In order to limit the maximum number of reducers:
      INFO  :   set hive.exec.reducers.max=<number>
      INFO  : In order to set a constant number of reducers:
      INFO  :   set mapreduce.job.reduces=<number>
      INFO  : number of splits:1
      INFO  : Submitting tokens for job: job_1656766458051_0004
      INFO  : Executing with tokens: []
      INFO  : The url to track the job: http://hadoop103:8088/proxy/application_1656766458051_0004/
      INFO  : Starting Job = job_1656766458051_0004, Tracking URL = http://hadoop103:8088/proxy/application_1656766458051_0004/
      INFO  : Kill Command = /usr/local/hadoop-3.1.3/bin/mapred job  -kill job_1656766458051_0004
      INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
      INFO  : 2022-07-02 21:45:05,409 Stage-1 map = 100%,  reduce = 100%
      ERROR : Ended Job = job_1656766458051_0004 with errors
      ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
      INFO  : MapReduce Jobs Launched: 
      INFO  : Stage-Stage-1: Map: 1  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
      INFO  : Total MapReduce CPU Time Spent: 0 msec
      INFO  : Completed executing command(queryId=root_20220702214441_90f37f30-1f88-4147-bc45-832d0018fa04); Time taken: 25.682 seconds
      INFO  : Concurrency mode is disabled, not creating a lock manager
      Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
    
  • 注意:下方的报错并不是实际报错,需要到 yarn web ui 集群查看真实的错误。

  • 真实的报错内容:

      Application application_1656759536632_0001 failed 2 times due to AM Container for appattempt_1656759536632_0001_000002 exited with exitCode: 1
      Failing this attempt.Diagnostics: [2022-07-02 19:00:15.757]Exception from container-launch.
      Container id: container_1656759536632_0001_02_000001
      Exit code: 1
      [2022-07-02 19:00:15.799]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
      Last 4096 bytes of prelaunch.err :
      Last 4096 bytes of stderr :
      Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
      Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
      <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
      <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      [2022-07-02 19:00:15.801]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
      Last 4096 bytes of prelaunch.err :
      Last 4096 bytes of stderr :
      Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
      Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
      <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
      <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      For more detailed output, check the application tracking page: http://hadoop103:8088/cluster/app/application_1656759536632_0001 Then click on links to logs of each attempt.
      . Failing the application.
    
  • 错误解决办法:

    • 下面的方法我尝试了,不可行。
      • mapred-site.xml 给 HADOOP_MAPRED_HOME 配置绝对路径
           <property>
             <name>yarn.app.mapreduce.am.env</name>
             <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
           </property>
           <property>
             <name>mapreduce.map.env</name>
             <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
           </property>
           <property>
             <name>mapreduce.reduce.env</name>
             <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
            </property>
          </configuration>
        
      • 配置 hadoop classpath
          hadoop classpath
        
        • 获取后的路径 : 替换成 ,,写入 yarn-site.xml,如下:
            <property>
              <name>yarn.application.classpath</name>
              <value>/usr/local/hadoop-3.1.3/etc/hadoop,/usr/local/hadoop-3.1.3/share/hadoop/common/lib/*,/usr/local/hadoop-3.1.3/share/hadoop/common/*,/usr/local/hadoop-3.1.3/share/hadoop/hdfs,/usr/local/hadoop-3.1.3/share/hadoop/hdfs/lib/*,/usr/local/hadoop-3.1.3/share/hadoop/hdfs/*,/usr/local/hadoop-3.1.3/share/hadoop/mapreduce/lib/*,/usr/local/hadoop-3.1.3/share/hadoop/mapreduce/*,/usr/local/hadoop-3.1.3/share/hadoop/yarn,/usr/local/hadoop-3.1.3/share/hadoop/yarn/lib/*,/usr/local/hadoop-3.1.3/share/hadoop/yarn/*</value>
            </property>
            <property>
              <name>yarn.nodemanager.env-whitelist</name>,
              <value>JAVA_HOME,HADOOP_HOME,HADOOP_COMMON_HOME, HADOOP_ HDFS_HOME, HADOOP_ CONF_DIR, CLASSPATH_PREPEND_DISTCACHE, HADOOP_YARN_HOME, HADOOP_MAPRED_HOME</value>
            </property>
          
      • 上述方法我试了都不可行。
  • 解决办法:

    • 编辑 mapred-site.xml
        <property>
          <name>yarn.app.mapreduce.am.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_CLASSPATH}</value>
        </property>
        <property>
          <name>mapreduce.map.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_CLASSPATH}</value>
        </property>
        <property>
          <name>mapreduce.reduce.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_CLASSPATH}</value>
        </property>
      
    • export 配置 HADOOP_MAPRED_HOME
        export HADOOP_MAPRED_HOME=/usr/local/hadoop-3.1.3
      
      • 注意:这一步不能去掉,不然的话上面的配置会不生效。
    • 分发 mapred-site.xml 到集群其他节点
    • 重启集群。

The required MAP capability is more than the supported max container capability in the cluster.

  • 报错内容:

      The required MAP capability is more than the supported max container capability in the cluster. Killing the Job. mapResourceRequest: <memory:5120, vCores:1> maxContainerCapability:<memory:4096, vCores:4>
      Job received Kill while in RUNNING state.
      REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceRequest: <memory:5120, vCores:1> maxContainerCapability:<memory:4096, vCores:4>
    
  • 解决方法:

    • 需要调整两个参数:

      • yarn.nodemanager.resource.memory-mb
      • yarn.scheduler.maximum-allocation-mb
      • 注意:如果只设置其中一个,可能依然存在同样的报错
    • 编辑 yarn-site.xml

        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>10000</value>
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>10000</value>
        </property>
      
    • 分发 yarn-site.xml 到集群其他节点
    • 重启集群。

Error: JAVA_HOME is not set and could not be found.

  • 报错内容:
    Error: JAVA_HOME is not set and could not be found.
    
  • 解决办法 1: 添加系统环境变设置
    • 检查环境变量设置
      which java
          
      java -version
      
    • 如果返回异常,则需要手动添加
      vim /etc/profile
          
      # JAVA_HOME
      export JAVA_HOME=/usr/local/bin/jdk1.8
      export PATH=$PATH:$JAVA_HOME/bin
      
    • 启动生效
      source /etc/profile
      
  • 解决方法 2:在 hadoop-env.sh 中的环境变量设置
      cd /usr/local/hadoop-3.1.3/etc/hadoop/
        
      vim hadoop-env.sh
        
      export JAVA_HOME=/usr/local/bin/jdk1.8
    
  • 注意:
    • 对于 hadoop 集群而言,前面所说的环境变量配置,必须要在 master 上进行,并在集群内部分发、而且启动生效;
    • 如果按照上面的操作进行以后仍然报错,请 仔细检查是否是在 master 上进行这些配置操作,检查集群机器是否全部配置生效

bash: jps: command not found

  • 报错内容:
      bash: jps: command not found
    
  • 解决方法:
    • 删除原有的 jps 软连接,并新建一个
      rm /usr/local/bin/jps
      ln -s /usr/local/bin/jdk1.8/bin/jps /usr/local/bin/jps
      

/usr/libexec/grepconf.sh:行5: grep: 未找到命令

  • 问题:
    • /etc/profile 配置 SPARK_HOME 后,执行 source /etc/profile 环境变量生效,提示报错;
    • 报错后命令行常规命令均无法正常使用
  • 经排查发现路径配置错误,需要修改路径,重新启用环境变量配置文件;

  • 解决方法:
    • 修改配置文件 SPARK_HOME 路径;
    • cmd 窗口执行如下命令:
        export PATH=/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin
      
    • 重新启用环境变量配置文件
        source /etc/profile