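      gala-anteater User Guide

      gala-anteater is an AI-based anomaly detection platform for operating systems. It provides time-series data preprocessing, anomaly detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it is well suited to fault diagnosis on multi-dimensional, multi-modal data.

      This document describes how to deploy and use the gala-anteater service to detect slow nodes and slow cards in a training cluster.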

      Installation

      Configure the repo sources:

      [everything]
      name=everything
      baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.03-LTS-SP1/rc4_openeuler-2024-12-05-15-40-49/everything/$basearch/
      enabled=1
      gpgcheck=0
      priority=1
      
      [EPOL]
      name=EPOL
      baseurl=http://repo.openeuler.org/EBS-openEuler-24.03-LTS-SP1/EPOL/main/$basearch/
      enabled=1
      gpgcheck=0
      priority=1
      

      Install gala-anteater:

      yum install gala-anteater
      

      Configuration

      Note:

      gala-anteater reads its startup parameters from a configuration file located at /etc/gala-anteater/config/gala-anteater.yaml.

      Default configuration parameters

      Global:
        data_source: "prometheus"
      
      Arangodb:
        url: "http://localhost:8529"
        db_name: "spider"
      
      Kafka:
        server: "192.168.122.100"
        port: "9092"
        model_topic: "gala_anteater_hybrid_model"
        rca_topic: "gala_cause_inference"
        meta_topic: "gala_gopher_metadata"
        group_id: "gala_anteater_kafka"
        # auth_type: plaintext/sasl_plaintext, please set "" for no auth
        auth_type: ""
        username: ""
        password: ""
      
      Prometheus:
        server: "localhost"
        port: "9090"
        steps: "5"
      
      Aom:
        base_url: ""
        project_id: ""
        auth_type: "token"
        auth_info:
          iam_server: ""
          iam_domain: ""
          iam_user_name: ""
          iam_password: ""
          ssl_verify: 0
      
      Schedule:
        duration: 1
        
      Suppression:
        interval: 10
      
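      | Parameter | Description | Default |
      | --- | --- | --- |
      | Global | Global configuration | dictionary type |
      | data_source | Data source | "prometheus" |
      | Arangodb | ArangoDB graph database configuration | dictionary type |
      | url | Address of the ArangoDB graph database | "http://localhost:8529" |
      | db_name | Graph database name | "spider" |
      | Kafka | Kafka configuration | dictionary type |
      | server | IP address of the Kafka server; set it according to the IP of the node where Kafka is installed | "192.168.122.100" |
      | port | Port of the Kafka server, for example 9092 | "9092" |
      | model_topic | Topic to which fault detection results are reported | "gala_anteater_hybrid_model" |
      | rca_topic | Topic to which root cause analysis results are reported | "gala_cause_inference" |
      | meta_topic | Topic for the metric data collected by gopher | "gala_gopher_metadata" |
      | group_id | Kafka group name | "gala_anteater_kafka" |
      | Prometheus | Prometheus data source configuration | dictionary type |
      | server | IP address of the Prometheus server; set it according to the IP of the node where Prometheus is installed | "localhost" |
      | port | Port of the Prometheus server, for example 9090 | "9090" |
      | steps | Metric sampling interval | "5" |
      | Schedule | Periodic scheduling configuration | dictionary type |
      | duration | How often the anomaly detection model runs, in minutes (one detection every x minutes) | 1 |
      | Suppression | Alert suppression configuration | dictionary type |
      | interval | Alert suppression interval, in minutes; an alert identical to the previous one is filtered if it occurs within x minutes of it | 10 |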
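To make the meaning of Suppression.interval concrete, here is a minimal sketch of interval-based alert suppression. It is an illustration only, not gala-anteater's implementation: the class name `AlertSuppressor`, the `alert_key` argument, and the choice to measure the interval from the last alert that was actually reported are all assumptions.

```python
class AlertSuppressor:
    """Hypothetical helper: filter identical alerts raised within an interval."""

    def __init__(self, interval_minutes):
        # Suppression.interval is expressed in minutes in the configuration.
        self.interval_seconds = interval_minutes * 60
        self._last_reported = {}  # alert key -> time of the last reported alert

    def should_report(self, alert_key, now_seconds):
        """Return True if the alert should be reported, False if suppressed."""
        last = self._last_reported.get(alert_key)
        if last is not None and now_seconds - last < self.interval_seconds:
            # Same alert seen again inside the suppression window: filter it.
            return False
        self._last_reported[alert_key] = now_seconds
        return True
```

With the default interval of 10, a second alert for the same key five minutes after the first would be filtered under these assumptions, while one arriving ten minutes or more later would be reported again.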

      Starting the service

      Run the following command to start gala-anteater:

      systemctl start gala-anteater
      

      Note:

      gala-anteater supports only one process instance. Starting more than one leads to excessive memory usage and interleaved logs.
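If the service starts but no detection results appear, one thing worth checking is whether the Kafka and Prometheus endpoints set in the configuration file are reachable from this node. The sketch below is a generic TCP reachability check using only the Python standard library. The function names and the `ENDPOINTS` mapping are hypothetical, and the addresses are simply the defaults from the configuration shown earlier; this is not a gala-anteater tool.

```python
import socket

# Defaults taken from the sample configuration; adjust to the real deployment.
ENDPOINTS = {
    "Kafka": ("192.168.122.100", "9092"),
    "Prometheus": ("localhost", "9090"),
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Ports are strings in the YAML file, so convert before connecting.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints, timeout=2.0):
    """Map each endpoint name to its reachability result."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

Calling `check_endpoints(ENDPOINTS)` returns a dictionary such as `{"Kafka": ..., "Prometheus": ...}` with a boolean per service. A successful TCP connection only shows that the port is open, not that the service behind it is healthy.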

      Checking the status of slow node detection

      If the log shows output like the following, slow node detection is running normally. The startup log is also saved to /var/log/gala-anteater/gala-anteater.log.

      2024-12-02 16:25:20,727 - INFO - anteater - Groups-0, metric: npu_chip_info_hbm_used_memory, start detection.
      2024-12-02 16:25:20,735 - INFO - anteater - Metric-npu_chip_info_hbm_used_memory single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
      2024-12-02 16:25:20,739 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection start.
      2024-12-02 16:25:21,128 - INFO - anteater - time_node_compare result: [].
      2024-12-02 16:25:21,137 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
      2024-12-02 16:25:21,139 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
      2024-12-02 16:25:21,141 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
      2024-12-02 16:25:21,142 - INFO - anteater - space_nodes_compare result: [].
      2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].
      2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.
      
      2024-12-02 16:25:21,144 - INFO - anteater - Groups-0, metric: npu_chip_info_aicore_current_freq, start detection.
      2024-12-02 16:25:21,153 - INFO - anteater - Metric-npu_chip_info_aicore_current_freq single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
      2024-12-02 16:25:21,157 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection start.
      2024-12-02 16:25:21,584 - INFO - anteater - time_node_compare result: [].
      2024-12-02 16:25:21,592 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
      2024-12-02 16:25:21,594 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
      2024-12-02 16:25:21,597 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
      2024-12-02 16:25:21,598 - INFO - anteater - space_nodes_compare result: [].
      2024-12-02 16:25:21,598 - INFO - anteater - Time and space aggregated result: [].
      2024-12-02 16:25:21,598 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection end.
      
      2024-12-02 16:25:21,598 - INFO - anteater - Groups-0, metric: npu_chip_roce_tx_err_pkt_num, start detection.
      2024-12-02 16:25:21,607 - INFO - anteater - Metric-npu_chip_roce_tx_err_pkt_num single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
      2024-12-02 16:25:21,611 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection start.
      2024-12-02 16:25:22,040 - INFO - anteater - time_node_compare result: [].
      2024-12-02 16:25:22,040 - INFO - anteater - Skip space nodes compare.
      2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
      2024-12-02 16:25:22,040 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection end.
      
      2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 1/9
      2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 2/9
      2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 3/9
      2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 4/9
      2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 5/9
      2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 6/9
      2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 7/9
      2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
      2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
      2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
      2024-12-02 16:25:22,043 - INFO - anteater - END!
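A log like the one above can also be checked programmatically. The following sketch extracts the aggregated result for each metric and reports whether the run reached the final END! line. The regular expressions are inferred from the sample output only and may need adjusting if the log format differs; the function name `summarize_detection` is hypothetical.

```python
import re

# Patterns inferred from the sample log lines; they are assumptions, not a spec.
LINE_RE = re.compile(r"^\S+ \S+ - \w+ - anteater - (?P<msg>.*)$")
START_RE = re.compile(r"work on (?P<metric>\S+), slow_node_detection start\.")
RESULT_RE = re.compile(r"Time and space aggregated result: (?P<result>.*)\.$")


def summarize_detection(log_text):
    """Return ({metric: aggregated result text}, finished flag) for one run."""
    results = {}
    current_metric = None
    finished = False
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # blank lines or lines in another format
        msg = match.group("msg")
        start = START_RE.search(msg)
        if start:
            current_metric = start.group("metric")
            continue
        result = RESULT_RE.search(msg)
        if result and current_metric is not None:
            results[current_metric] = result.group("result")
            continue
        if msg == "END!":
            finished = True
    return results, finished
```

For the sample above, every metric would map to the text `[]`, meaning no slow node or card was found, and the finished flag would be true.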
      

      Anomaly detection output

      When gala-anteater detects an anomaly, it writes the result to the Kafka topic set in model_topic. The output format is as follows:

      {
          "Timestamp": 1730732076935, 
          "Attributes": {
              "resultCode": 201, 
              "compute": false, 
              "network": false, 
              "storage": true, 
              "abnormalDetail": [{
                  "objectId": "-1", 
                  "serverIp": "96.13.19.31", 
                  "deviceInfo": "96.13.19.31:8888*-1", 
                  "kpiId": "gala_gopher_disk_wspeed_kB", 
                  "methodType": "TIME", 
                  "kpiData": [], 
                  "relaIds": [], 
                  "omittedDevices": []
              }], 
              "normalDetail": [], 
              "errorMsg": ""
          }, 
          "SeverityText": "WARN", 
          "SeverityNumber": 13, 
          "is_anomaly": true
      }
      

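      Output field description

      | Field | Unit / type | Description |
      | --- | --- | --- |
      | Timestamp | ms | Time at which the detected fault was reported |
      | resultCode | int | Fault code: 201 indicates a fault, 200 indicates no fault |
      | compute | bool | Whether the fault is a compute fault |
      | network | bool | Whether the fault is a network fault |
      | storage | bool | Whether the fault is a storage fault |
      | abnormalDetail | list | Details of the fault |
      | objectId | int | ID of the faulty object: -1 indicates a node fault, 0-7 indicates the number of the faulty card |
      | serverIp | string | IP address of the faulty object |
      | deviceInfo | string | Detailed fault information |
      | kpiId | string | Metric on which the fault was detected |
      | methodType | string | Type of algorithm that detected the fault: "TIME" or "SPACE" |
      | kpiData | list | Time-series data of the fault; requires a switch to be enabled, which is off by default |
      | relaIds | list | Normal cards associated with the faulty card, that is, the normal card numbers it was compared against under the "SPACE" algorithm |
      | omittedDevices | list | Card numbers omitted from display |
      | normalDetail | list | Time-series data of the normal cards |
      | errorMsg | string | Error message |
      | SeverityText | string | Error type: "WARN" or "ERROR" |
      | SeverityNumber | int | Error level |
      | is_anomaly | bool | Whether a fault was detected |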
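As an illustration of how a consumer might interpret one of these messages, the sketch below parses a single JSON message and returns one summary per entry in abnormalDetail. It assumes the message has already been read from Kafka as a string; the function name `summarize_event` and the keys of the returned dictionaries are hypothetical, and only the fields shown in the sample output are used.

```python
import json
from datetime import datetime, timezone


def summarize_event(message):
    """Return a list of summaries, one per abnormalDetail entry, or [] if no fault."""
    event = json.loads(message)
    attrs = event.get("Attributes", {})
    # 201 indicates a fault and 200 indicates no fault.
    if not event.get("is_anomaly") or attrs.get("resultCode") != 201:
        return []
    # Timestamp is in milliseconds; convert to an aware UTC datetime.
    reported_at = datetime.fromtimestamp(event["Timestamp"] / 1000, tz=timezone.utc)
    fault_types = [name for name in ("compute", "network", "storage") if attrs.get(name)]
    findings = []
    for detail in attrs.get("abnormalDetail", []):
        # objectId is a string in the sample message, so convert it first.
        object_id = int(detail["objectId"])
        findings.append({
            "reported_at": reported_at.isoformat(),
            "fault_types": fault_types,
            "server_ip": detail["serverIp"],
            # -1 means the node itself; 0-7 identify a specific card.
            "target": "node" if object_id == -1 else f"card {object_id}",
            "kpi": detail["kpiId"],
            "method": detail["methodType"],
        })
    return findings
```

Applied to the sample message above, this would yield a single finding: a node-level storage fault on 96.13.19.31, detected on gala_gopher_disk_wspeed_kB by the "TIME" method.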
