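gala-anteater User Guide

gala-anteater is an AI-based anomaly detection platform for operating systems. It provides time-series data preprocessing, anomaly point detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it is well suited to fault diagnosis on multi-dimensional, multi-modal data.

This document describes how to deploy and use the gala-anteater service to detect slow nodes and slow cards in a training cluster.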

Installation

Configure the repo source:

basic
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.03-LTS-SP1/rc4_openeuler-2024-12-05-15-40-49/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/EBS-openEuler-24.03-LTS-SP1/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater:

bash
yum install gala-anteater

Configuration

Note:

gala-anteater reads its startup parameters from a configuration file located at /etc/gala-anteater/config/gala-anteater.yaml.

Default configuration parameters

yaml
Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1
  
Suppression:
  interval: 10
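
| Parameter | Meaning | Default |
| --- | --- | --- |
| Global | Global configuration | dictionary |
| data_source | Data source | "prometheus" |
| Arangodb | ArangoDB graph database configuration | dictionary |
| url | Address of the ArangoDB graph database | "http://localhost:8529" |
| db_name | Graph database name | "spider" |
| Kafka | Kafka configuration | dictionary |
| server | IP address of the Kafka server; set it to the IP of the node where Kafka is installed | "192.168.122.100" |
| port | Port of the Kafka server, for example 9092 | "9092" |
| model_topic | Topic to which fault detection results are reported | "gala_anteater_hybrid_model" |
| rca_topic | Topic to which root cause analysis results are reported | "gala_cause_inference" |
| meta_topic | Topic carrying the metric data collected by gopher | "gala_gopher_metadata" |
| group_id | Kafka group name | "gala_anteater_kafka" |
| Prometheus | Prometheus data source configuration | dictionary |
| server | IP address of the Prometheus server; set it to the IP of the node where Prometheus is installed | "localhost" |
| port | Port of the Prometheus server, for example 9090 | "9090" |
| steps | Metric sampling interval | "5" |
| Schedule | Periodic scheduling configuration | dictionary |
| duration | How often the anomaly detection model runs, in minutes; detection runs once every x minutes | 1 |
| Suppression | Alert suppression configuration | dictionary |
| interval | Alert suppression interval, in minutes; an identical alert raised within x minutes of the previous one is filtered out | 10 |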
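
The effect of Suppression.interval can be illustrated with a minimal Python sketch. This is not gala-anteater's implementation: the AlertSuppressor class, the alert key format (server IP plus KPI), and the choice to measure the interval from the last reported alert are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


class AlertSuppressor:
    """Drop an alert if the same alert key was reported within `interval` minutes.

    Hypothetical helper for illustration only; not taken from gala-anteater.
    """

    def __init__(self, interval_minutes: int = 10):
        self.interval = timedelta(minutes=interval_minutes)
        self._last_reported = {}  # alert key -> time of the last reported alert

    def should_report(self, key: str, now: datetime) -> bool:
        last = self._last_reported.get(key)
        if last is not None and now - last < self.interval:
            return False  # same alert seen too recently: suppress it
        self._last_reported[key] = now
        return True


suppressor = AlertSuppressor(interval_minutes=10)  # mirrors Suppression.interval
t0 = datetime(2024, 12, 2, 16, 25)
key = "96.13.19.31/gala_gopher_disk_wspeed_kB"  # assumed key: server IP + KPI

first = suppressor.should_report(key, t0)                          # reported
repeat = suppressor.should_report(key, t0 + timedelta(minutes=5))  # suppressed
later = suppressor.should_report(key, t0 + timedelta(minutes=12))  # reported again
print(first, repeat, later)
```

With the default interval of 10, the repeat at 5 minutes is dropped and the one at 12 minutes is reported again.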

Startup

Run the following command to start gala-anteater:

shell
systemctl start gala-anteater

Note:

gala-anteater supports only one running process instance. Starting several instances leads to excessive memory usage and interleaved logs.

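Checking the execution status of slow node detection

If the log shows output like the following, slow node detection is running normally. The startup log is also saved to /var/log/gala-anteater/gala-anteater.log.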

log
2024-12-02 16:25:20,727 - INFO - anteater - Groups-0, metric: npu_chip_info_hbm_used_memory, start detection.
2024-12-02 16:25:20,735 - INFO - anteater - Metric-npu_chip_info_hbm_used_memory single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:20,739 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection start.
2024-12-02 16:25:21,128 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,137 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,139 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,141 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,142 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.

2024-12-02 16:25:21,144 - INFO - anteater - Groups-0, metric: npu_chip_info_aicore_current_freq, start detection.
2024-12-02 16:25:21,153 - INFO - anteater - Metric-npu_chip_info_aicore_current_freq single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,157 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection start.
2024-12-02 16:25:21,584 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,592 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,594 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,597 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,598 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,598 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,598 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection end.

2024-12-02 16:25:21,598 - INFO - anteater - Groups-0, metric: npu_chip_roce_tx_err_pkt_num, start detection.
2024-12-02 16:25:21,607 - INFO - anteater - Metric-npu_chip_roce_tx_err_pkt_num single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,611 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection start.
2024-12-02 16:25:22,040 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:22,040 - INFO - anteater - Skip space nodes compare.
2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:22,040 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection end.

2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 1/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 2/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 3/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 4/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 5/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 6/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 7/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
2024-12-02 16:25:22,043 - INFO - anteater - END!
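
A log like the one above can also be checked programmatically. The following Python sketch relies only on the line patterns visible in the sample log; the summarize_run helper is hypothetical, and treating a non-empty "Time and space aggregated result" list as a detection is an assumption, since the sample shows only empty lists.

```python
import re

ACCOMPLISHMENT = re.compile(r"accomplishment: (\d+)/(\d+)")
AGGREGATED = re.compile(r"Time and space aggregated result: \[(.*)\]\.")


def summarize_run(log_text: str) -> dict:
    """Summarize one detection round from gala-anteater log text."""
    done, total = 0, 0
    nonempty_results = []
    for line in log_text.splitlines():
        m = ACCOMPLISHMENT.search(line)
        if m:
            # Keep the most recent progress counter, e.g. 9/9.
            done, total = int(m.group(1)), int(m.group(2))
            continue
        m = AGGREGATED.search(line)
        if m and m.group(1).strip():
            # Assumption: a non-empty list means something was flagged.
            nonempty_results.append(m.group(1).strip())
    return {
        "finished": total > 0 and done == total and "- END!" in log_text,
        "progress": (done, total),
        "nonempty_results": nonempty_results,
    }


sample = """\
2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
2024-12-02 16:25:22,043 - INFO - anteater - END!
"""
summary = summarize_run(sample)
print(summary)
```

For a real check, the text of /var/log/gala-anteater/gala-anteater.log would be passed in instead of the inline sample.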

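Anomaly detection output data

When gala-anteater detects an anomaly, it publishes the result to the Kafka topic configured as model_topic. The output has the following format: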

json
{
    "Timestamp": 1730732076935, 
    "Attributes": {
        "resultCode": 201, 
        "compute": false, 
        "network": false, 
        "storage": true, 
        "abnormalDetail": [{
            "objectId": "-1", 
            "serverIp": "96.13.19.31", 
            "deviceInfo": "96.13.19.31:8888*-1", 
            "kpiId": "gala_gopher_disk_wspeed_kB", 
            "methodType": "TIME", 
            "kpiData": [], 
            "relaIds": [], 
            "omittedDevices": []
        }], 
        "normalDetail": [], 
        "errorMsg": ""
    }, 
    "SeverityText": "WARN", 
    "SeverityNumber": 13, 
    "is_anomaly": true
}

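Output field description

| Field | Unit/Type | Meaning |
| --- | --- | --- |
| Timestamp | ms | Time at which the detected fault was reported |
| resultCode | int | Fault code: 201 means a fault, 200 means no fault |
| compute | bool | Whether the fault is a compute fault |
| network | bool | Whether the fault is a network fault |
| storage | bool | Whether the fault is a storage fault |
| abnormalDetail | list | Details of the fault |
| objectId | int | ID of the faulty object: -1 means a node fault, 0-7 is the number of the faulty card |
| serverIp | string | IP address of the faulty object |
| deviceInfo | string | Detailed fault information |
| kpiId | string | Metric on which the fault was detected, for example "gala_gopher_disk_wspeed_kB" |
| methodType | string | Type of algorithm that detected the fault: "TIME" or "SPACE" |
| kpiData | list | Time-series data of the fault; requires a switch to be turned on, off by default |
| relaIds | list | Normal cards associated with the faulty card, that is, the normal card numbers it was compared against under the "SPACE" algorithm |
| omittedDevices | list | Card numbers omitted from display |
| normalDetail | list | Time-series data of the normal cards |
| errorMsg | string | Error message |
| SeverityText | string | Severity type: "WARN" or "ERROR" |
| SeverityNumber | int | Severity level |
| is_anomaly | bool | Whether a fault is present |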
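
As a rough illustration of how a consumer might interpret one such message, the Python sketch below parses a payload shaped like the sample above and extracts the fault categories and affected objects. The summarize_message helper is hypothetical, and reading the message from Kafka is left out; only the field names come from the sample.

```python
import json


def summarize_message(raw: str) -> dict:
    """Extract the main fields from one gala-anteater result message."""
    msg = json.loads(raw)
    attrs = msg.get("Attributes", {})
    # Fault categories are the boolean flags that are set to true.
    categories = [c for c in ("compute", "network", "storage") if attrs.get(c)]
    faults = [
        {
            "server": d.get("serverIp"),
            # objectId -1 means the node itself; 0-7 is a card number.
            "target": "node" if str(d.get("objectId")) == "-1"
            else f"card {d.get('objectId')}",
            "kpi": d.get("kpiId"),
            "method": d.get("methodType"),
        }
        for d in attrs.get("abnormalDetail", [])
    ]
    return {
        "is_anomaly": bool(msg.get("is_anomaly")),
        "result_code": attrs.get("resultCode"),
        "categories": categories,
        "faults": faults,
    }


raw = """
{
  "Timestamp": 1730732076935,
  "Attributes": {
    "resultCode": 201,
    "compute": false, "network": false, "storage": true,
    "abnormalDetail": [{
      "objectId": "-1", "serverIp": "96.13.19.31",
      "deviceInfo": "96.13.19.31:8888*-1",
      "kpiId": "gala_gopher_disk_wspeed_kB", "methodType": "TIME",
      "kpiData": [], "relaIds": [], "omittedDevices": []
    }],
    "normalDetail": [], "errorMsg": ""
  },
  "SeverityText": "WARN", "SeverityNumber": 13, "is_anomaly": true
}
"""
result = summarize_message(raw)
print(result)
```

Comparing objectId through str() keeps the sketch working whether the field arrives as the string "-1", as in the sample, or as an integer.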