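gala-anteater User Guide

gala-anteater is an AI-based anomaly detection platform for operating systems. It provides time-series data preprocessing, anomaly point detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it is well suited to fault diagnosis on multi-dimensional, multi-modal data.

This document describes how to deploy and use the gala-anteater service.

Installation

Configure the repo sources: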

basic

[everything]
name=everything
baseurl=https://dl-cdn.openeuler.openatom.cn/openEuler-{version}/everything/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=https://dl-cdn.openeuler.openatom.cn/openEuler-{version}/EPOL/main/
enabled=1
gpgcheck=0
priority=1
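The `{version}` placeholder in both `baseurl` lines has to be replaced with the openEuler release in use. As a minimal sketch, the substitution can be scripted as below; the function name `render_repo` and the example release string are illustrative assumptions, not part of gala-anteater.

```python
# Template mirroring the two repo sections above; {version} is the
# placeholder that appears in the baseurl lines.
REPO_TEMPLATE = """\
[everything]
name=everything
baseurl=https://dl-cdn.openeuler.openatom.cn/openEuler-{version}/everything/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=https://dl-cdn.openeuler.openatom.cn/openEuler-{version}/EPOL/main/
enabled=1
gpgcheck=0
priority=1
"""


def render_repo(version: str) -> str:
    """Return the repo file text with {version} replaced by a release name."""
    if not version or "/" in version:
        raise ValueError(f"unexpected version string: {version!r}")
    return REPO_TEMPLATE.format(version=version)


if __name__ == "__main__":
    # "22.03-LTS-SP4" is only an example value; use the release you run.
    print(render_repo("22.03-LTS-SP4"))
```

The printed text can then be saved as a `.repo` file in the yum repository directory of the target host.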

Install gala-anteater:

bash

yum install gala-anteater

Configuration

Note: gala-anteater reads its startup parameters from a configuration file located at /etc/gala-anteater/config/gala-anteater.yaml.

Default configuration parameters

yaml

Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1

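Parameter	Description	Default
Global
data_source	Data source	"prometheus"
Arangodb
url	Address of the ArangoDB graph database	"http://localhost:8529"
db_name	Graph database name	"spider"
Kafka
server	IP address of the Kafka server; set it according to the IP of the node where it is installed
port	Port of the Kafka server, for example 9092
model_topic	Topic to which fault detection results are reported	"gala_anteater_hybrid_model"
rca_topic	Topic to which root cause localization results are reported	"gala_cause_inference"
meta_topic	Topic for the metric data collected by gala-gopher	"gala_gopher_metadata"
group_id	Kafka group name	"gala_anteater_kafka"
Prometheus
server	IP address of the Prometheus server; set it according to the IP of the node where it is installed
port	Port of the Prometheus server, for example 9090
steps	Metric sampling interval
Schedule
duration	How often the anomaly detection model runs, in minutes (one detection every x minutes)	1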
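Before starting the service it can help to sanity-check the values that must be adapted per node. The sketch below works on a plain dictionary that mirrors part of the default file shown above; the helper names `check_config` and `endpoints`, the list of required keys, and the `http://` scheme for Prometheus are assumptions made for illustration and are not taken from the gala-anteater code.

```python
# Defaults copied from the configuration file shown above (subset).
DEFAULT_CONFIG = {
    "Global": {"data_source": "prometheus"},
    "Kafka": {
        "server": "192.168.122.100",
        "port": "9092",
        "model_topic": "gala_anteater_hybrid_model",
        "rca_topic": "gala_cause_inference",
        "meta_topic": "gala_gopher_metadata",
        "group_id": "gala_anteater_kafka",
    },
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
    "Schedule": {"duration": 1},
}

# Hypothetical list of keys this sketch treats as mandatory.
REQUIRED_KEYS = {
    "Kafka": ("server", "port", "model_topic", "rca_topic", "meta_topic", "group_id"),
    "Prometheus": ("server", "port", "steps"),
    "Schedule": ("duration",),
}


def check_config(config: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if values.get(key) in (None, ""):
                problems.append(f"missing value: {section}.{key}")
    for section in ("Kafka", "Prometheus"):
        values = config.get(section)
        port = str(values.get("port", "")) if isinstance(values, dict) else ""
        if port and not (port.isdigit() and 0 < int(port) < 65536):
            problems.append(f"invalid port: {section}.port={port!r}")
    return problems


def endpoints(config: dict) -> dict:
    """Derive connection strings; the http:// scheme is an assumption."""
    kafka, prom = config["Kafka"], config["Prometheus"]
    return {
        "kafka_bootstrap": f"{kafka['server']}:{kafka['port']}",
        "prometheus_url": f"http://{prom['server']}:{prom['port']}",
    }


if __name__ == "__main__":
    print(check_config(DEFAULT_CONFIG))
    print(endpoints(DEFAULT_CONFIG))
```

In practice the dictionary would be loaded from the YAML file with a YAML parser; a literal is used here so the sketch has no third-party dependency.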

Startup

Run the following command to start gala-anteater:

shell

systemctl start gala-anteater

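Note: gala-anteater supports running a single process instance. Starting multiple instances leads to excessive memory usage and interleaved logs.

Fault injection

gala-anteater is a fault detection and root cause localization module. During testing, faults have to be created through fault injection so that the fault detection and root cause localization modules can report the faulty node and the root cause node of the fault propagation.

Fault injection (for reference only)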
bash
```
chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120
```
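chaosblade is a fault injection tool that can simulate a variety of faults, including but not limited to disk faults, network faults, and I/O faults.

Note: when different faults are injected, the metric collector (for example, gala-gopher) monitors the related metrics and reports them to Prometheus, and some of the related metrics show obvious fluctuations in the Prometheus graph view.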
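The reference command hard-codes a container overlay path, so in most tests the path, size, and timeout need to be replaced. A small sketch that assembles the same argument list from parameters is shown below. It only uses the flags that appear in the command above; the function name `disk_burn_command` is hypothetical, and the executable name simply mirrors this document and may differ on a given installation.

```python
import shlex


def disk_burn_command(path, size=10, timeout=120, read=True, write=True,
                      executable="chaosblade"):
    """Build the argument list for the disk burn example shown above."""
    if not (read or write):
        raise ValueError("enable at least one of read/write")
    argv = [executable, "create", "disk", "burn", "--size", str(size)]
    if read:
        argv.append("--read")
    if write:
        argv.append("--write")
    argv += ["--path", path, "--timeout", str(timeout)]
    return argv


if __name__ == "__main__":
    # "/data" is a placeholder path; point it at the directory under test.
    print(shlex.join(disk_burn_command("/data")))
```

Returning a list rather than a string keeps the path safe to pass to `subprocess.run` without shell quoting issues.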

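Checking the gala-anteater service status

If the log shows the following content, the service has started successfully. The startup log is also saved to the logs/anteater.log file under the current working directory.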

log

2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
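The startup check can be automated by scanning logs/anteater.log. The following sketch parses lines in the `time - logger - level - message` layout of the sample above, collects the per-epoch losses, and reports whether the scheduler start line was seen. Treating "Scheduler started" as the success marker is an assumption based on this sample log, and `summarize_startup` is a hypothetical helper name.

```python
import re

# Layout of the sample lines: "<time> - <logger> - <level> - <message>".
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)
EPOCH_RE = re.compile(
    r"Epoch\(s\):\s*(?P<epoch>\d+)\s+train Loss:\s*(?P<train>[\d.]+)\s+"
    r"validate Loss:\s*(?P<validate>[\d.]+)"
)


def summarize_startup(lines):
    """Collect training losses and whether the scheduler start line was seen."""
    losses = []
    scheduler_started = False
    for raw in lines:
        match = LINE_RE.match(raw.rstrip())
        if not match:
            continue  # skips "..." and any line that is not in the layout
        message = match.group("message")
        epoch = EPOCH_RE.search(message)
        if epoch:
            losses.append(
                (int(epoch["epoch"]), float(epoch["train"]), float(epoch["validate"]))
            )
        elif (match.group("logger") == "apscheduler.scheduler"
              and message == "Scheduler started"):
            scheduler_started = True
    return {"scheduler_started": scheduler_started, "losses": losses}


if __name__ == "__main__":
    with open("logs/anteater.log", encoding="utf-8") as handle:
        print(summarize_startup(handle))
```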

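Anomaly detection output data

When gala-anteater detects an anomaly, it writes the result to the Kafka topic configured as model_topic. The output data format is as follows: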

json

{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
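A consumer of model_topic typically needs only a few fields from each event. The sketch below takes one message as a JSON string and extracts them, ranking `recommend_metrics` by `score`. It assumes `Timestamp` is in milliseconds since the Unix epoch (consistent with the 13-digit value above); `summarize_anomaly` is a hypothetical helper, and reading from Kafka itself is left out so that the example needs only the standard library.

```python
import json
from datetime import datetime, timezone


def summarize_anomaly(message: str, top_n: int = 3) -> dict:
    """Extract key fields of one anomaly event and rank recommend_metrics."""
    event = json.loads(message)
    resource = event["Resource"]
    ranked = sorted(
        resource.get("recommend_metrics", {}).items(),
        key=lambda item: item[1]["score"],
        reverse=True,
    )
    return {
        # Assumption: Timestamp is milliseconds since the Unix epoch.
        "time": datetime.fromtimestamp(
            event["Timestamp"] / 1000, tz=timezone.utc
        ).isoformat(),
        "entity_id": event["Attributes"]["entity_id"],
        "metric": resource["metrics"],
        "anomaly_ratio": resource.get("anomaly_ratio"),
        "severity": event.get("SeverityText"),
        "top_metrics": [(name, info["score"]) for name, info in ranked[:top_n]],
    }


if __name__ == "__main__":
    # Trimmed-down event with the same field names as the sample above.
    sample = json.dumps({
        "Timestamp": 1659075600000,
        "Attributes": {"entity_id": "xxxxxx_sli_1513_18"},
        "Resource": {
            "anomaly_ratio": 1.0,
            "recommend_metrics": {
                "gala_gopher_tcp_link_notack_bytes": {"score": 0.24421279500639545}
            },
            "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec",
        },
        "SeverityText": "WARN",
    })
    print(summarize_anomaly(sample))
```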

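Root cause localization output data

Each abnormal node in the anomaly detection result triggers root cause localization, and the result is reported to the Kafka topic configured as rca_topic. The output data format is as follows: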

json

{
  "Timestamp": 1724287883452,
  "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
  "Attributes": {
    "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
    "event_source": "root-cause-inference"
  },
  "Resource": {
    "abnormal_kpi": {
      "metric_id": "gala_gopher_l7_latency_sum",
      "entity_id": "",
      "metric_labels": {
        "client_ip": "192.168.11.103",
        "comm": "python",
        "container_id": "83d0c2f4a7f4",
        "container_image": "ba2d060a624e",
        "container_name": "/k8s_backend_backend-node2-01-5bcb47fd7c-4jxxs_default_475ae627",
        "instance": "192.168.122.102:8888",
        "job": "192.168.122.102",
        "l4_role": "tcp_server",
        "l7_role": "server",
        "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
        "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
        "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
        "pod_namespace": "default",
        "protocol": "http",
        "server_ip": "192.168.11.102",
        "server_port": "26",
        "ssl": "no_ssl",
        "tgid": "3459438"
      },
      "desc": "L7 session averaged latency.",
      "score": 0.3498585816683402
    },
    "cause_metrics": [
      {
        "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
        "entity_id": "",
        "metric_labels": {
          "container_id": "1319ff912a6f",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node3-02-654dd97bf9-s8jg5_default_4a9fcc23",
          "instance": "192.168.122.103:8888",
          "job": "192.168.122.103",
          "machine_id": "494a61be-23cc-4c97-a871-902866e43747-192.168.122.103",
          "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
          "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
          "pod_namespace": "default"
        },
        "desc": "\u5bb9\u56681s\u5185\u7528\u6237\u6001CPU\u8d1f\u8f7d",
        "keyword": "process",
        "score": 0.1194249668036936,
        "path": [
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
        "entity_id": "",
        "metric_labels": {
          "cmdline": "python ./backend.py ",
          "comm": "python",
          "container_id": "de570c7328bb",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-02-548c79d989-bnl9g_default_67134fb4",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pgid": "3459969",
          "pod": "default/backend-node2-02-548c79d989-bnl9g",
          "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
          "pod_namespace": "default",
          "ppid": "3459936",
          "start_time": "1139543501",
          "tgid": "3459969"
        },
        "desc": "\u8fdb\u7a0b\u7cfb\u7edf\u8c03\u7528\u81f3FS\u7684\u5199\u5b57\u8282\u6570",
        "keyword": "process",
        "score": 0.37121879175399997,
        "path": [
          {
            "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
            "pod": "default/backend-node2-02-548c79d989-bnl9g",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "normal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952",
        "entity_id": "",
        "metric_labels": {
          "client_ip": "192.168.11.103",
          "comm": "python",
          "container_id": "eef1ca1082a7",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-03-584f4c6cfd-w4d2b_default_956c70a2",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "l4_role": "tcp_server",
          "l7_role": "server",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
          "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
          "pod_namespace": "default",
          "protocol": "http",
          "server_ip": "192.168.11.113",
          "server_port": "26",
          "ssl": "no_ssl",
          "tgid": "3460169"
        },
        "desc": "L7 session averaged latency.",
        "keyword": null,
        "score": 0.5624857367147617,
        "path": [
          {
            "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
            "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      }
    ]
  },
  "desc": "L7 session averaged latency.",
  "top1": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929\u5f02\u5e38",
  "top2": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43\u5f02\u5e38",
  "top3": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952\u5f02\u5e38",
  "keywords": [
    "process",
    null
  ],
  "SeverityText": "WARN",
  "SeverityNumber": 13,
  "Body": "A cause inferring event for an abnormal event"
}
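To read such an event quickly, each entry of `cause_metrics` can be reduced to its metric name, pod ID, score, and propagation path. The sketch below keeps the entries in their listed order, because in the sample above that order matches `top1` to `top3` while the `score` values are not in descending order. Splitting `metric_id` at `@` into a metric name and a pod ID is an assumption drawn from the sample, and `summarize_root_causes` is a hypothetical helper.

```python
import json


def summarize_root_causes(message: str) -> list:
    """Summarize cause_metrics of one root-cause event in their listed order."""
    event = json.loads(message)
    summary = []
    for rank, cause in enumerate(event["Resource"]["cause_metrics"], start=1):
        # In the sample, metric_id looks like "<metric name>@<pod id>".
        metric, _, pod_id = cause["metric_id"].partition("@")
        path = " -> ".join(
            f"{hop['pod']}({hop['pod_state']})" for hop in cause.get("path", [])
        )
        summary.append({
            "rank": rank,
            "metric": metric,
            "pod_id": pod_id,
            "score": cause["score"],
            "path": path,
        })
    return summary


if __name__ == "__main__":
    # Trimmed-down event using the field names of the sample above.
    sample = json.dumps({"Resource": {"cause_metrics": [{
        "metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
        "score": 0.37121879175399997,
        "path": [
            {"pod": "default/backend-node2-02-548c79d989-bnl9g", "pod_state": "normal"},
            {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs", "pod_state": "abnormal"},
        ],
    }]}})
    for row in summarize_root_causes(sample):
        print(row["rank"], row["metric"], row["score"], row["path"])
```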
