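      gala-anteater User Guide

      gala-anteater is an AI-based anomaly detection platform for operating systems. It provides time-series data preprocessing, anomaly detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it is well suited to fault diagnosis on multi-dimensional, multi-modal data.

      This document describes how to deploy and use the gala-anteater service.

      Installation

      Configure the repo sources: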

      [everything]
      name=everything
      baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.09/EBS-openEuler-24.09/everything/$basearch/
      enabled=1
      gpgcheck=0
      priority=1
      
      [EPOL]
      name=EPOL
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
      enabled=1
      gpgcheck=0
      priority=1
      

      Install gala-anteater:

      yum install gala-anteater
      

      Configuration

      Note:

      gala-anteater reads its startup parameters from a configuration file located at /etc/gala-anteater/config/gala-anteater.yaml.

      Default parameters in the configuration file:
      Global:
        data_source: "prometheus"
      
      Arangodb:
        url: "http://localhost:8529"
        db_name: "spider"
      
      Kafka:
        server: "192.168.122.100"
        port: "9092"
        model_topic: "gala_anteater_hybrid_model"
        rca_topic: "gala_cause_inference"
        meta_topic: "gala_gopher_metadata"
        group_id: "gala_anteater_kafka"
        # auth_type: plaintext/sasl_plaintext, please set "" for no auth
        auth_type: ""
        username: ""
        password: ""
      
      Prometheus:
        server: "localhost"
        port: "9090"
        steps: "5"
      
      Aom:
        base_url: ""
        project_id: ""
        auth_type: "token"
        auth_info:
          iam_server: ""
          iam_domain: ""
          iam_user_name: ""
          iam_password: ""
          ssl_verify: 0
      
      Schedule:
        duration: 1
      
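      | Parameter   | Description                                                                              | Default                      |
      |-------------|------------------------------------------------------------------------------------------|------------------------------|
      | Global      |                                                                                          |                              |
      | data_source | Data source                                                                              | "prometheus"                 |
      | Arangodb    |                                                                                          |                              |
      | url         | Address of the ArangoDB graph database                                                   | "http://localhost:8529"      |
      | db_name     | Name of the graph database                                                               | "spider"                     |
      | Kafka       |                                                                                          |                              |
      | server      | IP address of the Kafka server; set it according to the node where Kafka is installed    | -                            |
      | port        | Port of the Kafka server, for example 9092                                               | -                            |
      | model_topic | Topic to which fault detection results are reported                                      | "gala_anteater_hybrid_model" |
      | rca_topic   | Topic to which root cause localization results are reported                              | "gala_cause_inference"       |
      | meta_topic  | Topic carrying the metric data collected by gopher                                       | "gala_gopher_metadata"       |
      | group_id    | Kafka consumer group name                                                                | "gala_anteater_kafka"        |
      | Prometheus  |                                                                                          |                              |
      | server      | IP address of the Prometheus server; set it according to the node where it is installed  | -                            |
      | port        | Port of the Prometheus server, for example 9090                                          | -                            |
      | steps       | Metric sampling interval                                                                 | -                            |
      | Schedule    |                                                                                          |                              |
      | duration    | How often the anomaly detection model runs, in minutes (one detection every x minutes)   | 1                            |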
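Before starting the service, it can help to confirm that every section and key listed above is present and non-empty. The following is a minimal sketch and not part of gala-anteater: `REQUIRED_KEYS` and `check_config` are hypothetical names, and the input is assumed to be the YAML file already parsed into a dictionary (for example with PyYAML's `yaml.safe_load`).

```python
from typing import Dict, List

# Sections and keys taken from the default configuration shown above.
# The Aom section is left out because its fields are empty by default.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "Global": ["data_source"],
    "Arangodb": ["url", "db_name"],
    "Kafka": ["server", "port", "model_topic", "rca_topic", "meta_topic", "group_id"],
    "Prometheus": ["server", "port", "steps"],
    "Schedule": ["duration"],
}


def check_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means nothing is missing."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if values.get(key) in (None, ""):
                problems.append(f"missing or empty value: {section}.{key}")
    return problems


# Example input: the Kafka server is left empty and Schedule is absent.
example = {
    "Global": {"data_source": "prometheus"},
    "Arangodb": {"url": "http://localhost:8529", "db_name": "spider"},
    "Kafka": {
        "server": "",
        "port": "9092",
        "model_topic": "gala_anteater_hybrid_model",
        "rca_topic": "gala_cause_inference",
        "meta_topic": "gala_gopher_metadata",
        "group_id": "gala_anteater_kafka",
    },
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
}
problems = check_config(example)
print(problems)
```

The check only looks at presence, not at whether the addresses are reachable.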

      Startup

      Run the following command to start gala-anteater:

      systemctl start gala-anteater
      

      Note:

      gala-anteater supports only one running process instance. Starting multiple instances causes excessive memory usage and interleaved logs.

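      Fault Injection

      gala-anteater is a fault detection and root cause localization module. During testing, faults need to be constructed through fault injection so that the fault detection and root cause localization modules can report the faulty node and the root-cause node along the fault propagation path.

      • Fault injection (for reference only)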
        chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120
        
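        * chaosblade is a fault injection tool that can simulate a variety of faults, including but not limited to disk, network, and I/O faults. Note: when different faults are injected, the metric collector (for example, gala-gopher) monitors the related metrics and reports them to Prometheus, and the related metrics in the Prometheus graph view show clear fluctuations.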
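To inspect the fluctuation outside the graph view, the same metric can be requested from the Prometheus HTTP API endpoint `/api/v1/query_range`. The sketch below only builds the request URL. `build_query_range_url` is a hypothetical helper, the server and port are the defaults from the configuration file, the metric name is borrowed from the sample output in this guide, and the timestamps and 5-second step are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_query_range_url(server: str, port: str, metric: str,
                          start: int, end: int, step_seconds: int) -> str:
    """Build a Prometheus range-query URL for one metric over [start, end]."""
    params = urlencode({
        "query": metric,
        "start": start,               # Unix timestamp in seconds
        "end": end,                   # Unix timestamp in seconds
        "step": f"{step_seconds}s",   # resolution of the returned series
    })
    return f"http://{server}:{port}/api/v1/query_range?{params}"


# A 120-second window, matching the --timeout value of the injection command.
url = build_query_range_url("localhost", "9090",
                            "gala_gopher_tcp_link_notack_bytes",
                            start=1661997000, end=1661997120, step_seconds=5)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is JSON containing one series per label set.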

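      Checking the gala-anteater Service Status

      If the log shows the following content, the service has started successfully. The startup log is also saved to logs/anteater.log under the current working directory.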

      2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
      2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
      2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
      2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
      2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
      2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
      2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
      2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
      2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
      2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
      2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
      2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
      2022-09-01 17:53:38,456 - root - INFO - Using cpu device
      2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
      2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
      2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
      2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
      2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
      ...
      2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
      2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
      2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
      2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
      2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
      2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
      2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
      2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
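The log above can also be checked programmatically. The following is a minimal sketch: `summarize_startup_log` is a hypothetical helper, and treating the `Scheduler started` line as the success marker is an assumption based on the sample log rather than a documented contract.

```python
import re

# Matches lines such as "Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00"
EPOCH_PATTERN = re.compile(
    r"Epoch\(s\): (\d+)\s+train Loss: ([\d.]+)\s+validate Loss: ([\d.]+)"
)


def summarize_startup_log(lines):
    """Return (scheduler_started, {epoch: (train_loss, validate_loss)})."""
    losses = {}
    scheduler_started = False
    for line in lines:
        match = EPOCH_PATTERN.search(line)
        if match:
            losses[int(match.group(1))] = (float(match.group(2)), float(match.group(3)))
        elif "Scheduler started" in line:
            scheduler_started = True
    return scheduler_started, losses


# Three lines copied from the sample log above.
sample = [
    "2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00",
    "2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58",
    "2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started",
]
started, losses = summarize_startup_log(sample)
print(started, losses[0], losses[99])
```

To run it against a real log, pass an open file object, for example `summarize_startup_log(open("logs/anteater.log"))`.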
      

      Anomaly Detection Output Data

      If gala-anteater detects an anomaly, it writes the result to the Kafka topic set in model_topic. The output format is as follows:

      {
         "Timestamp":1659075600000,
         "Attributes":{
            "entity_id":"xxxxxx_sli_1513_18",
            "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
            "event_type":"app"
         },
         "Resource":{
            "anomaly_score":1.0,
            "anomaly_count":13,
            "total_count":13,
            "duration":60,
            "anomaly_ratio":1.0,
            "metric_label":{
               "machine_id":"1fd37742xxxx",
               "tgid":"1513",
               "conn_fd":"18"
            },
            "recommend_metrics":{
               "gala_gopher_tcp_link_notack_bytes":{
                  "label":{
                     "__name__":"gala_gopher_tcp_link_notack_bytes",
                     "client_ip":"x.x.x.165",
                     "client_port":"51352",
                     "hostname":"localhost.localdomain",
                     "instance":"x.x.x.172:8888",
                     "job":"prometheus-x.x.x.172",
                     "machine_id":"xxxxxx",
                     "protocol":"2",
                     "role":"0",
                     "server_ip":"x.x.x.172",
                     "server_port":"8888",
                     "tgid":"3381701"
                  },
                  "score":0.24421279500639545
               },
               ...
            },
            "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
         },
         "SeverityText":"WARN",
         "SeverityNumber":14,
         "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
      }
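A consumer of model_topic typically needs only a few fields from each event. The sketch below parses one message and ranks the recommended metrics by score. `summarize_anomaly` is a hypothetical helper, the message is a trimmed copy of the sample above, and reading from Kafka itself is left out.

```python
import json

# Trimmed copy of the sample event above. "hypothetical_metric" is a made-up
# placeholder added only so that the ranking has more than one entry.
message = json.dumps({
    "Timestamp": 1659075600000,
    "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
    "Resource": {
        "anomaly_ratio": 1.0,
        "recommend_metrics": {
            "hypothetical_metric": {"label": {}, "score": 0.1},
            "gala_gopher_tcp_link_notack_bytes": {"label": {}, "score": 0.24421279500639545},
        },
        "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec",
    },
    "SeverityText": "WARN",
})


def summarize_anomaly(raw: str, limit: int = 3) -> dict:
    """Extract the key fields of one anomaly event from its JSON text."""
    event = json.loads(raw)
    resource = event["Resource"]
    # Highest score first.
    ranked = sorted(resource.get("recommend_metrics", {}).items(),
                    key=lambda item: item[1]["score"], reverse=True)
    return {
        "entity_id": event["Attributes"]["entity_id"],
        "metric": resource["metrics"],
        "anomaly_ratio": resource["anomaly_ratio"],
        "severity": event["SeverityText"],
        "recommended": [name for name, _ in ranked[:limit]],
    }


summary = summarize_anomaly(message)
print(summary)
```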
      

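      Root Cause Localization Output Data

      Each abnormal node in the anomaly detection results triggers root cause localization, and the result is reported to the Kafka topic set in rca_topic. The output format is as follows: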

      {
        "Timestamp": 1724287883452,
        "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
        "Attributes": {
          "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
          "event_source": "root-cause-inference"
        },
        "Resource": {
          "abnormal_kpi": {
            "metric_id": "gala_gopher_l7_latency_sum",
            "entity_id": "",
            "metric_labels": {
              "client_ip": "192.168.11.103",
              "comm": "python",
              "container_id": "83d0c2f4a7f4",
              "container_image": "ba2d060a624e",
              "container_name": "/k8s_backend_backend-node2-01-5bcb47fd7c-4jxxs_default_475ae627",
              "instance": "192.168.122.102:8888",
              "job": "192.168.122.102",
              "l4_role": "tcp_server",
              "l7_role": "server",
              "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
              "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
              "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
              "pod_namespace": "default",
              "protocol": "http",
              "server_ip": "192.168.11.102",
              "server_port": "26",
              "ssl": "no_ssl",
              "tgid": "3459438"
            },
            "desc": "L7 session averaged latency.",
            "score": 0.3498585816683402
          },
          "cause_metrics": [
            {
              "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
              "entity_id": "",
              "metric_labels": {
                "container_id": "1319ff912a6f",
                "container_image": "ba2d060a624e",
                "container_name": "/k8s_backend_backend-node3-02-654dd97bf9-s8jg5_default_4a9fcc23",
                "instance": "192.168.122.103:8888",
                "job": "192.168.122.103",
                "machine_id": "494a61be-23cc-4c97-a871-902866e43747-192.168.122.103",
                "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
                "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                "pod_namespace": "default"
              },
              "desc": "\u5bb9\u56681s\u5185\u7528\u6237\u6001CPU\u8d1f\u8f7d",
              "keyword": "process",
              "score": 0.1194249668036936,
              "path": [
                {
                  "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                  "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
                  "instance": "192.168.122.103:8888",
                  "job": "192.168.122.103",
                  "pod_state": "normal"
                },
                {
                  "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
                  "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
                  "instance": "192.168.122.102:8888",
                  "job": "192.168.122.102",
                  "pod_state": "abnormal"
                }
              ]
            },
            {
              "metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
              "entity_id": "",
              "metric_labels": {
                "cmdline": "python ./backend.py ",
                "comm": "python",
                "container_id": "de570c7328bb",
                "container_image": "ba2d060a624e",
                "container_name": "/k8s_backend_backend-node2-02-548c79d989-bnl9g_default_67134fb4",
                "instance": "192.168.122.102:8888",
                "job": "192.168.122.102",
                "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
                "pgid": "3459969",
                "pod": "default/backend-node2-02-548c79d989-bnl9g",
                "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
                "pod_namespace": "default",
                "ppid": "3459936",
                "start_time": "1139543501",
                "tgid": "3459969"
              },
              "desc": "\u8fdb\u7a0b\u7cfb\u7edf\u8c03\u7528\u81f3FS\u7684\u5199\u5b57\u8282\u6570",
              "keyword": "process",
              "score": 0.37121879175399997,
              "path": [
                {
                  "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
                  "pod": "default/backend-node2-02-548c79d989-bnl9g",
                  "instance": "192.168.122.102:8888",
                  "job": "192.168.122.102",
                  "pod_state": "normal"
                },
                {
                  "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                  "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
                  "instance": "192.168.122.103:8888",
                  "job": "192.168.122.103",
                  "pod_state": "normal"
                },
                {
                  "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
                  "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
                  "instance": "192.168.122.102:8888",
                  "job": "192.168.122.102",
                  "pod_state": "abnormal"
                }
              ]
            },
            {
              "metric_id": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952",
              "entity_id": "",
              "metric_labels": {
                "client_ip": "192.168.11.103",
                "comm": "python",
                "container_id": "eef1ca1082a7",
                "container_image": "ba2d060a624e",
                "container_name": "/k8s_backend_backend-node2-03-584f4c6cfd-w4d2b_default_956c70a2",
                "instance": "192.168.122.102:8888",
                "job": "192.168.122.102",
                "l4_role": "tcp_server",
                "l7_role": "server",
                "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
                "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
                "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
                "pod_namespace": "default",
                "protocol": "http",
                "server_ip": "192.168.11.113",
                "server_port": "26",
                "ssl": "no_ssl",
                "tgid": "3460169"
              },
              "desc": "L7 session averaged latency.",
              "keyword": null,
              "score": 0.5624857367147617,
              "path": [
                {
                  "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
                  "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
                  "instance": "192.168.122.102:8888",
                  "job": "192.168.122.102",
                  "pod_state": "abnormal"
                },
                {
                  "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                  "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
                  "instance": "192.168.122.103:8888",
                  "job": "192.168.122.103",
                  "pod_state": "normal"
                },
                {
                  "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
                  "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
                  "instance": "192.168.122.102:8888",
                  "job": "192.168.122.102",
                  "pod_state": "abnormal"
                }
              ]
            }
          ]
        },
        "desc": "L7 session averaged latency.",
        "top1": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929\u5f02\u5e38",
        "top2": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43\u5f02\u5e38",
        "top3": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952\u5f02\u5e38",
        "keywords": [
          "process",
          null
        ],
        "SeverityText": "WARN",
        "SeverityNumber": 13,
        "Body": "A cause inferring event for an abnormal event"
      }
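The `cause_metrics` list is easier to read once each entry is reduced to its metric, score, and path. The sketch below does this for a trimmed copy of the first entry in the sample above. `describe_causes` is a hypothetical helper, and reading each `path` as running from the suspected root-cause pod to the pod where the abnormal KPI was observed is an assumption drawn from the sample.

```python
def describe_causes(event: dict):
    """Return (metric_id, score, path) for each cause metric, in listed order."""
    rows = []
    for cause in event["Resource"]["cause_metrics"]:
        # Render each hop as "pod(state)" and join the hops with arrows.
        path = " -> ".join(f'{hop["pod"]}({hop["pod_state"]})' for hop in cause["path"])
        rows.append((cause["metric_id"], cause["score"], path))
    return rows


# Trimmed copy of the sample event: one cause metric with its two-hop path.
event = {
    "Resource": {
        "cause_metrics": [
            {
                "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                "score": 0.1194249668036936,
                "path": [
                    {"pod": "default/backend-node3-02-654dd97bf9-s8jg5", "pod_state": "normal"},
                    {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs", "pod_state": "abnormal"},
                ],
            }
        ]
    }
}
rows = describe_causes(event)
for metric_id, score, path in rows:
    print(f"{score:.3f}  {metric_id}\n       {path}")
```

The entries are kept in their listed order rather than sorted by score, because the `top1` to `top3` fields in the sample do not follow descending score.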
      
