Version: 25.03

Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Configure the repositories.

conf
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.09/EBS-openEuler-24.09/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

bash
yum install gala-anteater

Configuration

Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.

Configuration Parameters

yaml
Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1

| Parameter | Description | Default Value |
| --- | --- | --- |
| Global | | |
| data_source | Data source | "prometheus" |
| Arangodb | | |
| url | IP address of the ArangoDB graph database | "http://localhost:8529" |
| db_name | Name of the ArangoDB database | "spider" |
| Kafka | | |
| server | IP address of the Kafka server. Configure according to the installation node IP address. | |
| port | Port of the Kafka server (for example, 9092) | |
| model_topic | Topic for reporting fault detection results | "gala_anteater_hybrid_model" |
| rca_topic | Topic for reporting root cause analysis results | "gala_cause_inference" |
| meta_topic | Topic for gopher to collect metric data | "gala_gopher_metadata" |
| group_id | Kafka group ID | "gala_anteater_kafka" |
| Prometheus | | |
| server | IP address of the Prometheus server. Configure according to the installation node IP address. | |
| port | Port of the Prometheus server (for example, 9090) | |
| steps | Metric sampling interval | |
| Schedule | | |
| duration | Interval (in minutes) between anomaly detection model executions | 1 |
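
As a rough illustration of how these parameters fit together, the following Python sketch derives the Kafka and Prometheus endpoints from a dictionary that mirrors the sample configuration and runs a few basic sanity checks. The helper names (kafka_bootstrap, prometheus_base_url, check_config) are hypothetical and are not part of gala-anteater, and the http scheme for Prometheus is an assumption.

```python
# Hypothetical helpers for illustration only; not part of gala-anteater.
# The dictionary mirrors a subset of the sample gala-anteater.yaml above.
SAMPLE_CONFIG = {
    "Global": {"data_source": "prometheus"},
    "Kafka": {"server": "192.168.122.100", "port": "9092", "auth_type": ""},
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
    "Schedule": {"duration": 1},
}


def kafka_bootstrap(config):
    """Join the Kafka server and port into a host:port string."""
    kafka = config["Kafka"]
    return f"{kafka['server']}:{kafka['port']}"


def prometheus_base_url(config, scheme="http"):
    """Build the Prometheus base URL (the scheme is an assumption)."""
    prom = config["Prometheus"]
    return f"{scheme}://{prom['server']}:{prom['port']}"


def check_config(config):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    for section in ("Kafka", "Prometheus"):
        values = config.get(section, {})
        if not values.get("server"):
            problems.append(f"{section}.server is empty")
        if not str(values.get("port", "")).isdigit():
            problems.append(f"{section}.port is not a number")
    # Allowed values are taken from the comment in the sample configuration.
    auth = config.get("Kafka", {}).get("auth_type", "")
    if auth not in ("", "plaintext", "sasl_plaintext"):
        problems.append(f"Kafka.auth_type has unexpected value: {auth!r}")
    duration = config.get("Schedule", {}).get("duration")
    if not isinstance(duration, int) or duration <= 0:
        problems.append("Schedule.duration must be a positive integer")
    return problems


print(kafka_bootstrap(SAMPLE_CONFIG))
print(prometheus_base_url(SAMPLE_CONFIG))
print(check_config(SAMPLE_CONFIG))
```

A check of this kind can be run before starting the service to catch an empty server address or a mistyped port early.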

Start

Start gala-anteater.

bash
systemctl start gala-anteater

Fault Injection

gala-anteater is a fault detection and root cause locating module. In the testing phase, you need to inject faults to construct fault scenarios. This allows gala-anteater to obtain information about faulty nodes and the root cause nodes of fault propagation.

  • Fault injection (for reference only)

    bash
    chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120

    ChaosBlade is a fault injection tool that can simulate various faults, including but not limited to drive faults, network faults, and I/O faults. Note: Injecting different faults will cause corresponding fluctuations in related metrics monitored and reported to the Prometheus module by metric collectors (such as gala-gopher). These fluctuations will be visible in the Prometheus graph.
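
To confirm that an injected fault is visible in Prometheus without opening the graph page, the metric can also be fetched over the Prometheus HTTP API. The sketch below only builds a query_range URL; the function name build_query_range_url is hypothetical, the metric name is borrowed from the sample output later in this chapter purely as an example, and the timestamps are placeholders.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, metric, start, end, step):
    """Build a Prometheus /api/v1/query_range URL for one metric.

    start and end are Unix timestamps in seconds; step is the
    resolution in seconds.
    """
    params = urlencode(
        {"query": metric, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


# Example values only: a 120-second window matching the --timeout above,
# sampled every 5 seconds as in the sample Prometheus "steps" setting.
url = build_query_range_url(
    "http://localhost:9090",
    "gala_gopher_proc_wchar_bytes",
    start=1659075600,
    end=1659075720,
    step=5,
)
print(url)
```

The resulting URL can be requested with any HTTP client; which metrics actually fluctuate depends on the injected fault and on the probes enabled in the collector.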

gala-anteater Service Status Query

If log output similar to the following is displayed, the service has started successfully. The startup log is saved to the logs/anteater.log file in the current running directory.

log
2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
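
The same check can be scripted. The following sketch scans log lines for the training epochs and the "Scheduler started" message; it assumes the log format matches the sample above, and the function name summarize_startup_log is hypothetical.

```python
import re

# Matches lines such as:
#   ... Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
EPOCH_RE = re.compile(
    r"Epoch\(s\):\s*(\d+)\s+train Loss:\s*([\d.]+)\s+validate Loss:\s*([\d.]+)"
)


def summarize_startup_log(lines):
    """Collect (epoch, train_loss, validate_loss) tuples and note whether
    the scheduler start message was seen."""
    epochs = []
    scheduler_started = False
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            epochs.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
        elif "Scheduler started" in line:
            scheduler_started = True
    return {"epochs": epochs, "scheduler_started": scheduler_started}


# A few lines copied from the sample log above.
SAMPLE_LINES = [
    "2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00",
    "2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58",
    "2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started",
]

summary = summarize_startup_log(SAMPLE_LINES)
print(summary["scheduler_started"], len(summary["epochs"]))
```

To run it against a real log, pass the lines of logs/anteater.log instead of SAMPLE_LINES.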

Output Data of Fault Detection

If gala-anteater detects an exception, it sends the result to model_topic of Kafka. The output data format is as follows:

json
{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
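
A consumer of model_topic typically needs only a few of these fields. The sketch below parses one event and lists the recommended metrics in descending score order. It is a minimal example based on the sample above: the function name summarize_detection_event is hypothetical, the sample is trimmed, and reading the message from Kafka is left out.

```python
import json

# Trimmed copy of the sample event above; the elided "..." entry is omitted.
SAMPLE_EVENT = """
{
  "Timestamp": 1659075600000,
  "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
  "Resource": {
    "anomaly_score": 1.0,
    "anomaly_count": 13,
    "total_count": 13,
    "anomaly_ratio": 1.0,
    "recommend_metrics": {
      "gala_gopher_tcp_link_notack_bytes": {"score": 0.24421279500639545}
    },
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec"
  },
  "SeverityText": "WARN",
  "SeverityNumber": 14
}
"""


def summarize_detection_event(raw):
    """Extract the key fields of a fault detection event (JSON text)."""
    event = json.loads(raw)
    resource = event["Resource"]
    # Rank the recommended metrics from highest to lowest score.
    ranked = sorted(
        ((name, info["score"])
         for name, info in resource.get("recommend_metrics", {}).items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "entity_id": event["Attributes"]["entity_id"],
        "severity": event["SeverityText"],
        "metric": resource["metrics"],
        "anomaly_ratio": resource["anomaly_ratio"],
        "top_metrics": ranked,
    }


detection = summarize_detection_event(SAMPLE_EVENT)
print(detection["entity_id"], detection["severity"], detection["metric"])
```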

Output Data of Root Cause Locating

Each faulty node detected triggers root cause locating. Results of root cause locating are sent to rca_topic of Kafka. The output data format is as follows:

json
{
  "Timestamp": 1724287883452,
  "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
  "Attributes": {
    "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
    "event_source": "root-cause-inference"
  },
  "Resource": {
    "abnormal_kpi": {
      "metric_id": "gala_gopher_l7_latency_sum",
      "entity_id": "",
      "metric_labels": {
        "client_ip": "192.168.11.103",
        "comm": "python",
        "container_id": "83d0c2f4a7f4",
        "container_image": "ba2d060a624e",
        "container_name": "/k8s_backend_backend-node2-01-5bcb47fd7c-4jxxs_default_475ae627",
        "instance": "192.168.122.102:8888",
        "job": "192.168.122.102",
        "l4_role": "tcp_server",
        "l7_role": "server",
        "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
        "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
        "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
        "pod_namespace": "default",
        "protocol": "http",
        "server_ip": "192.168.11.102",
        "server_port": "26",
        "ssl": "no_ssl",
        "tgid": "3459438"
      },
      "desc": "L7 session averaged latency.",
      "score": 0.3498585816683402
    },
    "cause_metrics": [
      {
        "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
        "entity_id": "",
        "metric_labels": {
          "container_id": "1319ff912a6f",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node3-02-654dd97bf9-s8jg5_default_4a9fcc23",
          "instance": "192.168.122.103:8888",
          "job": "192.168.122.103",
          "machine_id": "494a61be-23cc-4c97-a871-902866e43747-192.168.122.103",
          "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
          "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
          "pod_namespace": "default"
        },
        "desc": "\u5bb9\u56681s\u5185\u7528\u6237\u6001CPU\u8d1f\u8f7d",
        "keyword": "process",
        "score": 0.1194249668036936,
        "path": [
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
        "entity_id": "",
        "metric_labels": {
          "cmdline": "python ./backend.py ",
          "comm": "python",
          "container_id": "de570c7328bb",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-02-548c79d989-bnl9g_default_67134fb4",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pgid": "3459969",
          "pod": "default/backend-node2-02-548c79d989-bnl9g",
          "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
          "pod_namespace": "default",
          "ppid": "3459936",
          "start_time": "1139543501",
          "tgid": "3459969"
        },
        "desc": "\u8fdb\u7a0b\u7cfb\u7edf\u8c03\u7528\u81f3FS\u7684\u5199\u5b57\u8282\u6570",
        "keyword": "process",
        "score": 0.37121879175399997,
        "path": [
          {
            "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
            "pod": "default/backend-node2-02-548c79d989-bnl9g",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "normal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952",
        "entity_id": "",
        "metric_labels": {
          "client_ip": "192.168.11.103",
          "comm": "python",
          "container_id": "eef1ca1082a7",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-03-584f4c6cfd-w4d2b_default_956c70a2",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "l4_role": "tcp_server",
          "l7_role": "server",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
          "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
          "pod_namespace": "default",
          "protocol": "http",
          "server_ip": "192.168.11.113",
          "server_port": "26",
          "ssl": "no_ssl",
          "tgid": "3460169"
        },
        "desc": "L7 session averaged latency.",
        "keyword": null,
        "score": 0.5624857367147617,
        "path": [
          {
            "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
            "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      }
    ]
  },
  "desc": "L7 session averaged latency.",
  "top1": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929\u5f02\u5e38",
  "top2": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43\u5f02\u5e38",
  "top3": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952\u5f02\u5e38",
  "keywords": [
    "process",
    null
  ],
  "SeverityText": "WARN",
  "SeverityNumber": 13,
  "Body": "A cause inferring event for an abnormal event"
}
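
To read such a result programmatically, the cause_metrics list can be walked and each propagation path rendered as a chain of pods. The sketch below is a minimal example on a trimmed copy of the sample; the function name rank_causes is hypothetical. Sorting by score is an illustrative choice only: in the sample above, top1 to top3 follow the order of the cause_metrics list rather than the score order.

```python
import json

# Trimmed copy of the sample above: only metric_id, score, and the pod and
# pod_state of each path hop are kept.
SAMPLE_RCA = """
{
  "Resource": {
    "cause_metrics": [
      {"metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
       "score": 0.1194249668036936,
       "path": [
         {"pod": "default/backend-node3-02-654dd97bf9-s8jg5", "pod_state": "normal"},
         {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs", "pod_state": "abnormal"}
       ]},
      {"metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
       "score": 0.37121879175399997,
       "path": [
         {"pod": "default/backend-node2-02-548c79d989-bnl9g", "pod_state": "normal"},
         {"pod": "default/backend-node3-02-654dd97bf9-s8jg5", "pod_state": "normal"},
         {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs", "pod_state": "abnormal"}
       ]}
    ]
  }
}
"""


def rank_causes(raw):
    """Return (metric_id, score, path) tuples, highest score first."""
    event = json.loads(raw)
    causes = event["Resource"]["cause_metrics"]
    ranked = sorted(causes, key=lambda cause: cause["score"], reverse=True)
    result = []
    for cause in ranked:
        # Render the path as "pod(state) -> pod(state) -> ...".
        path = " -> ".join(
            f"{hop['pod']}({hop['pod_state']})" for hop in cause["path"]
        )
        result.append((cause["metric_id"], cause["score"], path))
    return result


causes = rank_causes(SAMPLE_RCA)
for metric_id, score, path in causes:
    print(f"{score:.3f}  {metric_id}\n       {path}")
```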