Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Configure the Yum repositories.

conf

[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.09/EBS-openEuler-24.09/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

bash

yum install gala-anteater

Configuration

Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.

Configuration Parameters

yaml

Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1

Parameter	Description	Default Value
Global
data_source	Data source	"prometheus"
Arangodb
url	IP address of the ArangoDB graph database	"http://localhost:8529"
db_name	Name of the ArangoDB database	"spider"
Kafka
server	IP address of the Kafka server. Configure according to the installation node IP address.
port	Port of the Kafka server (for example, 9092)
model_topic	Topic for reporting fault detection results	"gala_anteater_hybrid_model"
rca_topic	Topic for reporting root cause analysis results	"gala_cause_inference"
meta_topic	Topic for gopher to collect metric data	"gala_gopher_metadata"
group_id	Kafka group ID	"gala_anteater_kafka"
Prometheus
server	IP address of the Prometheus server. Configure according to the installation node IP address.
port	Port of the Prometheus server (for example, 9090)
steps	Metric sampling interval
Schedule
duration	Interval (in minutes) between anomaly detection model executions	1
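
As a rough illustration of how these parameters fit together, the following Python sketch derives connection endpoints from an already-parsed configuration. The helper name `build_endpoints`, the plain-dictionary input, and the `http://` scheme for Prometheus are assumptions made for this example; they are not part of gala-anteater.

```python
def build_endpoints(config: dict) -> dict:
    """Derive connection endpoints from a parsed gala-anteater.yaml (hypothetical helper)."""
    kafka = config["Kafka"]
    prom = config["Prometheus"]
    return {
        # Kafka clients usually take a "host:port" bootstrap address.
        "kafka_bootstrap": f'{kafka["server"]}:{kafka["port"]}',
        # Assumption: Prometheus is reachable over plain HTTP.
        "prometheus_url": f'http://{prom["server"]}:{prom["port"]}',
        "arangodb_url": config["Arangodb"]["url"],
    }


# Values copied from the sample configuration above.
sample_config = {
    "Arangodb": {"url": "http://localhost:8529", "db_name": "spider"},
    "Kafka": {"server": "192.168.122.100", "port": "9092"},
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
}

endpoints = build_endpoints(sample_config)
print(endpoints["kafka_bootstrap"])
print(endpoints["prometheus_url"])
```

A check like this can catch an empty or mistyped server address before the service is started.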

Start

Start gala-anteater.

bash

systemctl start gala-anteater

Fault Injection

gala-anteater is a fault detection and root cause locating module. During testing, inject faults to construct fault scenarios so that gala-anteater can identify the faulty nodes and the root cause nodes along the fault propagation path.

Fault injection (for reference only)
```bash
chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120
```
ChaosBlade is a fault injection tool that can simulate various faults, including but not limited to disk, network, and I/O faults. Note: Each injected fault causes fluctuations in the related metrics that collectors such as gala-gopher report to Prometheus. These fluctuations are visible in the Prometheus graph.
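
To inspect such fluctuations outside the graph view, the Prometheus HTTP API can be queried over the injection window. The sketch below only builds the request URL for the `/api/v1/query_range` endpoint. The helper name `build_range_query_url` is hypothetical, the metric name is borrowed from the sample output later in this chapter, and the timestamps are illustrative; whether that metric reacts to a given fault depends on the environment.

```python
from urllib.parse import urlencode


def build_range_query_url(server, port, metric, start, end, step):
    """Build a Prometheus /api/v1/query_range URL (hypothetical helper)."""
    params = urlencode({"query": metric, "start": start, "end": end, "step": step})
    # Assumption: Prometheus is reachable over plain HTTP.
    return f"http://{server}:{port}/api/v1/query_range?{params}"


url = build_range_query_url(
    server="localhost",
    port="9090",
    metric="gala_gopher_proc_wchar_bytes",  # illustrative metric name
    start=1659075480,  # Unix seconds, 120 s before the end (illustrative)
    end=1659075600,
    step=5,  # seconds between returned samples
)
print(url)
```

The resulting URL can be opened with any HTTP client; the response is a JSON document containing one time series per matching label set.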

gala-anteater Service Status Query

If the following information is displayed, the service has started successfully. The startup log is written to the logs/anteater.log file in the current working directory.

log

2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
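
The row count in the reported training data shape can be cross-checked against the training window shown in the log. The sketch below assumes that `steps` is expressed in seconds (the configuration does not state the unit) and that both ends of the window are sampled; the helper name `expected_samples` is hypothetical. The second dimension of the shape (136) is not derived here.

```python
from datetime import datetime, timedelta, timezone


def expected_samples(start: datetime, end: datetime, step_seconds: int) -> int:
    """Number of evenly spaced samples in [start, end], both endpoints included."""
    return int((end - start).total_seconds()) // step_seconds + 1


# Training window taken from the log above (UTC+08:00).
tz = timezone(timedelta(hours=8))
start = datetime(2022, 8, 31, 17, 52, tzinfo=tz)
end = datetime(2022, 9, 1, 17, 52, tzinfo=tz)

# 86400 s / 5 s + 1 = 17281, which matches the first dimension in the log.
print(expected_samples(start, end, step_seconds=5))
```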

Output Data of Fault Detection

If gala-anteater detects an exception, it sends the result to the Kafka topic configured as model_topic. The output data format is as follows:

json

{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
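
A consumer of this topic typically needs only a few of these fields. The following sketch parses an abbreviated copy of the sample message and extracts them. The helper name `summarize_anomaly` and the choice of fields are assumptions for illustration, and sorting the recommended metrics by score is a convenience of this example rather than documented behavior.

```python
import json

# Abbreviated copy of the sample message above; elided entries are left out.
raw_message = """
{
  "Timestamp": 1659075600000,
  "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
  "Resource": {
    "anomaly_score": 1.0,
    "metric_label": {"machine_id": "1fd37742xxxx", "tgid": "1513", "conn_fd": "18"},
    "recommend_metrics": {
      "gala_gopher_tcp_link_notack_bytes": {"label": {"tgid": "3381701"}, "score": 0.24421279500639545}
    },
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec"
  },
  "SeverityText": "WARN"
}
"""


def summarize_anomaly(message: str) -> dict:
    """Pick out a few fields from a fault detection event (hypothetical helper)."""
    event = json.loads(message)
    resource = event["Resource"]
    # Order recommended metrics from highest to lowest score.
    recommended = sorted(
        resource.get("recommend_metrics", {}).items(),
        key=lambda item: item[1]["score"],
        reverse=True,
    )
    return {
        "entity_id": event["Attributes"]["entity_id"],
        "metric": resource["metrics"],
        "anomaly_score": resource["anomaly_score"],
        "severity": event["SeverityText"],
        "recommended": [(name, info["score"]) for name, info in recommended],
    }


summary = summarize_anomaly(raw_message)
print(summary["entity_id"], summary["metric"], summary["anomaly_score"])
```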

Output Data of Root Cause Locating

Each detected faulty node triggers root cause locating. The results are sent to the Kafka topic configured as rca_topic. The output data format is as follows:

json

{
  "Timestamp": 1724287883452,
  "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
  "Attributes": {
    "event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
    "event_source": "root-cause-inference"
  },
  "Resource": {
    "abnormal_kpi": {
      "metric_id": "gala_gopher_l7_latency_sum",
      "entity_id": "",
      "metric_labels": {
        "client_ip": "192.168.11.103",
        "comm": "python",
        "container_id": "83d0c2f4a7f4",
        "container_image": "ba2d060a624e",
        "container_name": "/k8s_backend_backend-node2-01-5bcb47fd7c-4jxxs_default_475ae627",
        "instance": "192.168.122.102:8888",
        "job": "192.168.122.102",
        "l4_role": "tcp_server",
        "l7_role": "server",
        "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
        "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
        "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
        "pod_namespace": "default",
        "protocol": "http",
        "server_ip": "192.168.11.102",
        "server_port": "26",
        "ssl": "no_ssl",
        "tgid": "3459438"
      },
      "desc": "L7 session averaged latency.",
      "score": 0.3498585816683402
    },
    "cause_metrics": [
      {
        "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
        "entity_id": "",
        "metric_labels": {
          "container_id": "1319ff912a6f",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node3-02-654dd97bf9-s8jg5_default_4a9fcc23",
          "instance": "192.168.122.103:8888",
          "job": "192.168.122.103",
          "machine_id": "494a61be-23cc-4c97-a871-902866e43747-192.168.122.103",
          "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
          "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
          "pod_namespace": "default"
        },
        "desc": "\u5bb9\u56681s\u5185\u7528\u6237\u6001CPU\u8d1f\u8f7d",
        "keyword": "process",
        "score": 0.1194249668036936,
        "path": [
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
        "entity_id": "",
        "metric_labels": {
          "cmdline": "python ./backend.py ",
          "comm": "python",
          "container_id": "de570c7328bb",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-02-548c79d989-bnl9g_default_67134fb4",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pgid": "3459969",
          "pod": "default/backend-node2-02-548c79d989-bnl9g",
          "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
          "pod_namespace": "default",
          "ppid": "3459936",
          "start_time": "1139543501",
          "tgid": "3459969"
        },
        "desc": "\u8fdb\u7a0b\u7cfb\u7edf\u8c03\u7528\u81f3FS\u7684\u5199\u5b57\u8282\u6570",
        "keyword": "process",
        "score": 0.37121879175399997,
        "path": [
          {
            "pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
            "pod": "default/backend-node2-02-548c79d989-bnl9g",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "normal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      },
      {
        "metric_id": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952",
        "entity_id": "",
        "metric_labels": {
          "client_ip": "192.168.11.103",
          "comm": "python",
          "container_id": "eef1ca1082a7",
          "container_image": "ba2d060a624e",
          "container_name": "/k8s_backend_backend-node2-03-584f4c6cfd-w4d2b_default_956c70a2",
          "instance": "192.168.122.102:8888",
          "job": "192.168.122.102",
          "l4_role": "tcp_server",
          "l7_role": "server",
          "machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
          "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
          "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
          "pod_namespace": "default",
          "protocol": "http",
          "server_ip": "192.168.11.113",
          "server_port": "26",
          "ssl": "no_ssl",
          "tgid": "3460169"
        },
        "desc": "L7 session averaged latency.",
        "keyword": null,
        "score": 0.5624857367147617,
        "path": [
          {
            "pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
            "pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          },
          {
            "pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
            "pod": "default/backend-node3-02-654dd97bf9-s8jg5",
            "instance": "192.168.122.103:8888",
            "job": "192.168.122.103",
            "pod_state": "normal"
          },
          {
            "pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
            "pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
            "instance": "192.168.122.102:8888",
            "job": "192.168.122.102",
            "pod_state": "abnormal"
          }
        ]
      }
    ]
  },
  "desc": "L7 session averaged latency.",
  "top1": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929\u5f02\u5e38",
  "top2": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43\u5f02\u5e38",
  "top3": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952\u5f02\u5e38",
  "keywords": [
    "process",
    null
  ],
  "SeverityText": "WARN",
  "SeverityNumber": 13,
  "Body": "A cause inferring event for an abnormal event"
}
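
The `path` array of each cause metric lists pods in order. In the sample it appears to run from the pod of the cause metric to the pod of the abnormal KPI, although the format description does not state this explicitly. The sketch below renders each cause metric with its path from an abbreviated copy of the sample message; the helper name `describe_causes` and the output layout are assumptions for illustration.

```python
import json

# Abbreviated copy of the sample message above (one cause metric, labels omitted).
raw_message = """
{
  "Resource": {
    "abnormal_kpi": {"metric_id": "gala_gopher_l7_latency_sum", "score": 0.3498585816683402},
    "cause_metrics": [
      {
        "metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
        "score": 0.1194249668036936,
        "path": [
          {"pod": "default/backend-node3-02-654dd97bf9-s8jg5", "pod_state": "normal"},
          {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs", "pod_state": "abnormal"}
        ]
      }
    ]
  }
}
"""


def describe_causes(message: str) -> list:
    """Render each cause metric with its pod path (hypothetical helper)."""
    resource = json.loads(message)["Resource"]
    lines = []
    for cause in resource["cause_metrics"]:
        # Join the pods in the order they appear in the message.
        hops = " -> ".join(f'{hop["pod"]} ({hop["pod_state"]})' for hop in cause["path"])
        lines.append(f'{cause["metric_id"]} score={cause["score"]:.3f}: {hops}')
    return lines


for line in describe_causes(raw_message):
    print(line)
```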
