Using gala-anteater
gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.
This chapter describes how to deploy and use the gala-anteater service.
Installation
Configure the repositories.
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.09/EBS-openEuler-24.09/everything/$basearch/
enabled=1
gpgcheck=0
priority=1
[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1
Install gala-anteater.
yum install gala-anteater
Configuration
Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.
Configuration Parameters
Global:
  data_source: "prometheus"
Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"
Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""
Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"
Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0
Schedule:
  duration: 1
| Parameter | Description | Default Value |
|---|---|---|
| Global | | |
| data_source | Data source | "prometheus" |
| Arangodb | | |
| url | URL of the ArangoDB graph database | "http://localhost:8529" |
| db_name | Name of the ArangoDB database | "spider" |
| Kafka | | |
| server | IP address of the Kafka server. Set this to the IP address of the node where Kafka is installed. | |
| port | Port of the Kafka server (for example, 9092) | |
| model_topic | Topic for reporting fault detection results | "gala_anteater_hybrid_model" |
| rca_topic | Topic for reporting root cause analysis results | "gala_cause_inference" |
| meta_topic | Topic for gopher to collect metric data | "gala_gopher_metadata" |
| group_id | Kafka group ID | "gala_anteater_kafka" |
| Prometheus | | |
| server | IP address of the Prometheus server. Set this to the IP address of the node where Prometheus is installed. | |
| port | Port of the Prometheus server (for example, 9090) | |
| steps | Metric sampling interval | |
| Schedule | | |
| duration | Interval (in minutes) between anomaly detection model executions | 1 |
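Before starting the service, it can help to confirm that no documented setting was left empty. The following is a minimal sketch, not part of gala-anteater: `find_missing_settings` and `REQUIRED_KEYS` are hypothetical names, the configuration is assumed to be already parsed into a Python dict (for example with a YAML library), and treating every parameter in the table as required is an assumption.

```python
# Hypothetical pre-start check; gala-anteater itself does not ship this helper.
# Assumption: every parameter listed in the table above must be non-empty.
REQUIRED_KEYS = {
    "Global": ["data_source"],
    "Arangodb": ["url", "db_name"],
    "Kafka": ["server", "port", "model_topic", "rca_topic",
              "meta_topic", "group_id"],
    "Prometheus": ["server", "port", "steps"],
    "Schedule": ["duration"],
}


def find_missing_settings(config):
    """Return 'Section.key' names that are absent or empty in the parsed config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return missing


# Parsed form of the sample configuration, with the Kafka server left unset.
example_config = {
    "Global": {"data_source": "prometheus"},
    "Arangodb": {"url": "http://localhost:8529", "db_name": "spider"},
    "Kafka": {
        "server": "",
        "port": "9092",
        "model_topic": "gala_anteater_hybrid_model",
        "rca_topic": "gala_cause_inference",
        "meta_topic": "gala_gopher_metadata",
        "group_id": "gala_anteater_kafka",
    },
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
    "Schedule": {"duration": 1},
}

print(find_missing_settings(example_config))  # ['Kafka.server']
```

The Aom section is left out of the check because the table does not describe it.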
Start
Start gala-anteater.
systemctl start gala-anteater
Fault Injection
gala-anteater is a fault detection and root cause locating module. In the testing phase, you need to inject faults to construct fault scenarios. This allows gala-anteater to obtain information about faulty nodes and the root cause nodes of fault propagation.
Fault injection (for reference only)
chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120
ChaosBlade is a fault injection tool that can simulate various faults, including but not limited to drive faults, network faults, and I/O faults.
Note: Each injected fault causes fluctuations in the related metrics that metric collectors (such as gala-gopher) report to Prometheus. These fluctuations are visible in the Prometheus graph.
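Besides the Prometheus graph, the same fluctuations can be read from the Prometheus HTTP API, whose `/api/v1/query_range` endpoint returns a metric over a time window. The sketch below only builds the request URL. `build_query_range_url` is a hypothetical helper, the metric name is borrowed from the sample output later in this chapter and may not be the one affected by a given fault, and reusing the configured `steps` value as a step in seconds is an assumption.

```python
from urllib.parse import urlencode


def build_query_range_url(server, port, metric, start, end, step):
    """Build a Prometheus range-query URL for one metric.

    start and end are Unix timestamps in seconds; step is the resolution
    in seconds between returned samples.
    """
    params = urlencode({"query": metric, "start": start, "end": end, "step": step})
    return f"http://{server}:{port}/api/v1/query_range?{params}"


# Server and port follow the Prometheus section of the sample configuration.
url = build_query_range_url(
    "localhost", "9090",
    "gala_gopher_tcp_link_notack_bytes",
    start=1659075000, end=1659075600, step=5,
)
print(url)
```

Opening the printed URL with any HTTP client returns the samples as JSON, provided the Prometheus server is reachable.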
gala-anteater Service Status Query
If the following information is displayed, the service has started successfully. The startup log is saved to the logs/anteater.log file in the current running directory.
2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0 train Loss: 136.68 validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1 train Loss: 113.73 validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2 train Loss: 110.60 validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3 train Loss: 109.39 validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4 train Loss: 106.48 validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98 train Loss: 97.63 validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99 train Loss: 97.75 validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
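The check above can also be scripted. The following sketch assumes the log keeps the `time - logger - LEVEL - message` layout shown in the sample and treats the `Scheduler started` line as the sign that training has finished and detection is scheduled. `scheduler_started` is a hypothetical helper, not a gala-anteater command.

```python
import re

# Matches lines such as:
# 2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r" - (?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def scheduler_started(lines):
    """Return True if any log line reports that the APScheduler scheduler started."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if (match
                and match["logger"] == "apscheduler.scheduler"
                and match["message"] == "Scheduler started"):
            return True
    return False


sample_lines = [
    "2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...",
    "2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started",
]
print(scheduler_started(sample_lines))  # True
```

To check a real deployment, pass the lines of logs/anteater.log instead of `sample_lines`.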
Output Data of Fault Detection
If gala-anteater detects an exception, it sends the result to the Kafka topic specified by model_topic. The output data format is as follows:
{
"Timestamp":1659075600000,
"Attributes":{
"entity_id":"xxxxxx_sli_1513_18",
"event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
"event_type":"app"
},
"Resource":{
"anomaly_score":1.0,
"anomaly_count":13,
"total_count":13,
"duration":60,
"anomaly_ratio":1.0,
"metric_label":{
"machine_id":"1fd37742xxxx",
"tgid":"1513",
"conn_fd":"18"
},
"recommend_metrics":{
"gala_gopher_tcp_link_notack_bytes":{
"label":{
"__name__":"gala_gopher_tcp_link_notack_bytes",
"client_ip":"x.x.x.165",
"client_port":"51352",
"hostname":"localhost.localdomain",
"instance":"x.x.x.172:8888",
"job":"prometheus-x.x.x.172",
"machine_id":"xxxxxx",
"protocol":"2",
"role":"0",
"server_ip":"x.x.x.172",
"server_port":"8888",
"tgid":"3381701"
},
"score":0.24421279500639545
},
...
},
"metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
},
"SeverityText":"WARN",
"SeverityNumber":14,
"Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
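A consumer of this topic usually wants the metrics with the highest scores first. The sketch below ranks the entries of `recommend_metrics` by their `score` field. It is an illustration only: `top_recommended_metrics` is a hypothetical helper, the message is a trimmed copy of the sample above, and the `hypothetical_metric` entry is invented so that there is something to sort.

```python
import json


def top_recommended_metrics(event, limit=3):
    """Return (metric name, score) pairs sorted by score, highest first."""
    recommended = event["Resource"]["recommend_metrics"]
    ranked = sorted(recommended.items(),
                    key=lambda item: item[1]["score"],
                    reverse=True)
    return [(name, info["score"]) for name, info in ranked[:limit]]


# Trimmed sample message; "hypothetical_metric" is not real output.
sample_message = """
{
  "Timestamp": 1659075600000,
  "Resource": {
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec",
    "recommend_metrics": {
      "hypothetical_metric": {"score": 0.1},
      "gala_gopher_tcp_link_notack_bytes": {"score": 0.24421279500639545}
    }
  },
  "SeverityText": "WARN"
}
"""

for name, score in top_recommended_metrics(json.loads(sample_message)):
    print(f"{score:.3f}  {name}")
```

Reading the messages from Kafka itself needs a client library and is left out here.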
Output Data of Root Cause Locating
Each faulty node detected triggers root cause locating. Results of root cause locating are sent to the Kafka topic specified by rca_topic. The output data format is as follows:
{
"Timestamp": 1724287883452,
"event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
"Attributes": {
"event_id": "1721125159975_475ae627-7e88-41ed-8bb8-ff0fee95a69d_l7_3459438_192.168.11.103_192.168.11.102_26_tcp_server_server_http",
"event_source": "root-cause-inference"
},
"Resource": {
"abnormal_kpi": {
"metric_id": "gala_gopher_l7_latency_sum",
"entity_id": "",
"metric_labels": {
"client_ip": "192.168.11.103",
"comm": "python",
"container_id": "83d0c2f4a7f4",
"container_image": "ba2d060a624e",
"container_name": "/k8s_backend_backend-node2-01-5bcb47fd7c-4jxxs_default_475ae627",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"l4_role": "tcp_server",
"l7_role": "server",
"machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
"pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
"pod_namespace": "default",
"protocol": "http",
"server_ip": "192.168.11.102",
"server_port": "26",
"ssl": "no_ssl",
"tgid": "3459438"
},
"desc": "L7 session averaged latency.",
"score": 0.3498585816683402
},
"cause_metrics": [
{
"metric_id": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
"entity_id": "",
"metric_labels": {
"container_id": "1319ff912a6f",
"container_image": "ba2d060a624e",
"container_name": "/k8s_backend_backend-node3-02-654dd97bf9-s8jg5_default_4a9fcc23",
"instance": "192.168.122.103:8888",
"job": "192.168.122.103",
"machine_id": "494a61be-23cc-4c97-a871-902866e43747-192.168.122.103",
"pod": "default/backend-node3-02-654dd97bf9-s8jg5",
"pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
"pod_namespace": "default"
},
"desc": "\u5bb9\u56681s\u5185\u7528\u6237\u6001CPU\u8d1f\u8f7d",
"keyword": "process",
"score": 0.1194249668036936,
"path": [
{
"pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
"pod": "default/backend-node3-02-654dd97bf9-s8jg5",
"instance": "192.168.122.103:8888",
"job": "192.168.122.103",
"pod_state": "normal"
},
{
"pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"pod_state": "abnormal"
}
]
},
{
"metric_id": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
"entity_id": "",
"metric_labels": {
"cmdline": "python ./backend.py ",
"comm": "python",
"container_id": "de570c7328bb",
"container_image": "ba2d060a624e",
"container_name": "/k8s_backend_backend-node2-02-548c79d989-bnl9g_default_67134fb4",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
"pgid": "3459969",
"pod": "default/backend-node2-02-548c79d989-bnl9g",
"pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
"pod_namespace": "default",
"ppid": "3459936",
"start_time": "1139543501",
"tgid": "3459969"
},
"desc": "\u8fdb\u7a0b\u7cfb\u7edf\u8c03\u7528\u81f3FS\u7684\u5199\u5b57\u8282\u6570",
"keyword": "process",
"score": 0.37121879175399997,
"path": [
{
"pod_id": "67134fb4-b2a3-43c5-a5b3-b3b463ad7d43",
"pod": "default/backend-node2-02-548c79d989-bnl9g",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"pod_state": "normal"
},
{
"pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
"pod": "default/backend-node3-02-654dd97bf9-s8jg5",
"instance": "192.168.122.103:8888",
"job": "192.168.122.103",
"pod_state": "normal"
},
{
"pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"pod_state": "abnormal"
}
]
},
{
"metric_id": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952",
"entity_id": "",
"metric_labels": {
"client_ip": "192.168.11.103",
"comm": "python",
"container_id": "eef1ca1082a7",
"container_image": "ba2d060a624e",
"container_name": "/k8s_backend_backend-node2-03-584f4c6cfd-w4d2b_default_956c70a2",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"l4_role": "tcp_server",
"l7_role": "server",
"machine_id": "66086618-3bad-489e-b17d-05245224f29a-192.168.122.102",
"pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
"pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
"pod_namespace": "default",
"protocol": "http",
"server_ip": "192.168.11.113",
"server_port": "26",
"ssl": "no_ssl",
"tgid": "3460169"
},
"desc": "L7 session averaged latency.",
"keyword": null,
"score": 0.5624857367147617,
"path": [
{
"pod_id": "956c70a2-9918-459c-a0a8-39396251f952",
"pod": "default/backend-node2-03-584f4c6cfd-w4d2b",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"pod_state": "abnormal"
},
{
"pod_id": "4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
"pod": "default/backend-node3-02-654dd97bf9-s8jg5",
"instance": "192.168.122.103:8888",
"job": "192.168.122.103",
"pod_state": "normal"
},
{
"pod_id": "475ae627-7e88-41ed-8bb8-ff0fee95a69d",
"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
"instance": "192.168.122.102:8888",
"job": "192.168.122.102",
"pod_state": "abnormal"
}
]
}
]
},
"desc": "L7 session averaged latency.",
"top1": "gala_gopher_container_cpu_user_seconds_total@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929\u5f02\u5e38",
"top2": "gala_gopher_proc_wchar_bytes@67134fb4-b2a3-43c5-a5b3-b3b463ad7d43\u5f02\u5e38",
"top3": "gala_gopher_l7_latency_avg@956c70a2-9918-459c-a0a8-39396251f952\u5f02\u5e38",
"keywords": [
"process",
null
],
"SeverityText": "WARN",
"SeverityNumber": 13,
"Body": "A cause inferring event for an abnormal event"
}
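Each entry in `cause_metrics` carries a `path` list of pods together with their `pod_state`. The following sketch prints each path on one line, in the order the pods appear in the message. `describe_paths` is a hypothetical helper, and the event is a trimmed copy of the first cause metric in the sample above.

```python
def describe_paths(event):
    """Return one line per cause metric: the metric ID and its pod path."""
    lines = []
    for cause in event["Resource"]["cause_metrics"]:
        hops = [f'{hop["pod"]} ({hop["pod_state"]})' for hop in cause["path"]]
        lines.append(f'{cause["metric_id"]}: ' + " -> ".join(hops))
    return lines


# Trimmed copy of the sample event: only the fields used here are kept.
sample_event = {
    "Resource": {
        "cause_metrics": [
            {
                "metric_id": "gala_gopher_container_cpu_user_seconds_total"
                             "@4a9fcc23-8ba2-4b0a-bcb0-b1bfd89ed929",
                "score": 0.1194249668036936,
                "path": [
                    {"pod": "default/backend-node3-02-654dd97bf9-s8jg5",
                     "pod_state": "normal"},
                    {"pod": "default/backend-node2-01-5bcb47fd7c-4jxxs",
                     "pod_state": "abnormal"},
                ],
            }
        ]
    }
}

for line in describe_paths(sample_event):
    print(line)
```

The sketch does not reorder the cause metrics, so the lines follow the order of the message rather than the `top1` to `top3` fields.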