Using gala-anteater
gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.
This chapter describes how to deploy and use the gala-anteater service.
Installation
Configure the following repositories (for example, in a .repo file under /etc/yum.repos.d/).
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-24.03-LTS-SP1/EBS-24.03-LTS-SP1/everything/$basearch/
enabled=1
gpgcheck=0
priority=1
[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-24.03-LTS-SP1/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1
Install gala-anteater.
yum install gala-anteater
Configuration
Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.
Configuration Parameters
Global:
  data_source: "prometheus"
Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"
Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""
Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"
Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0
Schedule:
  duration: 1
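Before starting the service, it can help to confirm that the settings it depends on are filled in. The following is a minimal sketch, not part of gala-anteater: the helper names, the choice of which keys count as required, and the `http://` scheme are all assumptions made for illustration. It works on an already parsed configuration (for example, the dictionary returned by PyYAML's `yaml.safe_load`).

```python
# Hypothetical helpers for illustration; gala-anteater does not ship them.
# Which keys are treated as required is an assumption of this sketch.
REQUIRED_KEYS = {
    "Global": ["data_source"],
    "Kafka": ["server", "port", "model_topic"],
    "Prometheus": ["server", "port", "steps"],
    "Schedule": ["duration"],
}


def missing_settings(config):
    """Return 'Section.key' names that are absent or empty in the parsed config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return missing


def prometheus_url(config):
    """Build a Prometheus base URL from server and port (http scheme assumed)."""
    prom = config["Prometheus"]
    return f"http://{prom['server']}:{prom['port']}"


# Values copied from the sample configuration above.
sample = {
    "Global": {"data_source": "prometheus"},
    "Kafka": {
        "server": "192.168.122.100",
        "port": "9092",
        "model_topic": "gala_anteater_hybrid_model",
    },
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
    "Schedule": {"duration": 1},
}

print(missing_settings(sample))
print(prometheus_url(sample))
```

An empty list from `missing_settings` means every key this sketch checks has a value; it says nothing about whether the Kafka or Prometheus endpoints are reachable.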
Parameter | Description | Default Value |
---|---|---|
Global | | |
data_source | Data source | "prometheus" |
Arangodb | | |
url | URL of the ArangoDB graph database | "http://localhost:8529" |
db_name | Name of the ArangoDB database | "spider" |
Kafka | | |
server | IP address of the Kafka server. Set it to the IP address of the node where Kafka is installed. | |
port | Port of the Kafka server (for example, 9092) | |
model_topic | Topic to which fault detection results are reported | "gala_anteater_hybrid_model" |
rca_topic | Topic to which root cause analysis results are reported | "gala_cause_inference" |
meta_topic | Topic used by gala-gopher for collected metric data | "gala_gopher_metadata" |
group_id | Kafka consumer group ID | "gala_anteater_kafka" |
Prometheus | | |
server | IP address of the Prometheus server. Set it to the IP address of the node where Prometheus is installed. | |
port | Port of the Prometheus server (for example, 9090) | |
steps | Metric sampling interval | |
Schedule | Cyclic scheduling settings | Dictionary type |
duration | Interval, in minutes, at which the anomaly detection model runs | 1 |
Suppression | Alarm suppression settings | Dictionary type |
interval | Suppression window, in minutes. Duplicate alarms raised within this window after the last alarm are filtered out. | 10 |
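To make the Suppression interval concrete, the sketch below shows one way a suppression window can filter duplicate alarms. It illustrates the idea only and is not gala-anteater's implementation; the class name, the use of a per-key dictionary, and the choice of the server IP address as the key are assumptions.

```python
class AlarmSuppressor:
    """Drop alarms repeated for the same key within `interval_minutes`."""

    def __init__(self, interval_minutes=10):
        self.window_ms = interval_minutes * 60 * 1000
        self.last_alarm_ms = {}  # key -> timestamp (ms) of the last reported alarm

    def should_report(self, key, timestamp_ms):
        last = self.last_alarm_ms.get(key)
        if last is not None and timestamp_ms - last < self.window_ms:
            return False  # duplicate inside the suppression window
        self.last_alarm_ms[key] = timestamp_ms
        return True


suppressor = AlarmSuppressor(interval_minutes=10)
t0 = 1730732076935  # millisecond timestamp, as in the Timestamp output field

print(suppressor.should_report("96.13.19.31", t0))                   # True
print(suppressor.should_report("96.13.19.31", t0 + 5 * 60 * 1000))   # False
print(suppressor.should_report("96.13.19.31", t0 + 10 * 60 * 1000))  # True
```

In this sketch a suppressed alarm does not extend the window: the window is always measured from the last alarm that was actually reported.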
Start
Start gala-anteater.
systemctl start gala-anteater
Note:
Run only one gala-anteater process instance. Multiple instances cause excessive memory consumption and interleaved, hard-to-read logs.
gala-anteater Service Status Query
If log entries similar to the following are displayed, the service has started successfully. The startup log is saved to /var/log/gala-anteater/gala-anteater.log.
2024-12-02 16:25:20,727 - INFO - anteater - Groups-0, metric: npu_chip_info_hbm_used_memory, start detection.
2024-12-02 16:25:20,735 - INFO - anteater - Metric-npu_chip_info_hbm_used_memory single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:20,739 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection start.
2024-12-02 16:25:21,128 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,137 - INFO - anteater - dnscan labels: [-1 0 0 0 -1 0 -1 -1]
2024-12-02 16:25:21,139 - INFO - anteater - dnscan labels: [-1 0 0 0 -1 0 -1 -1]
2024-12-02 16:25:21,141 - INFO - anteater - dnscan labels: [-1 0 0 0 -1 0 -1 -1]
2024-12-02 16:25:21,142 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.
2024-12-02 16:25:21,144 - INFO - anteater - Groups-0, metric: npu_chip_info_aicore_current_freq, start detection.
2024-12-02 16:25:21,153 - INFO - anteater - Metric-npu_chip_info_aicore_current_freq single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,157 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection start.
2024-12-02 16:25:21,584 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,592 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,594 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,597 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,598 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,598 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,598 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection end.
2024-12-02 16:25:21,598 - INFO - anteater - Groups-0, metric: npu_chip_roce_tx_err_pkt_num, start detection.
2024-12-02 16:25:21,607 - INFO - anteater - Metric-npu_chip_roce_tx_err_pkt_num single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,611 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection start.
2024-12-02 16:25:22,040 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:22,040 - INFO - anteater - Skip space nodes compare.
2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:22,040 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection end.
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 1/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 2/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 3/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 4/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 5/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 6/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 7/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
2024-12-02 16:25:22,043 - INFO - anteater - END!
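The log layout above (timestamp, level, logger name, message) is regular enough to check programmatically. The following sketch is a hypothetical helper, not a gala-anteater tool. It assumes that the `END!` line marks the end of a detection round and that the `SlowNodeDetector._execute costs ... seconds` line reports its duration, as the sample suggests.

```python
import re

# Matches the layout shown above: timestamp - level - logger - message.
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+) - (?P<logger>\S+) - (?P<message>.*)$"
)
COST = re.compile(r"SlowNodeDetector\._execute costs (?P<seconds>[\d.]+) seconds")


def summarize_round(lines):
    """Report whether a detection round finished and how long it took."""
    finished = False
    seconds = None
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected layout
        message = match.group("message")
        cost = COST.search(message)
        if cost:
            seconds = float(cost.group("seconds"))
        if message == "END!":
            finished = True
    return {"finished": finished, "seconds": seconds}


# Last three lines of the sample log above.
sample = [
    "2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9",
    "2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!",
    "2024-12-02 16:25:22,043 - INFO - anteater - END!",
]
print(summarize_round(sample))
```

To run it against a live system, pass the lines of /var/log/gala-anteater/gala-anteater.log instead of the `sample` list.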
Output Data of Fault Detection
If gala-anteater detects an exception, it sends the result to the Kafka topic configured as model_topic. The output data format is as follows:
{
"Timestamp": 1730732076935,
"Attributes": {
"resultCode": 201,
"compute": false,
"network": false,
"storage": true,
"abnormalDetail": [{
"objectId": "-1",
"serverIp": "96.13.19.31",
"deviceInfo": "96.13.19.31:8888*-1",
"kpiId": "gala_gopher_disk_wspeed_kB",
"methodType": "TIME",
"kpiData": [],
"relaIds": [],
"omittedDevices": []
}],
"normalDetail": [],
"errorMsg": ""
},
"SeverityText": "WARN",
"SeverityNumber": 13,
"is_anomaly": true
}
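A consumer of model_topic typically needs only a few facts from each message. The sketch below decodes one message and extracts them; it is an illustration under stated assumptions, not an official client. The function name and the shape of the returned dictionary are hypothetical, and the Kafka consumer that would deliver `raw` is left out so the example runs with the standard library alone.

```python
import json
from datetime import datetime, timezone


def summarize_message(raw_message):
    """Extract the main facts from one model_topic message (hypothetical helper)."""
    message = json.loads(raw_message)
    attributes = message["Attributes"]
    return {
        # Timestamp is in milliseconds.
        "detected_at": datetime.fromtimestamp(
            message["Timestamp"] / 1000, tz=timezone.utc
        ),
        # resultCode 201 indicates a fault, 200 normal operation.
        "is_fault": attributes["resultCode"] == 201,
        "domains": [
            name for name in ("compute", "network", "storage") if attributes.get(name)
        ],
        "kpis": [detail["kpiId"] for detail in attributes["abnormalDetail"]],
    }


# The sample message shown above.
raw = """
{
  "Timestamp": 1730732076935,
  "Attributes": {
    "resultCode": 201,
    "compute": false,
    "network": false,
    "storage": true,
    "abnormalDetail": [{
      "objectId": "-1",
      "serverIp": "96.13.19.31",
      "deviceInfo": "96.13.19.31:8888*-1",
      "kpiId": "gala_gopher_disk_wspeed_kB",
      "methodType": "TIME",
      "kpiData": [],
      "relaIds": [],
      "omittedDevices": []
    }],
    "normalDetail": [],
    "errorMsg": ""
  },
  "SeverityText": "WARN",
  "SeverityNumber": 13,
  "is_anomaly": true
}
"""

summary = summarize_message(raw)
print(summary["is_fault"], summary["domains"], summary["kpis"])
```

For the sample message this reports a fault in the storage domain on the metric gala_gopher_disk_wspeed_kB.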
Output Fields
Output Field | Type or Unit | Description |
---|---|---|
Timestamp | ms | Timestamp at which the fault was detected and reported |
resultCode | int | Status code: 201 for a fault, 200 for normal operation |
compute | bool | Compute fault flag |
network | bool | Network fault flag |
storage | bool | Storage fault flag |
abnormalDetail | list | Fault details |
objectId | int | Fault object ID (-1 for a node fault, 0 to 7 for a specific card) |
serverIp | string | IP address of the faulty object |
deviceInfo | string | Detailed fault description |
kpiId | string | Metric (KPI) on which the fault was detected (for example, "gala_gopher_disk_wspeed_kB") |
methodType | string | Detection algorithm type ("TIME" or "SPACE") |
kpiData | list | Fault time-series data (disabled by default) |
relaIds | list | Related normal cards used for comparison ("SPACE" algorithm) |
omittedDevices | list | Cards to exclude from display |
normalDetail | list | Time-series data of normal cards |
errorMsg | string | Error description |
SeverityText | string | Severity classification ("WARN" or "ERROR") |
SeverityNumber | int | Severity level |
is_anomaly | bool | Fault status indicator |
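The objectId convention (-1 for a node fault, 0 to 7 for a specific card) can be turned into a readable label for each abnormalDetail entry. The helper below is a hypothetical sketch; the wording of the labels is an assumption. Note that the sample message carries objectId as the string "-1", so the sketch converts it to an integer first.

```python
def describe_target(detail):
    """Describe the faulty object of one abnormalDetail entry (hypothetical helper)."""
    # The sample message carries objectId as a string, so convert it first.
    object_id = int(detail["objectId"])
    if object_id == -1:
        return f"node {detail['serverIp']}"
    return f"card {object_id} on node {detail['serverIp']}"


# Entry taken from the sample message; the second call varies objectId for illustration.
detail = {
    "objectId": "-1",
    "serverIp": "96.13.19.31",
    "kpiId": "gala_gopher_disk_wspeed_kB",
    "methodType": "TIME",
}
print(describe_target(detail))                       # node 96.13.19.31
print(describe_target({**detail, "objectId": "3"}))  # card 3 on node 96.13.19.31
```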