Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Configure the yum repositories.

conf
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-24.03-LTS-SP1/EBS-24.03-LTS-SP1/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-24.03-LTS-SP1/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

bash
yum install gala-anteater

Configuration

Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.

Configuration Parameters

yaml
Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1

| Parameter | Description | Default Value |
| --- | --- | --- |
| Global | | |
| data_source | Data source | "prometheus" |
| Arangodb | | |
| url | IP address of the ArangoDB graph database | "http://localhost:8529" |
| db_name | Name of the ArangoDB database | "spider" |
| Kafka | | |
| server | IP address of the Kafka server. Configure according to the installation node IP address. | |
| port | Port of the Kafka server (for example, 9092) | |
| model_topic | Topic for reporting fault detection results | "gala_anteater_hybrid_model" |
| rca_topic | Topic for reporting root cause analysis results | "gala_cause_inference" |
| meta_topic | Topic for gopher to collect metric data | "gala_gopher_metadata" |
| group_id | Kafka group ID | "gala_anteater_kafka" |
| Prometheus | | |
| server | IP address of the Prometheus server. Configure according to the installation node IP address. | |
| port | Port of the Prometheus server (for example, 9090) | |
| steps | Metric sampling interval | |
| Schedule | Cyclic scheduling settings | Dictionary type |
| duration | Execution interval (minutes) for the anomaly detection model | 1 |
| Suppression | Alarm suppression settings | Dictionary type |
| interval | Suppression window (minutes) for filtering duplicate alarms within this time of last alarm | 10 |
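
The effect of the Suppression interval parameter can be illustrated with a minimal Python sketch. The AlarmSuppressor class, the alarm key format, and the choice to measure the window from the last reported alarm are assumptions made for illustration; they are not taken from the gala-anteater source code.

```python
from datetime import datetime, timedelta


class AlarmSuppressor:
    """Hypothetical helper: drop alarms repeated within the suppression window."""

    def __init__(self, interval_minutes: int = 10):
        # `interval_minutes` mirrors the Suppression `interval` setting (default 10).
        self.window = timedelta(minutes=interval_minutes)
        self._last_reported = {}  # alarm key -> time of the last reported alarm

    def should_report(self, key: str, now: datetime) -> bool:
        last = self._last_reported.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: suppress it
        self._last_reported[key] = now
        return True


suppressor = AlarmSuppressor(interval_minutes=10)
t0 = datetime(2024, 12, 2, 16, 25, 0)
key = "96.13.19.31/gala_gopher_disk_wspeed_kB"  # assumed key: server IP plus metric

first = suppressor.should_report(key, t0)
repeat = suppressor.should_report(key, t0 + timedelta(minutes=5))
later = suppressor.should_report(key, t0 + timedelta(minutes=11))
print(first, repeat, later)  # True False True
```

With a 10-minute window, the alarm repeated after 5 minutes is filtered out, while the one arriving after 11 minutes is reported again.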

Start

Start gala-anteater.

bash
systemctl start gala-anteater

Note:

gala-anteater supports only one running process instance. Running multiple instances leads to excessive memory consumption and disorganized logs.

gala-anteater Service Status Query

If the following information is displayed, the service has started successfully. The startup log is saved to /var/log/gala-anteater/gala-anteater.log.

log
2024-12-02 16:25:20,727 - INFO - anteater - Groups-0, metric: npu_chip_info_hbm_used_memory, start detection.
2024-12-02 16:25:20,735 - INFO - anteater - Metric-npu_chip_info_hbm_used_memory single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:20,739 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection start.
2024-12-02 16:25:21,128 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,137 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,139 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,141 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,142 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.

2024-12-02 16:25:21,144 - INFO - anteater - Groups-0, metric: npu_chip_info_aicore_current_freq, start detection.
2024-12-02 16:25:21,153 - INFO - anteater - Metric-npu_chip_info_aicore_current_freq single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,157 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection start.
2024-12-02 16:25:21,584 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,592 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,594 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,597 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,598 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,598 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,598 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection end.

2024-12-02 16:25:21,598 - INFO - anteater - Groups-0, metric: npu_chip_roce_tx_err_pkt_num, start detection.
2024-12-02 16:25:21,607 - INFO - anteater - Metric-npu_chip_roce_tx_err_pkt_num single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,611 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection start.
2024-12-02 16:25:22,040 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:22,040 - INFO - anteater - Skip space nodes compare.
2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:22,040 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection end.

2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 1/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 2/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 3/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 4/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 5/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 6/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 7/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
2024-12-02 16:25:22,043 - INFO - anteater - END!
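
To check the log programmatically, a small Python sketch such as the following could scan for the final lines of a detection cycle. The summarize_cycle function and the patterns it matches are assumptions derived only from the sample log above; the actual log wording may differ between versions.

```python
import re

# Pattern for the timing line, based only on the sample log (an assumption).
COST_RE = re.compile(r"SlowNodeDetector\._execute costs ([\d.]+) seconds")


def summarize_cycle(log_lines):
    """Return (finished, cost_seconds) for one detection cycle."""
    finished = False
    cost = None
    for line in log_lines:
        match = COST_RE.search(line)
        if match:
            cost = float(match.group(1))
        # The sample log closes each cycle with an "END!" message.
        if line.rstrip().endswith("- END!"):
            finished = True
    return finished, cost


sample = [
    "2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9",
    "2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!",
    "2024-12-02 16:25:22,043 - INFO - anteater - END!",
]
finished, cost = summarize_cycle(sample)
print(finished, cost)  # True 1.83
```

In practice, the lines would be read from /var/log/gala-anteater/gala-anteater.log instead of the inline sample.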

Output Data of Fault Detection

If gala-anteater detects an exception, it sends the result to the Kafka topic specified by model_topic. The output data format is as follows:

json
{
    "Timestamp": 1730732076935, 
    "Attributes": {
        "resultCode": 201, 
        "compute": false, 
        "network": false, 
        "storage": true, 
        "abnormalDetail": [{
            "objectId": "-1", 
            "serverIp": "96.13.19.31", 
            "deviceInfo": "96.13.19.31:8888*-1", 
            "kpiId": "gala_gopher_disk_wspeed_kB", 
            "methodType": "TIME", 
            "kpiData": [], 
            "relaIds": [], 
            "omittedDevices": []
        }], 
        "normalDetail": [], 
        "errorMsg": ""
    }, 
    "SeverityText": "WARN", 
    "SeverityNumber": 13, 
    "is_anomaly": true
}

Output Fields

| Output Field | Type/Unit | Description |
| --- | --- | --- |
| Timestamp | ms | Timestamp of fault detection and reporting |
| resultCode | int | Status code: 201 for fault, 200 for normal operation |
| compute | bool | Compute fault flag |
| network | bool | Network fault flag |
| storage | bool | Storage fault flag |
| abnormalDetail | list | Fault details |
| objectId | int | Fault object ID (-1 for node fault, 0 to 7 for the specific card) |
| serverIp | string | Faulty object IP address |
| deviceInfo | string | Detailed fault description |
| kpiId | string | Metric on which the fault was detected (for example, gala_gopher_disk_wspeed_kB) |
| methodType | string | Detection algorithm type ("TIME" or "SPACE") |
| kpiData | list | Fault time-series data (disabled by default) |
| relaIds | list | Related normal cards for comparison ("SPACE" algorithm) |
| omittedDevices | list | Cards to exclude from display |
| normalDetail | list | Time-series data of normal cards |
| errorMsg | string | Error description |
| SeverityText | string | Severity classification ("WARN" or "ERROR") |
| SeverityNumber | int | Severity level |
| is_anomaly | bool | Fault status indicator |
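
As a rough illustration of how a downstream consumer might read these fields, the following Python sketch parses one message and extracts the fault domains and fault objects. The summarize_result function and the keys of its return value are hypothetical; the message is a shortened copy of the sample output above, and reading it from Kafka is left out.

```python
import json


def summarize_result(message: str) -> dict:
    """Hypothetical helper: pull the main fields out of one detection message."""
    record = json.loads(message)
    attrs = record.get("Attributes", {})
    # Collect the fault flags that are set to true.
    fault_domains = [d for d in ("compute", "network", "storage") if attrs.get(d)]
    return {
        "is_anomaly": record.get("is_anomaly", False),
        "result_code": attrs.get("resultCode"),
        "fault_domains": fault_domains,
        "faults": [
            (d.get("serverIp"), d.get("objectId"), d.get("kpiId"), d.get("methodType"))
            for d in attrs.get("abnormalDetail", [])
        ],
    }


# Shortened copy of the sample message shown in this section.
message = """
{
    "Timestamp": 1730732076935,
    "Attributes": {
        "resultCode": 201,
        "compute": false,
        "network": false,
        "storage": true,
        "abnormalDetail": [{
            "objectId": "-1",
            "serverIp": "96.13.19.31",
            "kpiId": "gala_gopher_disk_wspeed_kB",
            "methodType": "TIME"
        }],
        "normalDetail": [],
        "errorMsg": ""
    },
    "SeverityText": "WARN",
    "SeverityNumber": 13,
    "is_anomaly": true
}
"""
summary = summarize_result(message)
print(summary["fault_domains"], summary["faults"])
```

For the sample message, the summary reports a storage fault detected by the "TIME" algorithm on node 96.13.19.31.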