Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Mount the repositories.

conf

[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-24.03-LTS-SP1/EBS-24.03-LTS-SP1/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-24.03-LTS-SP1/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

bash

yum install gala-anteater

Configuration

Note: gala-anteater uses a configuration file (/etc/gala-anteater/config/gala-anteater.yaml) for its startup settings.

Configuration Parameters

yaml

Global:
  data_source: "prometheus"

Arangodb:
  url: "http://localhost:8529"
  db_name: "spider"

Kafka:
  server: "192.168.122.100"
  port: "9092"
  model_topic: "gala_anteater_hybrid_model"
  rca_topic: "gala_cause_inference"
  meta_topic: "gala_gopher_metadata"
  group_id: "gala_anteater_kafka"
  # auth_type: plaintext/sasl_plaintext, please set "" for no auth
  auth_type: ""
  username: ""
  password: ""

Prometheus:
  server: "localhost"
  port: "9090"
  steps: "5"

Aom:
  base_url: ""
  project_id: ""
  auth_type: "token"
  auth_info:
    iam_server: ""
    iam_domain: ""
    iam_user_name: ""
    iam_password: ""
    ssl_verify: 0

Schedule:
  duration: 1
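
As a rough illustration of how the Kafka and Prometheus settings above fit together, the following Python sketch derives a Kafka bootstrap address and a Prometheus range-query URL from the sample values. The helper names (kafka_bootstrap, prometheus_range_url) are hypothetical, and the sketch assumes that steps is expressed in seconds and that Prometheus is reached over plain HTTP through its standard /api/v1/query_range endpoint. It does not claim to show how gala-anteater itself builds its requests.

```python
from urllib.parse import urlencode

# Values copied from the sample configuration above.
config = {
    "Kafka": {"server": "192.168.122.100", "port": "9092"},
    "Prometheus": {"server": "localhost", "port": "9090", "steps": "5"},
}


def kafka_bootstrap(cfg):
    """Join the Kafka server and port into a host:port address."""
    kafka = cfg["Kafka"]
    return f"{kafka['server']}:{kafka['port']}"


def prometheus_range_url(cfg, query, start, end):
    """Build a query_range URL (assumption: steps is a number of seconds)."""
    prom = cfg["Prometheus"]
    params = urlencode(
        {"query": query, "start": start, "end": end, "step": prom["steps"]}
    )
    return f"http://{prom['server']}:{prom['port']}/api/v1/query_range?{params}"


print(kafka_bootstrap(config))
print(prometheus_range_url(config, "up", 1733127900, 1733127960))
```

Printing both values before starting the service is a quick way to confirm that the configured addresses point at the intended nodes.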

Parameter	Description	Default Value
Global
data_source	Data source	"prometheus"
Arangodb
url	IP address of the ArangoDB graph database	"http://localhost:8529"
db_name	Name of the ArangoDB database	"spider"
Kafka
server	IP address of the Kafka server. Configure according to the installation node IP address.
port	Port of the Kafka server (for example, 9092)
model_topic	Topic for reporting fault detection results	"gala_anteater_hybrid_model"
rca_topic	Topic for reporting root cause analysis results	"gala_cause_inference"
meta_topic	Topic for gopher to collect metric data	"gala_gopher_metadata"
group_id	Kafka group ID	"gala_anteater_kafka"
Prometheus
server	IP address of the Prometheus server. Configure according to the installation node IP address.
port	Port of the Prometheus server (for example, 9090)
steps	Metric sampling interval
Schedule	Cyclic scheduling settings	Dictionary type
duration	Execution interval (minutes) for the anomaly detection model	1
Suppression	Alarm suppression settings	Dictionary type
interval	Suppression window (minutes). Duplicate alarms raised within this time of the last alarm are filtered out.	10
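
The suppression window can be pictured with a minimal Python sketch. The AlarmSuppressor class and the choice of alarm key (server IP plus KPI name) are hypothetical, and the sketch assumes the window is measured from the last alarm that was actually reported. The real gala-anteater logic may differ.

```python
class AlarmSuppressor:
    """Drop an alarm if the same key was reported within the suppression window."""

    def __init__(self, interval_minutes=10):
        self.window_seconds = interval_minutes * 60
        self.last_reported = {}  # alarm key -> time (seconds) of last reported alarm

    def should_report(self, key, now):
        last = self.last_reported.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the suppression window
        self.last_reported[key] = now
        return True


suppressor = AlarmSuppressor(interval_minutes=10)
key = ("96.13.19.31", "gala_gopher_disk_wspeed_kB")  # hypothetical alarm key
print(suppressor.should_report(key, now=0))    # True: first alarm for this key
print(suppressor.should_report(key, now=300))  # False: 5 minutes after the last report
print(suppressor.should_report(key, now=900))  # True: 15 minutes after the last report
```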

Start

Start gala-anteater.

bash

systemctl start gala-anteater

Note:
gala-anteater supports only one running process instance. Running multiple instances leads to excessive memory consumption and disorganized logging.

gala-anteater Service Status Query

If the following information is displayed, the service has started successfully. The startup log is saved to /var/log/gala-anteater/gala-anteater.log.

log

2024-12-02 16:25:20,727 - INFO - anteater - Groups-0, metric: npu_chip_info_hbm_used_memory, start detection.
2024-12-02 16:25:20,735 - INFO - anteater - Metric-npu_chip_info_hbm_used_memory single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:20,739 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection start.
2024-12-02 16:25:21,128 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,137 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,139 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,141 - INFO - anteater - dnscan labels: [-1  0  0  0 -1  0 -1 -1]
2024-12-02 16:25:21,142 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.

2024-12-02 16:25:21,144 - INFO - anteater - Groups-0, metric: npu_chip_info_aicore_current_freq, start detection.
2024-12-02 16:25:21,153 - INFO - anteater - Metric-npu_chip_info_aicore_current_freq single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,157 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection start.
2024-12-02 16:25:21,584 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:21,592 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,594 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,597 - INFO - anteater - dnscan labels: [0 0 0 0 0 0 0 0]
2024-12-02 16:25:21,598 - INFO - anteater - space_nodes_compare result: [].
2024-12-02 16:25:21,598 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:21,598 - INFO - anteater - work on npu_chip_info_aicore_current_freq, slow_node_detection end.

2024-12-02 16:25:21,598 - INFO - anteater - Groups-0, metric: npu_chip_roce_tx_err_pkt_num, start detection.
2024-12-02 16:25:21,607 - INFO - anteater - Metric-npu_chip_roce_tx_err_pkt_num single group has data 8. ranks: [0, 1, 2, 3, 4, 5, 6, 7]
2024-12-02 16:25:21,611 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection start.
2024-12-02 16:25:22,040 - INFO - anteater - time_node_compare result: [].
2024-12-02 16:25:22,040 - INFO - anteater - Skip space nodes compare.
2024-12-02 16:25:22,040 - INFO - anteater - Time and space aggregated result: [].
2024-12-02 16:25:22,040 - INFO - anteater - work on npu_chip_roce_tx_err_pkt_num, slow_node_detection end.

2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 1/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 2/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 3/9
2024-12-02 16:25:22,041 - INFO - anteater - accomplishment: 4/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 5/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 6/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 7/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 8/9
2024-12-02 16:25:22,042 - INFO - anteater - accomplishment: 9/9
2024-12-02 16:25:22,043 - INFO - anteater - SlowNodeDetector._execute costs 1.83 seconds!
2024-12-02 16:25:22,043 - INFO - anteater - END!
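
To check a log like the one above without reading it line by line, a small Python sketch can pull out the aggregated result for each metric and confirm that the detection round reached END!. The regular expressions are derived only from the sample lines shown here, and the summarize function is a hypothetical helper. Other gala-anteater versions may log in a different format.

```python
import re

# "<timestamp> - <level> - <logger> - <message>", as in the sample log above.
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>\w+) - (?P<logger>\w+) - (?P<message>.*)$"
)
AGGREGATED = re.compile(r"^Time and space aggregated result: (?P<result>.*)\.$")
END_OF_METRIC = re.compile(r"^work on (?P<metric>\w+), slow_node_detection end\.$")


def summarize(lines):
    """Map each metric to its aggregated result text; report whether END! was seen."""
    results, pending, finished = {}, None, False
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        message = match.group("message")
        aggregated = AGGREGATED.match(message)
        metric_end = END_OF_METRIC.match(message)
        if aggregated:
            pending = aggregated.group("result")
        elif metric_end:
            results[metric_end.group("metric")] = pending
            pending = None
        elif message == "END!":
            finished = True
    return results, finished


sample = [
    "2024-12-02 16:25:21,142 - INFO - anteater - Time and space aggregated result: [].",
    "2024-12-02 16:25:21,144 - INFO - anteater - work on npu_chip_info_hbm_used_memory, slow_node_detection end.",
    "2024-12-02 16:25:22,043 - INFO - anteater - END!",
]
print(summarize(sample))
```

An empty result ("[]") for every metric together with a final END! corresponds to a detection round that completed without reporting a slow node.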

Output Data of Fault Detection

If gala-anteater detects an exception, it sends the result to the Kafka topic configured as model_topic. The output data format is as follows:

json

{
    "Timestamp": 1730732076935, 
    "Attributes": {
        "resultCode": 201, 
        "compute": false, 
        "network": false, 
        "storage": true, 
        "abnormalDetail": [{
            "objectId": "-1", 
            "serverIp": "96.13.19.31", 
            "deviceInfo": "96.13.19.31:8888*-1", 
            "kpiId": "gala_gopher_disk_wspeed_kB", 
            "methodType": "TIME", 
            "kpiData": [], 
            "relaIds": [], 
            "omittedDevices": []
        }], 
        "normalDetail": [], 
        "errorMsg": ""
    }, 
    "SeverityText": "WARN", 
    "SeverityNumber": 13, 
    "is_anomaly": true
}

Output Fields

Output Field	Unit/Type	Description
Timestamp	ms	Timestamp of fault detection and reporting
resultCode	int	Status code: 201 for fault, 200 for normal operation
compute	bool	Compute fault flag
network	bool	Network fault flag
storage	bool	Storage fault flag
abnormalDetail	list	Fault details
objectId	int	Fault object ID (-1 for node fault, 0 to 7 for the specific card)
serverIp	string	Faulty object IP address
deviceInfo	string	Detailed fault description
kpiId	string	Name of the KPI (metric) on which the fault was detected
methodType	string	Detection algorithm type ("TIME" or "SPACE")
kpiData	list	Fault time-series data (disabled by default)
relaIds	list	Related normal cards for comparison ("SPACE" algorithm)
omittedDevices	list	Cards to exclude from display
normalDetail	list	Time-series data of normal cards
errorMsg	string	Error description
SeverityText	string	Severity classification ("WARN" or "ERROR")
SeverityNumber	int	Severity level
is_anomaly	bool	Fault status indicator
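
A consumer of these messages might reduce each one to the fields it cares about. The Python sketch below parses a message shaped like the example above and summarizes it. The summarize_result function is a hypothetical helper, the sample string is a trimmed copy of the example, and reading from Kafka is left out to keep the sketch self-contained.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the example message above.
raw = """
{
  "Timestamp": 1730732076935,
  "Attributes": {
    "resultCode": 201, "compute": false, "network": false, "storage": true,
    "abnormalDetail": [{
      "objectId": "-1", "serverIp": "96.13.19.31",
      "kpiId": "gala_gopher_disk_wspeed_kB", "methodType": "TIME"
    }],
    "normalDetail": [], "errorMsg": ""
  },
  "SeverityText": "WARN", "SeverityNumber": 13, "is_anomaly": true
}
"""


def summarize_result(text):
    """Return a short summary of one fault detection message."""
    msg = json.loads(text)
    attrs = msg["Attributes"]
    millis = msg["Timestamp"]  # milliseconds since the Unix epoch
    when = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    when += timedelta(milliseconds=millis % 1000)
    return {
        "time": when.isoformat(),
        "is_anomaly": msg["is_anomaly"],
        "fault_domains": [d for d in ("compute", "network", "storage") if attrs[d]],
        "faults": [
            (d["serverIp"], d["objectId"], d["kpiId"], d["methodType"])
            for d in attrs["abnormalDetail"]
        ],
    }


summary = summarize_result(raw)
print(summary)
```

For the sample message, the summary lists storage as the only fault domain and one node-level fault (objectId "-1") detected by the TIME algorithm.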
