Using gala-anteater

gala-anteater is an AI-based operating system exception detection platform. It provides functions such as time series data preprocessing, exception detection, and exception reporting. Based on offline pre-training, online model incremental learning and model update, it can be well adapted to multi-dimensional and multi-modal data fault diagnosis.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Mount the repositories.

basic
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-22.03-LTS-SP4/EBS-openEuler-22.03-LTS-SP4/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/EBS-openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

bash
# yum install gala-anteater

Configuration

Note: Some gala-anteater parameters can be configured in /etc/gala-anteater/config/gala-anteater.yaml.

Startup Parameters

ParameterParameter Full NameTypeMandatory (Yes/No)Default ValueNameDescription
-ks--kafka_serverstringTrueKAFKA_SERVERIP address of the Kafka server, for example, localhost / xxx.xxx.xxx.xxx.
-kp--kafka_portstringTrueKAFKA_PORTPort number of the Kafka server, for example, 9092.
-ps--prometheus_serverstringTruePROMETHEUS_SERVERIP address of the Prometheus server, for example, localhost / xxx.xxx.xxx.xxx.
-pp--prometheus_portstringTruePROMETHEUS_PORTPort number of the Prometheus server, for example, 9090.
-m--modelstringFalsevaeMODELException detection model. Currently, two exception detection models are supported: random_forest and vae.
random_forest: random forest model, which does not support online learning
vae: Variational Atuoencoder (VAE), which is an unsupervised model and supports model update based on historical data during the first startup.
-d--durationintFalse1DURATIONFrequency of executing the exception detection model. The unit is minute, which means that the detection is performed every x minutes.
-r--retrainboolFalseFalseRETRAINWhether to use historical data to update and iterate the model during startup. Currently, only the VAE model is supported.
-l--look_backintFalse4LOOK_BACKWhether to update the model based on the historical data of the last x days.
-t--thresholdfloatFalse0.8THRESHOLDThreshold of the exception detection model, ranging from 0 to 1. A larger value can reduce the false positive rate of the model. It is recommended that the value be greater than or equal to 0.5.
-sli--sli_timeintFalse400SLI_TIMEApplication performance metric. The unit is ms. A larger value can reduce the false positive rate of the model. It is recommended that the value be greater than or equal to 200.
For scenarios with a high false positive rate, it is recommended that the value be greater than 1000.

Start

Start gala-anteater.

Note: gala-anteater can be started and run in command line mode, but cannot be started and run in systemd mode.

  • Running in online training mode (recommended)
bash
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
  • Running in common mode
bash
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400

Query the gala-anteater service status.

If the following information is displayed, the service is started successfully. The startup log is saved to the logs/anteater.log file in the current running directory.

log
2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)

Output Data

If gala-anteater detects an exception, it sends the result to Kafka. The output data format is as follows:

json
{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}