Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Configure the following repo sources (for example, in a .repo file under /etc/yum.repos.d/).

[oe-22.03-lts-sp1-everything] # openEuler 22.03-LTS-SP1 officially released repository
name=oe-22.03-lts-sp1-everything
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/everything/x86_64/
enabled=1
gpgcheck=0
priority=1

[oe-22.03-lts-sp1-epol-update] # openEuler 22.03-LTS-SP1 Update officially released repository
name=oe-22.03-lts-sp1-epol-update
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/update/main/x86_64/
enabled=1
gpgcheck=0
priority=1

[oe-22.03-lts-sp1-epol-main] # openEuler 22.03-LTS-SP1 EPOL officially released repository
name=oe-22.03-lts-sp1-epol-main
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/main/x86_64/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

# yum install gala-anteater

Configuration

Note: gala-anteater has no configuration file. All of its parameters are passed on the command line at startup.

Startup Parameters
| Parameter | Parameter Full Name | Type | Mandatory (Yes/No) | Default Value | Name | Description |
|---|---|---|---|---|---|---|
| -ks | --kafka_server | string | Yes | - | KAFKA_SERVER | IP address of the Kafka server, for example, localhost / xxx.xxx.xxx.xxx. |
| -kp | --kafka_port | string | Yes | - | KAFKA_PORT | Port number of the Kafka server, for example, 9092. |
| -ps | --prometheus_server | string | Yes | - | PROMETHEUS_SERVER | IP address of the Prometheus server, for example, localhost / xxx.xxx.xxx.xxx. |
| -pp | --prometheus_port | string | Yes | - | PROMETHEUS_PORT | Port number of the Prometheus server, for example, 9090. |
| -m | --model | string | No | vae | MODEL | Exception detection model. Two models are currently supported: random_forest (random forest model, which does not support online learning) and vae (Variational Autoencoder, an unsupervised model that supports model update based on historical data during the first startup). |
| -d | --duration | int | No | 1 | DURATION | Interval at which the exception detection model runs, in minutes. Detection is performed every x minutes. |
| -r | --retrain | bool | No | False | RETRAIN | Whether to use historical data to update and iterate the model during startup. Currently, only the VAE model is supported. |
| -l | --look_back | int | No | 4 | LOOK_BACK | Number of days of historical data used to update the model. The model is updated based on the data of the last x days. |
| -t | --threshold | float | No | 0.8 | THRESHOLD | Threshold of the exception detection model, ranging from 0 to 1. A larger value can reduce the false positive rate of the model. A value greater than or equal to 0.5 is recommended. |
| -sli | --sli_time | int | No | 400 | SLI_TIME | Application performance metric, in ms. A larger value can reduce the false positive rate of the model. A value greater than or equal to 200 is recommended. For scenarios with a high false positive rate, a value greater than 1000 is recommended. |
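
The parameters above can be mirrored in a small wrapper that validates values before gala-anteater is launched. The following is a hypothetical sketch built only from the parameter table; it is not gala-anteater's own argument handling, and the names build_parser, str_to_bool, and unit_interval are illustrative.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'True'/'False' style strings to bool (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def unit_interval(value: str) -> float:
    """Parse a float and require it to lie in [0, 1], as documented for -t."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags, types, and defaults."""
    parser = argparse.ArgumentParser(prog="anteater-wrapper")
    # Mandatory connection parameters (documented as strings).
    parser.add_argument("-ks", "--kafka_server", required=True)
    parser.add_argument("-kp", "--kafka_port", required=True)
    parser.add_argument("-ps", "--prometheus_server", required=True)
    parser.add_argument("-pp", "--prometheus_port", required=True)
    # Optional model parameters with the documented defaults.
    parser.add_argument("-m", "--model", default="vae",
                        choices=["random_forest", "vae"])
    parser.add_argument("-d", "--duration", type=int, default=1)
    parser.add_argument("-r", "--retrain", type=str_to_bool, default=False)
    parser.add_argument("-l", "--look_back", type=int, default=4)
    parser.add_argument("-t", "--threshold", type=unit_interval, default=0.8)
    parser.add_argument("-sli", "--sli_time", type=int, default=400)
    return parser
```

Parsing only the four mandatory parameters yields the documented defaults for the rest, and an out-of-range threshold is rejected before any process is started.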

Start

Start gala-anteater.

Note: gala-anteater can be started and run in command line mode, but cannot be started and run in systemd mode.

  • Running in online training mode (recommended)
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
  • Running in common mode
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
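
Because systemd startup is not supported, the command is usually launched from a shell or a wrapper script. A minimal sketch of such a wrapper, assuming the gala-anteater executable is on PATH; build_command and launch are hypothetical helpers, and the addresses are placeholders.

```python
import subprocess


def build_command(kafka_ip, kafka_port, prom_ip, prom_port,
                  retrain=False, look_back=7, threshold=0.6, sli_time=400):
    """Assemble the documented gala-anteater command line as an argument list."""
    cmd = [
        "gala-anteater",
        "-ks", kafka_ip, "-kp", str(kafka_port),
        "-ps", prom_ip, "-pp", str(prom_port),
        "-m", "vae",
    ]
    if retrain:
        # Online training mode: retrain on the last `look_back` days at startup.
        cmd += ["-r", "True", "-l", str(look_back)]
    cmd += ["-t", str(threshold), "-sli", str(sli_time)]
    return cmd


def launch(cmd):
    """Run the command in the foreground; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder endpoints; replace with the real Kafka and Prometheus addresses.
    command = build_command("localhost", 9092, "localhost", 9090, retrain=True)
    print(" ".join(command))  # inspect the command before calling launch(command)
```

Passing the arguments as a list avoids shell quoting issues; launch is deliberately not called in the example so that the command can be reviewed first.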

Query the gala-anteater service status.

If information similar to the following is displayed, the service has started successfully. The startup log is saved to logs/anteater.log under the directory from which gala-anteater was run.

2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
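
The status check can also be scripted against the log file. The sketch below is illustrative only: it assumes that the "Scheduler started" line shown in the sample log marks a successful startup and that epoch lines keep the format shown above, which may differ between versions. The function names are hypothetical.

```python
import re

# Pattern derived from the sample log lines, for example:
# "... - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00"
EPOCH_RE = re.compile(
    r"Epoch\(s\):\s*(\d+)\s+train Loss:\s*([\d.]+)\s+validate Loss:\s*([\d.]+)"
)


def parse_epochs(lines):
    """Return (epoch, train_loss, validate_loss) tuples found in the log lines."""
    results = []
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            results.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return results


def scheduler_started(lines):
    """True if any line reports that the scheduler has started."""
    return any("Scheduler started" in line for line in lines)


# Two lines copied from the sample log, used here instead of reading a file.
SAMPLE_LINES = [
    "2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00",
    "2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started",
]

if __name__ == "__main__":
    print(parse_epochs(SAMPLE_LINES))
    print(scheduler_started(SAMPLE_LINES))
```

To check a real run, read the lines from logs/anteater.log and pass them to the same functions.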

Output Data

If gala-anteater detects an exception, it sends the result to Kafka. The output data format is as follows:

{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
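
A consumer of these events usually needs the affected entity, the anomaly score, and the highest-scoring recommended metrics. A minimal sketch, assuming the message value has already been read from Kafka as a JSON string with the fields shown above (the consumer setup and topic name are outside this example); summarize_event is a hypothetical helper, and the sample is an abbreviated copy of the output above.

```python
import json


def summarize_event(raw: str, top_n: int = 3) -> dict:
    """Extract the key fields of an event and rank recommended metrics by score."""
    event = json.loads(raw)
    resource = event["Resource"]
    ranked = sorted(
        resource.get("recommend_metrics", {}).items(),
        key=lambda item: item[1]["score"],
        reverse=True,  # highest score first
    )
    return {
        "timestamp_ms": event["Timestamp"],
        "entity_id": event["Attributes"]["entity_id"],
        "severity": event["SeverityText"],
        "anomaly_score": resource["anomaly_score"],
        "metric": resource["metrics"],
        "top_metrics": [(name, info["score"]) for name, info in ranked[:top_n]],
    }


# Abbreviated sample based on the output format above.
SAMPLE_EVENT = """
{
  "Timestamp": 1659075600000,
  "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
  "Resource": {
    "anomaly_score": 1.0,
    "recommend_metrics": {
      "gala_gopher_tcp_link_notack_bytes": {"label": {}, "score": 0.24421279500639545}
    },
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec"
  },
  "SeverityText": "WARN",
  "SeverityNumber": 14
}
"""

if __name__ == "__main__":
    print(summarize_event(SAMPLE_EVENT))
```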
