Using gala-anteater

gala-anteater is an AI-based platform for detecting operating system exceptions. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts to fault diagnosis on multi-dimensional, multi-modal data.

This chapter describes how to deploy and use the gala-anteater service.

Installation

Configure the repo sources.

[oe-2209]      # openEuler 22.09 officially released repository
name=oe2209
baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
enabled=1
gpgcheck=0
priority=1

[oe-2209:Epol] # openEuler 22.09: Epol officially released repository
name=oe2209_epol
baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
enabled=1
gpgcheck=0
priority=1

Install gala-anteater.

# yum install gala-anteater

Configuration

Note: gala-anteater has no configuration file. All of its parameters are passed on the command line as startup parameters.

Startup Parameters

Parameter	Full Name	Type	Mandatory	Default Value	Name	Description
-ks	--kafka_server	string	Yes		KAFKA_SERVER	IP address of the Kafka server, for example, localhost or xxx.xxx.xxx.xxx.
-kp	--kafka_port	string	Yes		KAFKA_PORT	Port number of the Kafka server, for example, 9092.
-ps	--prometheus_server	string	Yes		PROMETHEUS_SERVER	IP address of the Prometheus server, for example, localhost or xxx.xxx.xxx.xxx.
-pp	--prometheus_port	string	Yes		PROMETHEUS_PORT	Port number of the Prometheus server, for example, 9090.
-m	--model	string	No	vae	MODEL	Exception detection model. Two models are currently supported: random_forest (random forest model, which does not support online learning) and vae (Variational Autoencoder, an unsupervised model that supports model updates based on historical data during the first startup).
-d	--duration	int	No	1	DURATION	Interval at which the exception detection model is executed, in minutes. Detection is performed every x minutes.
-r	--retrain	bool	No	False	RETRAIN	Whether to use historical data to update and iterate the model during startup. Currently, only the VAE model supports this.
-l	--look_back	int	No	4	LOOK_BACK	Number of days of historical data used to update the model. The model is updated based on the data of the last x days.
-t	--threshold	float	No	0.8	THRESHOLD	Threshold of the exception detection model, ranging from 0 to 1. A larger value can reduce the false positive rate of the model. A value greater than or equal to 0.5 is recommended.
-sli	--sli_time	int	No	400	SLI_TIME	Application performance metric, in ms. A larger value can reduce the false positive rate of the model. A value greater than or equal to 200 is recommended. For scenarios with a high false positive rate, a value greater than 1000 is recommended.
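
The constraints in the table can be checked before the service is launched. The following is a minimal sketch and is not part of gala-anteater: the function name check_params and the dictionary keys are hypothetical, chosen to mirror the full parameter names above.

```python
from typing import Dict, List

SUPPORTED_MODELS = {"random_forest", "vae"}  # models listed in the table


def check_params(params: Dict[str, object]) -> List[str]:
    """Return a list of problems found in the given startup parameters.

    Hypothetical helper: keys mirror the full parameter names in the table.
    """
    problems = []

    # The four server parameters are mandatory.
    for key in ("kafka_server", "kafka_port",
                "prometheus_server", "prometheus_port"):
        if not params.get(key):
            problems.append(f"missing mandatory parameter: --{key}")

    model = params.get("model", "vae")
    if model not in SUPPORTED_MODELS:
        problems.append(f"unsupported model: {model}")

    # Retraining at startup is only supported by the VAE model.
    if params.get("retrain", False) and model != "vae":
        problems.append("--retrain is only supported with the vae model")

    threshold = float(params.get("threshold", 0.8))
    if not 0.0 <= threshold <= 1.0:
        problems.append("--threshold must be between 0 and 1")
    elif threshold < 0.5:
        problems.append("--threshold is below the recommended minimum of 0.5")

    sli_time = int(params.get("sli_time", 400))
    if sli_time < 200:
        problems.append("--sli_time is below the recommended minimum of 200 ms")

    return problems
```

An empty list means that no problem was found. Recommendations from the table (for example, a threshold below 0.5) are reported in the same list as hard errors, so a caller may want to treat them as warnings.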

Start

Start gala-anteater.

Note: gala-anteater can be started and run only from the command line. Starting and running it through systemd is not supported.

Running in online training mode (recommended)

gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400

Running in normal mode

gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
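
The two start commands differ only in the retraining flags (-r and -l). As an illustration, the sketch below assembles the argument list for either mode. build_command is a hypothetical helper, not part of gala-anteater, and it only builds the list (for example, to pass to subprocess.run) rather than starting the service.

```python
from typing import List, Optional


def build_command(kafka_ip: str, kafka_port: str,
                  prometheus_ip: str, prometheus_port: str,
                  threshold: float = 0.6, sli_time: int = 400,
                  look_back_days: Optional[int] = None) -> List[str]:
    """Build the gala-anteater argument list shown in the commands above.

    Hypothetical helper. Passing look_back_days selects online training mode.
    """
    cmd = ["gala-anteater",
           "-ks", kafka_ip, "-kp", kafka_port,
           "-ps", prometheus_ip, "-pp", prometheus_port,
           "-m", "vae"]
    if look_back_days is not None:
        # Online training mode: retrain on the last N days during startup.
        cmd += ["-r", "True", "-l", str(look_back_days)]
    cmd += ["-t", str(threshold), "-sli", str(sli_time)]
    return cmd
```

Building the command as a list instead of a single string avoids shell quoting problems when the values come from user input.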

Query the gala-anteater service status.

If the following information is displayed, the service has started successfully. The startup log is saved to the logs/anteater.log file under the directory from which gala-anteater was run.

2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
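
To check the startup log programmatically, one option is to parse its lines. The sketch below assumes the "timestamp - logger - level - message" layout seen in the sample above. The function names are hypothetical, and the success check relies only on the "Scheduler started" message from that sample, which may differ in other versions.

```python
import re
from typing import Iterable, List, Tuple

# Matches the training lines of the sample log, for example:
# ... - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
EPOCH_RE = re.compile(
    r"Epoch\(s\):\s*(\d+)\s+train Loss:\s*([\d.]+)\s+validate Loss:\s*([\d.]+)"
)


def parse_epochs(lines: Iterable[str]) -> List[Tuple[int, float, float]]:
    """Extract (epoch, train loss, validation loss) from training log lines."""
    epochs = []
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            epochs.append((int(match.group(1)),
                           float(match.group(2)),
                           float(match.group(3))))
    return epochs


def scheduler_started(lines: Iterable[str]) -> bool:
    """Return True if the log contains the 'Scheduler started' INFO message."""
    return any(line.rstrip().endswith("- INFO - Scheduler started")
               for line in lines)
```

For example, the lines can be read with open("logs/anteater.log") and passed to both functions.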

Output Data

If gala-anteater detects an exception, it sends the result to Kafka. The output data format is as follows:

{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
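
A consumer of this output typically needs the event time, the anomaly score, and the recommended metrics ranked by score. The sketch below parses one message that has already been read from Kafka as a JSON string. The function name summarize_event is hypothetical, the field names are taken from the sample above, and the Timestamp field is assumed to be in milliseconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def summarize_event(raw: str) -> Dict[str, Any]:
    """Summarize one gala-anteater output message given as a JSON string."""
    event = json.loads(raw)
    resource = event["Resource"]

    # Assumption: "Timestamp" is in milliseconds since the Unix epoch.
    when = datetime.fromtimestamp(event["Timestamp"] / 1000, tz=timezone.utc)

    # Rank the recommended metrics from highest to lowest score.
    ranked = sorted(
        resource.get("recommend_metrics", {}).items(),
        key=lambda item: item[1]["score"],
        reverse=True,
    )
    return {
        "time": when.isoformat(),
        "entity_id": event["Attributes"]["entity_id"],
        "severity": event["SeverityText"],
        "anomaly_score": resource["anomaly_score"],
        "top_metrics": [(name, info["score"]) for name, info in ranked],
    }
```

The sample above elides part of recommend_metrics with "...", so it is not valid JSON as printed. A real message must be passed to the function.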
