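gala-anteater User Guide
gala-anteater is an AI-based anomaly detection platform for operating systems. It provides time-series data preprocessing, anomaly detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.
This chapter describes how to deploy and use the gala-anteater service.
Installation
Configure the repo sources:
[oe-2209] # openEuler 22.09 official release repository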
name=oe2209
baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
enabled=1
gpgcheck=0
priority=1
[oe-2209:Epol] # openEuler 22.09:Epol official release repository
name=oe2209_epol
baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
enabled=1
gpgcheck=0
priority=1
Install gala-anteater:
# yum install gala-anteater
Configuration
Note: gala-anteater has no separate configuration file. Its parameters are passed as command-line startup arguments.
Startup parameters
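Option | Long option | Type | Required | Default | Name | Description |
---|---|---|---|---|---|---|
-ks | --kafka_server | string | True | | KAFKA_SERVER | IP address of the Kafka server, for example localhost or xxx.xxx.xxx.xxx |
-kp | --kafka_port | string | True | | KAFKA_PORT | Port of the Kafka server, for example 9092 |
-ps | --prometheus_server | string | True | | PROMETHEUS_SERVER | IP address of the Prometheus server, for example localhost or xxx.xxx.xxx.xxx |
-pp | --prometheus_port | string | True | | PROMETHEUS_PORT | Port of the Prometheus server, for example 9090 |
-m | --model | string | False | vae | MODEL | Anomaly detection model. Two models are currently supported: random_forest (random forest model; does not support online learning) and vae (Variational Autoencoder, an unsupervised model; supports updating the model with historical data at first startup) |
-d | --duration | int | False | 1 | DURATION | How often the anomaly detection model runs, in minutes: detection is performed once every x minutes |
-r | --retrain | bool | False | False | RETRAIN | Whether to update the model with historical data at startup; currently supported only by the vae model |
-l | --look_back | int | False | 4 | LOOK_BACK | Number of past days (x) of historical data used to update the model |
-t | --threshold | float | False | 0.8 | THRESHOLD | Threshold of the anomaly detection model, in the range (0, 1). A larger value reduces the false-positive rate; a value of at least 0.5 is recommended |
-sli | --sli_time | int | False | 400 | SLI_TIME | Application performance indicator, in milliseconds. A larger value reduces the false-positive rate; a value of at least 200 is recommended, and 1000 or more for scenarios with a high false-positive rate |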
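To make the types and defaults in the table concrete, the following is a minimal sketch of how such options could be declared with Python's standard argparse module. It is an illustration only and is not taken from the gala-anteater source; the helper names `str2bool` and `build_parser` are hypothetical. The `str2bool` helper is included because a value such as `-r True` arrives as a string, and argparse's `type=bool` would treat any non-empty string (including "False") as true.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings to a real boolean."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented options;
    # not the actual gala-anteater implementation.
    parser = argparse.ArgumentParser(prog="gala-anteater")
    # Required connection settings (no default value).
    parser.add_argument("-ks", "--kafka_server", type=str, required=True)
    parser.add_argument("-kp", "--kafka_port", type=str, required=True)
    parser.add_argument("-ps", "--prometheus_server", type=str, required=True)
    parser.add_argument("-pp", "--prometheus_port", type=str, required=True)
    # Optional settings with the documented defaults.
    parser.add_argument("-m", "--model", type=str, default="vae",
                        choices=["random_forest", "vae"])
    parser.add_argument("-d", "--duration", type=int, default=1)
    parser.add_argument("-r", "--retrain", type=str2bool, default=False)
    parser.add_argument("-l", "--look_back", type=int, default=4)
    parser.add_argument("-t", "--threshold", type=float, default=0.8)
    parser.add_argument("-sli", "--sli_time", type=int, default=400)
    return parser
```

With this sketch, `build_parser().parse_args(["-ks", "localhost", "-kp", "9092", "-ps", "localhost", "-pp", "9090"])` yields the defaults listed in the table for every optional parameter.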
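Startup
Run the following command to start gala-anteater.
Note: gala-anteater can be started from the command line only; starting it through systemd is not supported.
Run with online training (recommended)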
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
Run in normal mode
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
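When gala-anteater is launched from a script rather than typed by hand, the two command lines above can be assembled programmatically. The sketch below is one possible way to do that; the function `build_command` and its parameter names are hypothetical, and the default values simply repeat the example commands.

```python
import shlex


def build_command(kafka_ip, kafka_port, prom_ip, prom_port,
                  retrain=False, look_back=7, threshold=0.6, sli_time=400):
    """Return the gala-anteater argument list for the vae model."""
    cmd = [
        "gala-anteater",
        "-ks", kafka_ip, "-kp", str(kafka_port),
        "-ps", prom_ip, "-pp", str(prom_port),
        "-m", "vae",
    ]
    if retrain:
        # Online training mode: retrain at startup on the last `look_back` days.
        cmd += ["-r", "True", "-l", str(look_back)]
    cmd += ["-t", str(threshold), "-sli", str(sli_time)]
    return cmd


def as_shell_string(cmd):
    """Render the argument list as a single, safely quoted shell command."""
    return shlex.join(cmd)
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues; `as_shell_string` is only meant for logging or display.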
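Checking the gala-anteater service status
If the log shows the following content, the service has started successfully. The startup log is also saved to logs/anteater.log under the current working directory.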
2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0 train Loss: 136.68 validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1 train Loss: 113.73 validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2 train Loss: 110.60 validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3 train Loss: 109.39 validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4 train Loss: 106.48 validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98 train Loss: 97.63 validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99 train Loss: 97.75 validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
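The check described above can also be automated by scanning the log file. The following sketch parses lines in the format shown in the sample log and reports whether the scheduler has started. Treating the `Scheduler started` record as the success marker is an assumption based on the sample output, and the names `LOG_LINE`, `startup_succeeded`, and `check_log_file` are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as:
# 2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r" - (?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def startup_succeeded(lines):
    """Return True if an INFO 'Scheduler started' record appears in lines."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if (match and match.group("level") == "INFO"
                and match.group("message") == "Scheduler started"):
            return True
    return False


def check_log_file(path="logs/anteater.log"):
    """Scan the log file under the current working directory."""
    with Path(path).open(encoding="utf-8") as handle:
        return startup_succeeded(handle)
```

Because model retraining runs before the scheduler starts, a check like this may need to be repeated until training has finished.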
Output data
When gala-anteater detects an anomaly, it outputs the result to Kafka. The output data format is as follows:
{
"Timestamp":1659075600000,
"Attributes":{
"entity_id":"xxxxxx_sli_1513_18",
"event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
"event_type":"app"
},
"Resource":{
"anomaly_score":1.0,
"anomaly_count":13,
"total_count":13,
"duration":60,
"anomaly_ratio":1.0,
"metric_label":{
"machine_id":"1fd37742xxxx",
"tgid":"1513",
"conn_fd":"18"
},
"recommend_metrics":{
"gala_gopher_tcp_link_notack_bytes":{
"label":{
"__name__":"gala_gopher_tcp_link_notack_bytes",
"client_ip":"x.x.x.165",
"client_port":"51352",
"hostname":"localhost.localdomain",
"instance":"x.x.x.172:8888",
"job":"prometheus-x.x.x.172",
"machine_id":"xxxxxx",
"protocol":"2",
"role":"0",
"server_ip":"x.x.x.172",
"server_port":"8888",
"tgid":"3381701"
},
"score":0.24421279500639545
},
...
},
"metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
},
"SeverityText":"WARN",
"SeverityNumber":14,
"Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
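A downstream consumer typically needs only a few fields from this message. The sketch below shows one way to decode a single message and rank the recommended metrics by score, using only the Python standard library. The `AnomalyEvent` class and `parse_event` function are hypothetical names, and reading the message from Kafka is left out because the topic name and client library depend on the deployment.

```python
import json
from dataclasses import dataclass


@dataclass
class AnomalyEvent:
    timestamp_ms: int
    entity_id: str
    event_type: str
    anomaly_score: float
    metric: str
    severity: str
    ranked_metrics: list  # (metric name, score) pairs, highest score first


def parse_event(raw):
    """Decode one JSON message in the format shown above."""
    data = json.loads(raw)
    attributes = data["Attributes"]
    resource = data["Resource"]
    recommended = resource.get("recommend_metrics", {})
    # Sort recommended metrics so the most relevant one comes first.
    ranked = sorted(
        ((name, info["score"]) for name, info in recommended.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return AnomalyEvent(
        timestamp_ms=data["Timestamp"],
        entity_id=attributes["entity_id"],
        event_type=attributes["event_type"],
        anomaly_score=resource["anomaly_score"],
        metric=resource["metrics"],
        severity=data["SeverityText"],
        ranked_metrics=ranked,
    )
```

The literal `...` in the sample above marks elided entries, so the sample must be completed or trimmed to valid JSON before it can be passed to `parse_event`.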