
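    gala-anteater User Guide

    gala-anteater is an AI-based platform for detecting operating system anomalies. It provides time-series data preprocessing, anomaly detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

    This chapter describes how to deploy and use the gala-anteater service.

    Installation

    Configure the repo sources: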

    # openEuler 22.09 official release repository
    [oe-2209]
    name=oe2209
    baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
    enabled=1
    gpgcheck=0
    priority=1
    
    # openEuler 22.09 Epol official release repository
    [oe-2209:Epol]
    name=oe2209_epol
    baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
    enabled=1
    gpgcheck=0
    priority=1
    

    Install gala-anteater:

    # yum install gala-anteater
    

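    Configuration

    Note: gala-anteater has no separate configuration file. All parameters are passed as command-line startup arguments.

    Startup parameters

    | Option | Long option | Type | Required | Default | Name | Description |
    | --- | --- | --- | --- | --- | --- | --- |
    | -ks | --kafka_server | string | True | | KAFKA_SERVER | IP address of the Kafka server, for example localhost / xxx.xxx.xxx.xxx |
    | -kp | --kafka_port | string | True | | KAFKA_PORT | Port of the Kafka server, for example 9092 |
    | -ps | --prometheus_server | string | True | | PROMETHEUS_SERVER | IP address of the Prometheus server, for example localhost / xxx.xxx.xxx.xxx |
    | -pp | --prometheus_port | string | True | | PROMETHEUS_PORT | Port of the Prometheus server, for example 9090 |
    | -m | --model | string | False | vae | MODEL | Anomaly detection model. Two models are currently supported: random_forest (random forest model; does not support online learning) and vae (Variational Autoencoder; an unsupervised model that can be updated with historical data at first startup) |
    | -d | --duration | int | False | 1 | DURATION | How often the anomaly detection model runs, in minutes: one detection every x minutes |
    | -r | --retrain | bool | False | False | RETRAIN | Whether to update the model with historical data at startup. Currently supported only for the vae model |
    | -l | --look_back | int | False | 4 | LOOK_BACK | Update the model with historical data from the past x days |
    | -t | --threshold | float | False | 0.8 | THRESHOLD | Threshold of the anomaly detection model, in the range (0,1). Larger values reduce the false positive rate; 0.5 or higher is recommended |
    | -sli | --sli_time | int | False | 400 | SLI_TIME | Application performance metric, in milliseconds. Larger values reduce the false positive rate; 200 or higher is recommended, and 1000 or higher for scenarios with a high false positive rate |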
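
Because every setting is passed on the command line, it can help to validate the values before launching. The following is a minimal sketch, not part of gala-anteater: the helper name `build_anteater_command` is hypothetical, and the requirement that `duration`, `look_back` and `sli_time` be positive is an assumption. Only the flags, defaults and the (0,1) threshold range come from the parameter list above.

```python
import shlex

SUPPORTED_MODELS = ("random_forest", "vae")


def build_anteater_command(kafka_server, kafka_port, prometheus_server,
                           prometheus_port, model="vae", duration=1,
                           retrain=False, look_back=4, threshold=0.8,
                           sli_time=400):
    """Hypothetical helper: check the documented constraints and return
    the gala-anteater argument list (it does not start the service)."""
    if model not in SUPPORTED_MODELS:
        raise ValueError("model must be one of %s" % (SUPPORTED_MODELS,))
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie in the open interval (0, 1)")
    if retrain and model != "vae":
        raise ValueError("retrain is currently supported only for vae")
    # Assumption: the manual does not state lower bounds for these values.
    if min(duration, look_back, sli_time) < 1:
        raise ValueError("duration, look_back and sli_time must be positive")

    cmd = [
        "gala-anteater",
        "-ks", str(kafka_server), "-kp", str(kafka_port),
        "-ps", str(prometheus_server), "-pp", str(prometheus_port),
        "-m", model, "-d", str(duration),
        "-t", str(threshold), "-sli", str(sli_time),
    ]
    if retrain:
        # look_back only matters when the model is retrained at startup.
        cmd += ["-r", "True", "-l", str(look_back)]
    return cmd


def as_shell_line(cmd):
    """Render the argument list as a single shell-quoted command line."""
    return shlex.join(cmd)
```

For example, `as_shell_line(build_anteater_command("localhost", 9092, "localhost", 9090, retrain=True, look_back=7, threshold=0.6))` yields a command line that can be pasted into a shell.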

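    Startup

    Run the following command to start gala-anteater.

    Note: gala-anteater can be started from the command line only. Starting it through systemd is not supported.

    Run with online training (recommended)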
    gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
    
    Run in normal mode
    gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
    
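    Checking the gala-anteater service status

    If the log shows the following output, the service has started successfully. The startup log is also saved to logs/anteater.log under the current working directory.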

    2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
    2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
    2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
    2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
    2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
    2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
    2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
    2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
    2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
    2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
    2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
    2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
    2022-09-01 17:53:38,456 - root - INFO - Using cpu device
    2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
    2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
    2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
    2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
    2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
    ...
    2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
    2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
    2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
    2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
    2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
    2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
    2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
    2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
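
The startup check can also be scripted. The sketch below parses log lines in the layout shown above (`time - logger - LEVEL - message`) and collects the per-epoch losses. Treating the `Scheduler started` message as the success marker is an assumption made for this example, and `summarize_startup` is a hypothetical helper rather than a gala-anteater API.

```python
import re

# Layout taken from the sample log: "<time> - <logger> - <LEVEL> - <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)
EPOCH = re.compile(
    r"Epoch\(s\):\s*(\d+)\s+train Loss:\s*([\d.]+)\s+validate Loss:\s*([\d.]+)"
)


def summarize_startup(lines):
    """Return (started, losses) where losses is a list of
    (epoch, train_loss, validate_loss) tuples found in the log."""
    started = False
    losses = []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # skip "..." and any other non-log lines
        message = match.group("message")
        epoch = EPOCH.search(message)
        if epoch:
            losses.append((int(epoch.group(1)),
                           float(epoch.group(2)),
                           float(epoch.group(3))))
        elif message.startswith("Scheduler started"):
            # Assumption: this message marks a completed startup.
            started = True
    return started, losses
```

To apply it to a running instance, pass the lines of logs/anteater.log (the path named above) to `summarize_startup`. A falling validate loss across epochs, as in the sample log, suggests the vae retraining is converging.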
    

    Output data

    When gala-anteater detects an anomaly, it writes the result to Kafka. The output data format is as follows:

    {
       "Timestamp":1659075600000,
       "Attributes":{
          "entity_id":"xxxxxx_sli_1513_18",
          "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
          "event_type":"app"
       },
       "Resource":{
          "anomaly_score":1.0,
          "anomaly_count":13,
          "total_count":13,
          "duration":60,
          "anomaly_ratio":1.0,
          "metric_label":{
             "machine_id":"1fd37742xxxx",
             "tgid":"1513",
             "conn_fd":"18"
          },
          "recommend_metrics":{
             "gala_gopher_tcp_link_notack_bytes":{
                "label":{
                   "__name__":"gala_gopher_tcp_link_notack_bytes",
                   "client_ip":"x.x.x.165",
                   "client_port":"51352",
                   "hostname":"localhost.localdomain",
                   "instance":"x.x.x.172:8888",
                   "job":"prometheus-x.x.x.172",
                   "machine_id":"xxxxxx",
                   "protocol":"2",
                   "role":"0",
                   "server_ip":"x.x.x.172",
                   "server_port":"8888",
                   "tgid":"3381701"
                },
                "score":0.24421279500639545
             },
             ...
          },
          "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
       },
       "SeverityText":"WARN",
       "SeverityNumber":14,
       "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
    }
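
A consumer of these events usually needs only a few fields. The following sketch parses one event and ranks the entries under `recommend_metrics` by `score`. The function names are hypothetical, and the sample is a trimmed copy of the event above with the elided `...` entries left out. How the message is fetched from Kafka is left out on purpose, since the topic name is not given here.

```python
import json

# Trimmed copy of the documented event; labels are omitted for brevity.
SAMPLE_EVENT = """
{
  "Timestamp": 1659075600000,
  "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
  "Resource": {
    "anomaly_score": 1.0,
    "anomaly_count": 13,
    "total_count": 13,
    "anomaly_ratio": 1.0,
    "recommend_metrics": {
      "gala_gopher_tcp_link_notack_bytes": {"score": 0.24421279500639545}
    },
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec"
  },
  "SeverityText": "WARN",
  "SeverityNumber": 14
}
"""


def top_recommended_metrics(event, limit=3):
    """Return up to `limit` (metric_name, score) pairs, highest score first."""
    recommended = event.get("Resource", {}).get("recommend_metrics", {})
    ranked = sorted(
        ((name, info.get("score", 0.0)) for name, info in recommended.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]


def describe_event(raw):
    """Parse one JSON event string and keep the fields needed for triage."""
    event = json.loads(raw)
    resource = event["Resource"]
    return {
        "timestamp": event["Timestamp"],
        "entity_id": event["Attributes"]["entity_id"],
        "severity": event["SeverityText"],
        "metric": resource["metrics"],
        "anomaly_ratio": resource["anomaly_ratio"],
        "top_metrics": top_recommended_metrics(event),
    }
```

Calling `describe_event(SAMPLE_EVENT)` returns a flat dictionary that is easier to log or forward than the nested event.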
    
