      Using gala-anteater

      gala-anteater is an AI-based exception detection platform for operating systems. It provides time series data preprocessing, exception detection, and exception reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

      This chapter describes how to deploy and use the gala-anteater service.

      Installation

      Configure the repo sources.

      [oe-22.03-lts-sp1-everything] # openEuler 22.03-LTS-SP1 officially released repository
      name=oe-2203-lts-sp1-everything
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/everything/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-update] # openEuler 22.03-LTS-SP1 EPOL Update officially released repository
      name=oe-22.03-lts-sp1-epol-update
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/update/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-main] # openEuler 22.03-LTS-SP1 EPOL officially released repository
      name=oe-22.03-lts-sp1-epol-main
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      

      Install gala-anteater.

      # yum install gala-anteater
      

      Configuration

      Note: gala-anteater has no configuration file. All of its parameters are passed on the command line at startup.

      Startup Parameters
      | Parameter | Parameter Full Name | Type | Mandatory | Default Value | Name | Description |
      |-----------|---------------------|------|-----------|---------------|------|-------------|
      | -ks | --kafka_server | string | True | | KAFKA_SERVER | IP address of the Kafka server, for example, localhost / xxx.xxx.xxx.xxx. |
      | -kp | --kafka_port | string | True | | KAFKA_PORT | Port number of the Kafka server, for example, 9092. |
      | -ps | --prometheus_server | string | True | | PROMETHEUS_SERVER | IP address of the Prometheus server, for example, localhost / xxx.xxx.xxx.xxx. |
      | -pp | --prometheus_port | string | True | | PROMETHEUS_PORT | Port number of the Prometheus server, for example, 9090. |
      | -m | --model | string | False | vae | MODEL | Exception detection model. Two models are currently supported: random_forest (random forest model, which does not support online learning) and vae (Variational Autoencoder, an unsupervised model that supports model update based on historical data during the first startup). |
      | -d | --duration | int | False | 1 | DURATION | Interval at which the exception detection model is executed, in minutes. Detection is performed every x minutes. |
      | -r | --retrain | bool | False | False | RETRAIN | Whether to use historical data to update and iterate the model during startup. Currently, only the VAE model supports this. |
      | -l | --look_back | int | False | 4 | LOOK_BACK | Number of most recent days (x) of historical data used to update the model. |
      | -t | --threshold | float | False | 0.8 | THRESHOLD | Threshold of the exception detection model, ranging from 0 to 1. A larger value can reduce the false positive rate of the model. A value greater than or equal to 0.5 is recommended. |
      | -sli | --sli_time | int | False | 400 | SLI_TIME | Application performance metric. The unit is ms. A larger value can reduce the false positive rate of the model. A value greater than or equal to 200 is recommended. For scenarios with a high false positive rate, a value greater than 1000 is recommended. |
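
      Because every setting is passed on the command line, it can help to assemble and check the arguments in one place before starting the service. The following is a minimal sketch, not part of gala-anteater: the helper name build_command and its checks are hypothetical, and only the flags, the two model names, and the 0 to 1 threshold range come from the parameter list above. How gala-anteater itself parses the value given to -r is not verified here.

```python
import shlex

# Short flags as listed in the startup parameters; the helper is hypothetical.
MANDATORY = {
    "kafka_server": "-ks",
    "kafka_port": "-kp",
    "prometheus_server": "-ps",
    "prometheus_port": "-pp",
}
OPTIONAL = {
    "model": "-m",
    "duration": "-d",
    "retrain": "-r",
    "look_back": "-l",
    "threshold": "-t",
    "sli_time": "-sli",
}
MODELS = ("random_forest", "vae")


def build_command(**params):
    """Return the gala-anteater argument list for the given parameters."""
    missing = [name for name in MANDATORY if name not in params]
    if missing:
        raise ValueError("missing mandatory parameters: " + ", ".join(missing))
    unknown = sorted(set(params) - set(MANDATORY) - set(OPTIONAL))
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    model = params.get("model", "vae")  # vae is the documented default
    if model not in MODELS:
        raise ValueError("model must be one of: " + ", ".join(MODELS))
    threshold = params.get("threshold")
    if threshold is not None and not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    if params.get("retrain") and model != "vae":
        raise ValueError("retraining is only supported for the vae model")

    argv = ["gala-anteater"]
    for name, flag in {**MANDATORY, **OPTIONAL}.items():
        if name in params:
            argv += [flag, str(params[name])]
    return argv


if __name__ == "__main__":
    argv = build_command(
        kafka_server="localhost", kafka_port=9092,
        prometheus_server="localhost", prometheus_port=9090,
        model="vae", retrain=True, look_back=7, threshold=0.6, sli_time=400,
    )
    print(shlex.join(argv))  # shell-quoted command line
```

      The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.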

      Start

      Start gala-anteater.

      Note: gala-anteater can be started and run from the command line, but it cannot be run as a systemd service.

      • Running in online training mode (recommended)
      gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
      
      • Running in common mode
      gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
      

      Query the gala-anteater service status.

      If the following information is displayed, the service is started successfully. The startup log is saved to the logs/anteater.log file in the current running directory.

      2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
      2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
      2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
      2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
      2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
      2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
      2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
      2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
      2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
      2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
      2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
      2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
      2022-09-01 17:53:38,456 - root - INFO - Using cpu device
      2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
      2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
      2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
      2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
      2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
      ...
      2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
      2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
      2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
      2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
      2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
      2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
      2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
      2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
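
      To check a startup log programmatically rather than by eye, the lines can be parsed with regular expressions. The sketch below assumes the log format matches the sample above ("time - logger - level - message") and treats the "Scheduler started" message as the success marker. The function name summarize_startup is hypothetical and is not part of gala-anteater.

```python
import re

# "2022-09-01 17:53:38,658 - root - INFO - message"
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)
# "Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00"
EPOCH_RE = re.compile(
    r"Epoch\(s\):\s*(?P<epoch>\d+)\s+train Loss:\s*(?P<train>[\d.]+)\s+"
    r"validate Loss:\s*(?P<validate>[\d.]+)"
)


def summarize_startup(lines):
    """Collect per-epoch losses and report whether the scheduler started."""
    losses = []
    started = False
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip elided or malformed lines such as "..."
        message = match["message"]
        epoch = EPOCH_RE.search(message)
        if epoch:
            losses.append(
                (int(epoch["epoch"]), float(epoch["train"]), float(epoch["validate"]))
            )
        elif match["logger"] == "apscheduler.scheduler" and message == "Scheduler started":
            started = True
    return {"started": started, "losses": losses}
```

      A typical use is to open logs/anteater.log in the running directory and pass the file object to summarize_startup, since a file object iterates line by line.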
      

      Output Data

      If gala-anteater detects an exception, it sends the result to Kafka. The output data format is as follows:

      {
         "Timestamp":1659075600000,
         "Attributes":{
            "entity_id":"xxxxxx_sli_1513_18",
            "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
            "event_type":"app"
         },
         "Resource":{
            "anomaly_score":1.0,
            "anomaly_count":13,
            "total_count":13,
            "duration":60,
            "anomaly_ratio":1.0,
            "metric_label":{
               "machine_id":"1fd37742xxxx",
               "tgid":"1513",
               "conn_fd":"18"
            },
            "recommend_metrics":{
               "gala_gopher_tcp_link_notack_bytes":{
                  "label":{
                     "__name__":"gala_gopher_tcp_link_notack_bytes",
                     "client_ip":"x.x.x.165",
                     "client_port":"51352",
                     "hostname":"localhost.localdomain",
                     "instance":"x.x.x.172:8888",
                     "job":"prometheus-x.x.x.172",
                     "machine_id":"xxxxxx",
                     "protocol":"2",
                     "role":"0",
                     "server_ip":"x.x.x.172",
                     "server_port":"8888",
                     "tgid":"3381701"
                  },
                  "score":0.24421279500639545
               },
               ...
            },
            "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
         },
         "SeverityText":"WARN",
         "SeverityNumber":14,
         "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
      }
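
      A consumer of these events usually needs only a few fields. The sketch below parses one event with the Python standard library and ranks the recommended metrics by score. It assumes the field names shown in the sample above. Reading the message from Kafka is left out because the topic and client setup are not covered here, and the function name summarize_event is hypothetical.

```python
import json


def summarize_event(raw):
    """Extract the key fields of one gala-anteater event given as a JSON string."""
    event = json.loads(raw)
    resource = event["Resource"]
    # Rank the recommended metrics from highest to lowest score.
    recommended = sorted(
        (
            (name, info["score"])
            for name, info in resource.get("recommend_metrics", {}).items()
        ),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "timestamp": event["Timestamp"],
        "entity_id": event["Attributes"]["entity_id"],
        "severity": event["SeverityText"],
        "metric": resource["metrics"],
        "anomaly_score": resource["anomaly_score"],
        "machine_id": resource["metric_label"]["machine_id"],
        "recommended": recommended,
    }
```

      Note that the sample above contains a literal "..." to mark elided entries, so it must be completed or trimmed before it parses as valid JSON.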
      
