      Using gala-spider

      This chapter describes how to deploy and use gala-spider and gala-inference.

      gala-spider

      gala-spider draws OS-level topologies. It periodically obtains the data of all observed objects collected by gala-gopher (OS-level data collection software) at a given time point, calculates the topology relationships between them, and saves the generated topology to the graph database ArangoDB.

      Installation

      Configure the Yum sources.

      [oe-22.03-lts-sp1-everything] # openEuler 22.03-LTS-SP1 official repository
      name=oe-2203-lts-sp1-everything
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/everything/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-update] # openEuler 22.03-LTS-SP1 official Update repository
      name=oe-22.03-lts-sp1-epol-update
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/update/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-main] # openEuler 22.03-LTS-SP1 official EPOL repository
      name=oe-22.03-lts-sp1-epol-main
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      

      Install gala-spider.

      yum install gala-spider
      

      Configuration

      Configuration File Description

      The configuration file of gala-spider is /etc/gala-spider/gala-spider.yaml. The configuration items in this file are described as follows:

      • global: global configuration information.
        • data_source: database for collecting observation metrics. Currently, only prometheus is supported.
        • data_agent: agent for collecting observation metrics. Currently, only gala_gopher is supported.
      • spider:
        • log_conf: log configuration information.
          • log_path: log file path.
          • log_level: level of the logs to be printed. The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
          • max_size: log file size, in MB.
          • backup_count: number of backup log files.
      • storage: configuration information about the topology storage service.
        • period: storage period, in seconds, indicating the interval for storing the topology.
        • database: graph database for storage. Currently, only arangodb is supported.
        • db_conf: configuration information of the graph database.
          • url: URL of the graph database server.
          • db_name: name of the database where the topology is stored.
      • kafka: Kafka configuration information.
        • server: Kafka server address.
        • metadata_topic: topic name of the observed metadata messages.
        • metadata_group_id: consumer group ID of the observed metadata messages.
      • prometheus: Prometheus database configuration information.
        • base_url: URL of the Prometheus server.
        • instant_api: API for collecting data at a single time point.
        • range_api: API for collecting data in a time range.
        • step: collection time step, which is configured for range_api.
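      As a rough illustration of how base_url, instant_api, range_api, and step fit together, the sketch below builds query URLs for the standard Prometheus HTTP API. The function names are hypothetical and are not part of gala-spider; the query parameters (query, time, start, end, step) are those of the Prometheus API.

```python
from urllib.parse import urlencode


def _join(base_url: str, api_path: str) -> str:
    """Join the server URL and the API path with exactly one slash."""
    return base_url.rstrip("/") + "/" + api_path.lstrip("/")


def build_instant_url(base_url: str, instant_api: str, query: str, ts: int) -> str:
    """URL for querying a metric at a single time point (instant_api)."""
    return _join(base_url, instant_api) + "?" + urlencode({"query": query, "time": ts})


def build_range_url(base_url: str, range_api: str, query: str,
                    start: int, end: int, step: int) -> str:
    """URL for querying a metric over a time range (range_api) with a step in seconds."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return _join(base_url, range_api) + "?" + urlencode(params)
```

      With the values from the configuration file, build_instant_url("http://localhost:9090/", "/api/v1/query", ...) targets the instant query endpoint, and build_range_url uses step as the sampling resolution.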

      Configuration File Example

      global:
          data_source: "prometheus"
          data_agent: "gala_gopher"
      
      prometheus:
          base_url: "http://localhost:9090/"
          instant_api: "/api/v1/query"
          range_api: "/api/v1/query_range"
          step: 1
      
      spider:
          log_conf:
              log_path: "/var/log/gala-spider/spider.log"
              # log level: DEBUG/INFO/WARNING/ERROR/CRITICAL
              log_level: INFO
              # unit: MB
              max_size: 10
              backup_count: 10
      
      storage:
          # unit: second
          period: 60
          database: arangodb
          db_conf:
              url: "http://localhost:8529"
              db_name: "spider"
      
      kafka:
          server: "localhost:9092"
          metadata_topic: "gala_gopher_metadata"
          metadata_group_id: "metadata-spider"
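      Before starting the service, it can help to sanity-check the values against the constraints listed in the configuration description. The following is a minimal sketch that works on an already parsed configuration (a plain dict); validate_spider_config is a hypothetical helper, not a gala-spider function, and it only checks the constraints stated in this section.

```python
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_spider_config(conf: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    global_conf = conf.get("global", {})
    if global_conf.get("data_source") != "prometheus":
        problems.append("global.data_source must be 'prometheus'")
    if global_conf.get("data_agent") != "gala_gopher":
        problems.append("global.data_agent must be 'gala_gopher'")

    log_conf = conf.get("spider", {}).get("log_conf", {})
    if log_conf.get("log_level") not in ALLOWED_LOG_LEVELS:
        problems.append("spider.log_conf.log_level is not a supported level")

    storage = conf.get("storage", {})
    if storage.get("database") != "arangodb":
        problems.append("storage.database must be 'arangodb'")
    period = storage.get("period")
    if not isinstance(period, int) or period <= 0:
        problems.append("storage.period must be a positive number of seconds")

    return problems
```

      The dict can come from any YAML parser; an empty result only means that the checks above passed, not that the servers are reachable.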
      

      Start

      • Run the following command to start gala-spider.

        spider-storage
        
      • Use the systemd service to start gala-spider.

        systemctl start gala-spider
        

      How to Use

      Deployment of External Dependent Software

      gala-spider interacts with several external software components at run time. Deploy these components before starting gala-spider. The following figure shows the software dependencies of gala-spider.

      [Figure: gala-spider architecture and software dependencies]

      The dotted box on the right indicates the two functional components of gala-spider. The green parts indicate the external components that gala-spider directly depends on, and the gray rectangles indicate the external components that gala-spider indirectly depends on.

      • spider-storage: core component of gala-spider, which provides the topology storage function.
        1. Obtains the metadata of the observation object from Kafka.
        2. Obtains information about all observation object instances from Prometheus.
        3. Saves the generated topology to the graph database ArangoDB.
      • gala-inference: core component of gala-spider, which provides the root cause locating function. It subscribes to abnormal KPI events from Kafka to trigger the root cause locating process of abnormal KPIs, constructs a fault propagation graph based on the topology obtained from the ArangoDB, and outputs the root cause locating result to Kafka.
      • prometheus: time series database. The observation metric data collected by the gala-gopher component is reported to Prometheus for further processing.
      • kafka: messaging middleware, which is used to store the observation object metadata reported by gala-gopher, exception events reported by the exception detection component gala-anteater, and root cause locating results reported by the gala-inference component.
      • arangodb: graph database, which is used to store the topology generated by spider-storage.
      • gala-gopher: data collection component. It must be deployed in advance.
      • arangodb-ui: UI provided by ArangoDB, which can be used to query topologies.

      The two functional components in gala-spider are released as independent software packages.

      spider-storage: corresponds to the gala-spider software package in this section.

      gala-inference: corresponds to the gala-inference software package.

      For details about how to deploy the gala-gopher software, see Using gala-gopher. This section only describes how to deploy ArangoDB.

      The current ArangoDB version is 3.8.7, which has the following requirements on the operating environment:

      • Only the x86 system is supported.
      • GCC 10 or later is required.

      For details about ArangoDB deployment, see Deployment in the ArangoDB official document.

      The RPM-based ArangoDB deployment process is as follows:

      1. Configure the Yum sources.

        [oe-22.03-lts-sp1-everything] # openEuler 22.03-LTS-SP1 official repository
        name=oe-2203-lts-sp1-everything
        baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/everything/x86_64/
        enabled=1
        gpgcheck=0
        priority=1
        
        [oe-22.03-lts-sp1-epol-main] # openEuler 22.03-LTS-SP1 official EPOL repository
        name=oe-22.03-lts-sp1-epol-main
        baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/main/x86_64/
        enabled=1
        gpgcheck=0
        priority=1
        
      2. Install arangodb3.

        yum install arangodb3
        
      3. Modify the configurations.

        The configuration file of the arangodb3 server is /etc/arangodb3/arangod.conf. You need to modify the following configurations:

        • endpoint: address (protocol, IP address, and port) on which the arangodb3 server listens.
        • authentication: whether identity authentication is required for accessing the arangodb3 server. Currently, gala-spider does not support identity authentication. Therefore, set authentication to false.

        The following is an example.

        [server]
        endpoint = tcp://0.0.0.0:8529
        authentication = false
        
      4. Start arangodb3.

        systemctl start arangodb3
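      After arangodb3 is started, you may want to confirm that the server answers on the configured endpoint. The sketch below queries the ArangoDB version endpoint (/_api/version) using only the Python standard library. It assumes authentication is disabled as configured above; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def version_url(db_url: str) -> str:
    """URL of the ArangoDB version endpoint for the given server URL."""
    return db_url.rstrip("/") + "/_api/version"


def fetch_arangodb_version(db_url: str, timeout: float = 5.0) -> str:
    """Return the server version string; raises an error if the server is unreachable."""
    with urlopen(version_url(db_url), timeout=timeout) as resp:
        return json.load(resp)["version"]


# Example call (requires a running server):
#   fetch_arangodb_version("http://localhost:8529")
```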
        

      Modifying gala-spider Configuration Items

      After the dependent software is started, you need to modify some configuration items in the gala-spider configuration file. The following is an example.

      Configure the Kafka server address.

      kafka:
          server: "localhost:9092"
      

      Configure the Prometheus server address.

      prometheus:
          base_url: "http://localhost:9090/"
      

      Configure the ArangoDB server address.

      storage:
          db_conf:
              url: "http://localhost:8529"
      

      Starting the Service

      Run systemctl start gala-spider to start the service. Run systemctl status gala-spider to check the startup status. If the following information is displayed, the startup is successful:

      $ systemctl status gala-spider
      ● gala-spider.service - a-ops gala spider service
           Loaded: loaded (/usr/lib/systemd/system/gala-spider.service; enabled; vendor preset: disabled)
           Active: active (running) since Tue 2022-08-30 17:28:38 CST; 1 day 22h ago
         Main PID: 2263793 (spider-storage)
            Tasks: 3 (limit: 98900)
           Memory: 44.2M
           CGroup: /system.slice/gala-spider.service
                   └─2263793 /usr/bin/python3 /usr/bin/spider-storage
      

      Output Example

      You can query the topology generated by gala-spider on the UI provided by ArangoDB. The procedure is as follows:

      1. Enter the IP address of the ArangoDB server in the address box of the browser, for example, http://localhost:8529. The ArangoDB UI is displayed.

      2. Click DB in the upper right corner of the page to switch to the spider database.

      3. On the COLLECTIONS page, you can view the collections of observation object instances and topology relationships stored in different time segments, as shown in the following figure.

        [Figure: spider topology]

      4. You can query the stored topology using the AQL statements provided by ArangoDB. For details, see the AQL Documentation.
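      AQL statements can also be submitted over the ArangoDB HTTP cursor API instead of the UI. The following sketch only builds the request (URL and JSON body) for a simple AQL query that returns a few documents from one collection. The collection name is a placeholder: check the COLLECTIONS page for the names actually created by gala-spider. build_cursor_request is a hypothetical helper.

```python
import json


def build_cursor_request(db_url: str, db_name: str, collection: str, limit: int = 5):
    """Return (url, body) for a POST to the ArangoDB cursor API."""
    url = f"{db_url.rstrip('/')}/_db/{db_name}/_api/cursor"
    payload = {
        # @@coll is an AQL collection bind parameter, @n a value bind parameter.
        "query": "FOR doc IN @@coll LIMIT @n RETURN doc",
        "bindVars": {"@coll": collection, "n": limit},
    }
    return url, json.dumps(payload).encode("utf-8")


# "my_collection" is a placeholder, not a real gala-spider collection name.
url, body = build_cursor_request("http://localhost:8529", "spider", "my_collection")
```

      The body can then be sent with any HTTP client as a POST request with the Content-Type application/json.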

      gala-inference

      gala-inference locates the root causes of abnormal KPIs. It uses the exception detection result and the topology as input and outputs the root cause locating result to Kafka. The gala-inference component is part of the gala-spider project.

      Installation

      Configure the Yum sources.

      [oe-22.03-lts-sp1-everything] # openEuler 22.03-LTS-SP1 officially released repository
      name=oe-2203-lts-sp1-everything
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/everything/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-update] # openEuler 22.03-LTS-SP1 Update officially released repository
      name=oe-22.03-lts-sp1-epol-update
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/update/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      
      [oe-22.03-lts-sp1-epol-main] # openEuler 22.03-LTS-SP1 EPOL officially released repository
      name=oe-22.03-lts-sp1-epol-main
      baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP1/EPOL/main/x86_64/
      enabled=1
      gpgcheck=0
      priority=1
      

      Install gala-inference.

      yum install gala-inference
      

      Configuration

      Configuration File Description

      The configuration items in the gala-inference configuration file /etc/gala-inference/gala-inference.yaml are described as follows:

      • inference: configuration information about the root cause locating algorithm.
        • tolerated_bias: tolerable time offset for querying the topology at the exception time point, in seconds.
        • topo_depth: maximum depth for topology query.
        • root_topk: top K root cause metrics generated in the root cause locating result.
        • infer_policy: root cause derivation policy, which can be dfs or rw.
        • sample_duration: sampling period of historical metric data, in seconds.
        • evt_valid_duration: valid period of abnormal system metric events during root cause locating, in seconds.
        • evt_aging_duration: aging period of abnormal metric events during root cause locating, in seconds.
      • kafka: Kafka configuration information.
        • server: Kafka server address.
        • metadata_topic: configuration information about the observed metadata messages.
          • topic_id: topic name of the observed metadata messages.
          • group_id: consumer group ID of the observed metadata messages.
        • abnormal_kpi_topic: configuration information about abnormal KPI event messages.
          • topic_id: topic name of the abnormal KPI event messages.
          • group_id: consumer group ID of the abnormal KPI event messages.
        • abnormal_metric_topic: configuration information about abnormal metric event messages.
          • topic_id: topic name of the abnormal metric event messages.
          • group_id: consumer group ID of the abnormal system metric event messages.
          • consumer_to: timeout interval for consuming abnormal system metric event messages, in seconds.
        • inference_topic: configuration information about the output event messages of the root cause locating result.
          • topic_id: topic name of the output event messages of the root cause locating result.
      • arangodb: configuration information about the ArangoDB graph database, which is used to query sub-topologies required for root cause locating.
        • url: URL of the graph database server.
        • db_name: name of the database where the topology is stored.
      • log: log configuration information.
        • log_path: log file path.
        • log_level: level of the logs to be printed. The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
        • max_size: log file size, in MB.
        • backup_count: number of backup log files.
      • prometheus: Prometheus database configuration information, which is used to obtain historical time series data of metrics.
        • base_url: URL of the Prometheus server.
        • range_api: API for collecting data in a time range.
        • step: collection time step, which is configured for range_api.
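      To make the role of tolerated_bias more concrete, the sketch below shows one plausible reading of it: among the time points at which a topology was stored, pick the one closest to the exception time point, and give up if none lies within tolerated_bias seconds. This is an illustrative assumption, not the gala-inference implementation; the function name is hypothetical.

```python
from typing import Iterable, Optional


def pick_topology_timestamp(abnormal_ts: int,
                            topo_timestamps: Iterable[int],
                            tolerated_bias: int) -> Optional[int]:
    """Return the stored topology timestamp nearest to abnormal_ts.

    Returns None when no stored timestamp is within tolerated_bias seconds.
    """
    candidates = [ts for ts in topo_timestamps
                  if abs(ts - abnormal_ts) <= tolerated_bias]
    if not candidates:
        return None
    return min(candidates, key=lambda ts: abs(ts - abnormal_ts))
```

      Under this reading, a larger tolerated_bias makes it more likely that some topology is found, at the cost of using a topology that may be further from the moment of the exception.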

      Configuration File Example

      inference:
        # Tolerable time offset for querying the topology at the exception time point, in seconds.
        tolerated_bias: 120
        topo_depth: 10
        root_topk: 3
        infer_policy: "dfs"
        # Unit: second
        sample_duration: 600
        # Valid period of abnormal metric events during root cause locating, in seconds.
        evt_valid_duration: 120
        # Aging period of abnormal metric events, in seconds.
        evt_aging_duration: 600
      
      kafka:
        server: "localhost:9092"
        metadata_topic:
          topic_id: "gala_gopher_metadata"
          group_id: "metadata-inference"
        abnormal_kpi_topic:
          topic_id: "gala_anteater_hybrid_model"
          group_id: "abn-kpi-inference"
        abnormal_metric_topic:
          topic_id: "gala_anteater_metric"
          group_id: "abn-metric-inference"
          consumer_to: 1
        inference_topic:
          topic_id: "gala_cause_inference"
      
      arangodb:
        url: "http://localhost:8529"
        db_name: "spider"
      
      log:
        log_path: "/var/log/gala-inference/inference.log"
        # log level: DEBUG/INFO/WARNING/ERROR/CRITICAL
        log_level: INFO
        # unit: MB
        max_size: 10
        backup_count: 10
      
      prometheus:
        base_url: "http://localhost:9090/"
        range_api: "/api/v1/query_range"
        step: 5
      

      Start

      • Run the following command to start gala-inference.

        gala-inference
        
      • Use the systemd service to start gala-inference.

        systemctl start gala-inference
        

      How to Use

      Dependent Software Deployment

      gala-inference has the same run-time dependencies as gala-spider. For details, see Deployment of External Dependent Software. In addition, gala-inference indirectly depends on gala-spider and gala-anteater, so deploy both in advance.

      Modifying gala-inference Configuration Items

      Modify some configuration items in the gala-inference configuration file. The following is an example.

      Configure the Kafka server address.

      kafka:
        server: "localhost:9092"
      

      Configure the Prometheus server address.

      prometheus:
        base_url: "http://localhost:9090/"
      

      Configure the ArangoDB server address.

      arangodb:
        url: "http://localhost:8529"
      

      Starting the Service

      Run systemctl start gala-inference to start the service. Run systemctl status gala-inference to check the startup status. If the following information is displayed, the startup is successful:

      $ systemctl status gala-inference
      ● gala-inference.service - a-ops gala inference service
           Loaded: loaded (/usr/lib/systemd/system/gala-inference.service; enabled; vendor preset: disabled)
           Active: active (running) since Tue 2022-08-30 17:55:33 CST; 1 day 22h ago
         Main PID: 2445875 (gala-inference)
            Tasks: 10 (limit: 98900)
           Memory: 48.7M
           CGroup: /system.slice/gala-inference.service
                   └─2445875 /usr/bin/python3 /usr/bin/gala-inference
      

      Output Example

      When the exception detection module gala-anteater detects a KPI exception, it exports the corresponding abnormal KPI event to Kafka. gala-inference continuously listens for abnormal KPI event messages. When it receives one, it triggers root cause locating and exports the result to Kafka. You can view the root cause locating result on the Kafka server. The basic procedure is as follows:

      1. If Kafka is installed using the source code, go to the Kafka installation directory.

        cd /root/kafka_2.13-2.8.0
        
      2. Run the command for consuming the topic to obtain the output of root cause locating.

        ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic gala_cause_inference
        

        Output example:

        {
          "Timestamp": 1661853360000,
          "event_id": "1661853360000_1fd37742xxxx_sli_12154_19",
          "Atrributes": {
            "event_id": "1661853360000_1fd37742xxxx_sli_12154_19"
          },
          "Resource": {
            "abnormal_kpi": {
              "metric_id": "gala_gopher_sli_rtt_nsec",
              "entity_id": "1fd37742xxxx_sli_12154_19",
              "timestamp": 1661853360000,
              "metric_labels": {
                "machine_id": "1fd37742xxxx",
                "tgid": "12154",
                "conn_fd": "19"
              }
            },
            "cause_metrics": [
              {
                "metric_id": "gala_gopher_proc_write_bytes",
                "entity_id": "1fd37742xxxx_proc_12154",
                "metric_labels": {
                  "__name__": "gala_gopher_proc_write_bytes",
                  "cmdline": "/opt/redis/redis-server x.x.x.172:3742",
                  "comm": "redis-server",
                  "container_id": "5a10635e2c43",
                  "hostname": "openEuler",
                  "instance": "x.x.x.172:8888",
                  "job": "prometheus",
                  "machine_id": "1fd37742xxxx",
                  "pgid": "12154",
                  "ppid": "12126",
                  "tgid": "12154"
                },
                "timestamp": 1661853360000,
                "path": [
                  {
                    "metric_id": "gala_gopher_proc_write_bytes",
                    "entity_id": "1fd37742xxxx_proc_12154",
                    "metric_labels": {
                      "__name__": "gala_gopher_proc_write_bytes",
                      "cmdline": "/opt/redis/redis-server x.x.x.172:3742",
                      "comm": "redis-server",
                      "container_id": "5a10635e2c43",
                      "hostname": "openEuler",
                      "instance": "x.x.x.172:8888",
                      "job": "prometheus",
                      "machine_id": "1fd37742xxxx",
                      "pgid": "12154",
                      "ppid": "12126",
                      "tgid": "12154"
                    },
                    "timestamp": 1661853360000
                  },
                  {
                    "metric_id": "gala_gopher_sli_rtt_nsec",
                    "entity_id": "1fd37742xxxx_sli_12154_19",
                    "metric_labels": {
                      "machine_id": "1fd37742xxxx",
                      "tgid": "12154",
                      "conn_fd": "19"
                    },
                    "timestamp": 1661853360000
                  }
                ]
              }
            ]
          },
          "SeverityText": "WARN",
          "SeverityNumber": 13,
          "Body": "A cause inferring event for an abnormal event"
        }
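      The output event is plain JSON, so it is straightforward to post-process. The sketch below extracts the abnormal KPI and, for each cause metric, the propagation path as a list of entity IDs. The field names are taken from the output example above; SAMPLE is a shortened copy of that example, and summarize_event is a hypothetical helper.

```python
import json

# Shortened copy of the output example (metric_labels and some fields omitted).
SAMPLE = """
{
  "Timestamp": 1661853360000,
  "Resource": {
    "abnormal_kpi": {
      "metric_id": "gala_gopher_sli_rtt_nsec",
      "entity_id": "1fd37742xxxx_sli_12154_19"
    },
    "cause_metrics": [
      {
        "metric_id": "gala_gopher_proc_write_bytes",
        "entity_id": "1fd37742xxxx_proc_12154",
        "path": [
          {"metric_id": "gala_gopher_proc_write_bytes",
           "entity_id": "1fd37742xxxx_proc_12154"},
          {"metric_id": "gala_gopher_sli_rtt_nsec",
           "entity_id": "1fd37742xxxx_sli_12154_19"}
        ]
      }
    ]
  }
}
"""


def summarize_event(raw: str) -> dict:
    """Extract the abnormal KPI and the cause metrics with their paths."""
    resource = json.loads(raw)["Resource"]
    kpi = resource["abnormal_kpi"]
    causes = []
    for cause in resource.get("cause_metrics", []):
        causes.append({
            "metric_id": cause["metric_id"],
            "entity_id": cause["entity_id"],
            # Entities on the path from the cause metric to the abnormal KPI.
            "path": [node["entity_id"] for node in cause.get("path", [])],
        })
    return {"abnormal_kpi": (kpi["metric_id"], kpi["entity_id"]), "causes": causes}


summary = summarize_event(SAMPLE)
```

      In the example, the path starts at the cause metric (gala_gopher_proc_write_bytes of the redis-server process) and ends at the abnormal KPI (gala_gopher_sli_rtt_nsec).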
        
