Using gala-spider 
This chapter describes how to deploy and use gala-spider and gala-inference.
gala-spider 
gala-spider draws OS-level topologies. It periodically obtains the data of all observed objects collected by gala-gopher (an OS-level data collection tool) at a given time point and calculates the topology relationships between them. The generated topology is saved to the graph database ArangoDB.
Installation 
Mount the Yum sources.
[oe-2209]      # openEuler 22.09 officially released repository
name=oe2209
baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
enabled=1
gpgcheck=0
priority=1
[oe-2209:Epol] # openEuler 22.09: Epol officially released repository
name=oe2209_epol
baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
enabled=1
gpgcheck=0
priority=1
Install gala-spider.
# yum install gala-spider
Configuration
Configuration File Description 
The configuration file of gala-spider is /etc/gala-spider/gala-spider.yaml. The configuration items in this file are described as follows:
- global: global configuration information.
  - data_source: database for collecting observation metrics. Currently, only prometheus is supported.
  - data_agent: agent for collecting observation metrics. Currently, only gala_gopher is supported.
- spider: spider configuration information.
  - log_conf: log configuration information.
    - log_path: log file path.
    - log_level: level of the logs to be printed. The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
    - max_size: log file size, in MB.
    - backup_count: number of backup log files.
- storage: configuration information about the topology storage service.
  - period: storage period, in seconds, indicating the interval for storing the topology.
  - database: graph database for storage. Currently, only arangodb is supported.
  - db_conf: configuration information of the graph database.
    - url: IP address of the graph database server.
    - db_name: name of the database where the topology is stored.
- kafka: Kafka configuration information.
  - server: Kafka server address.
  - metadata_topic: topic name of the observed metadata messages.
  - metadata_group_id: consumer group ID of the observed metadata messages.
- prometheus: Prometheus database configuration information.
  - base_url: IP address of the Prometheus server.
  - instant_api: API for collecting data at a single time point.
  - range_api: API for collecting data in a time range.
  - step: collection time step, which is configured for range_api.
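As a rough illustration of the structure described above, the following Python sketch checks that a configuration already loaded into a dictionary (for example, by a YAML parser) contains the documented keys and supported values. The helper name check_spider_config and the dictionary-based input are assumptions made for this example; they are not part of gala-spider.

```python
# Hypothetical helper, not part of gala-spider: validates a dict whose layout
# mirrors /etc/gala-spider/gala-spider.yaml as described in this section.

REQUIRED_KEYS = {
    "global": ["data_source", "data_agent"],
    "spider": ["log_conf"],
    "storage": ["period", "database", "db_conf"],
    "kafka": ["server", "metadata_topic", "metadata_group_id"],
    "prometheus": ["base_url", "instant_api", "range_api", "step"],
}

# Values that the documentation lists as the only supported ones.
SUPPORTED_VALUES = {
    ("global", "data_source"): {"prometheus"},
    ("global", "data_agent"): {"gala_gopher"},
    ("storage", "database"): {"arangodb"},
}

LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def _section(conf, name):
    """Return conf[name] if it is a dict, otherwise an empty dict."""
    body = conf.get(name)
    return body if isinstance(body, dict) else {}


def check_spider_config(conf):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if not isinstance(conf.get(section), dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in conf[section]:
                problems.append(f"missing key: {section}.{key}")
    for (section, key), allowed in SUPPORTED_VALUES.items():
        value = _section(conf, section).get(key)
        if value is not None and value not in allowed:
            problems.append(f"unsupported value for {section}.{key}: {value}")
    log_level = _section(_section(conf, "spider"), "log_conf").get("log_level")
    if log_level is not None and log_level not in LOG_LEVELS:
        problems.append(f"unsupported value for spider.log_conf.log_level: {log_level}")
    return problems


# Example input using the same values as the documented configuration file.
EXAMPLE_CONF = {
    "global": {"data_source": "prometheus", "data_agent": "gala_gopher"},
    "prometheus": {
        "base_url": "http://localhost:9090/",
        "instant_api": "/api/v1/query",
        "range_api": "/api/v1/query_range",
        "step": 1,
    },
    "spider": {
        "log_conf": {
            "log_path": "/var/log/gala-spider/spider.log",
            "log_level": "INFO",
            "max_size": 10,
            "backup_count": 10,
        }
    },
    "storage": {
        "period": 60,
        "database": "arangodb",
        "db_conf": {"url": "http://localhost:8529", "db_name": "spider"},
    },
    "kafka": {
        "server": "localhost:9092",
        "metadata_topic": "gala_gopher_metadata",
        "metadata_group_id": "metadata-spider",
    },
}

print(check_spider_config(EXAMPLE_CONF))
```

The check only covers key presence and the values that this section lists as supported; it does not verify that the configured servers are reachable.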
 
Configuration File Example 
global:
    data_source: "prometheus"
    data_agent: "gala_gopher"
prometheus:
    base_url: "http://localhost:9090/"
    instant_api: "/api/v1/query"
    range_api: "/api/v1/query_range"
    step: 1
spider:
    log_conf:
        log_path: "/var/log/gala-spider/spider.log"
        # log level: DEBUG/INFO/WARNING/ERROR/CRITICAL
        log_level: INFO
        # unit: MB
        max_size: 10
        backup_count: 10
storage:
    # unit: second
    period: 60
    database: arangodb
    db_conf:
        url: "http://localhost:8529"
        db_name: "spider"
kafka:
    server: "localhost:9092"
    metadata_topic: "gala_gopher_metadata"
    metadata_group_id: "metadata-spider"
Start
- Run the following command to start gala-spider.
  # spider-storage
- Use the systemd service to start gala-spider.
  # systemctl start gala-spider
How to Use 
Deployment of External Dependent Software 
gala-spider interacts with several external software components at run time. Therefore, deploy the software that gala-spider depends on before starting gala-spider. The following figure shows the software dependencies of gala-spider.
The dotted box on the right indicates the two functional components of gala-spider. The green parts indicate the external components that gala-spider directly depends on, and the gray rectangles indicate the external components that gala-spider indirectly depends on.
- spider-storage: core component of gala-spider, which provides the topology storage function.
  - Obtains the metadata of the observation object from Kafka.
  - Obtains information about all observation object instances from Prometheus.
  - Saves the generated topology to the graph database ArangoDB.
- gala-inference: core component of gala-spider, which provides the root cause locating function. It subscribes to abnormal KPI events from Kafka to trigger the root cause locating process of abnormal KPIs, constructs a fault propagation graph based on the topology obtained from the ArangoDB, and outputs the root cause locating result to Kafka.
- prometheus: time series database. The observation metric data collected by the gala-gopher component is reported to Prometheus for further processing.
- kafka: messaging middleware, which is used to store the observation object metadata reported by gala-gopher, exception events reported by the exception detection component gala-anteater, and root cause locating results reported by the cause-inference component.
- arangodb: graph database, which is used to store the topology generated by spider-storage.
- gala-gopher: data collection component. It must be deployed in advance.
- arangodb-ui: UI provided by ArangoDB, which can be used to query topologies.
The two functional components in gala-spider are released as independent software packages.
spider-storage: corresponds to the gala-spider software package in this section.
gala-inference: corresponds to the gala-inference software package.
For details about how to deploy the gala-gopher software, see Using gala-gopher. This section only describes how to deploy ArangoDB.
The current ArangoDB version is 3.8.7, which has the following requirements on the operating environment:
- Only the x86 system is supported.
- GCC 10 or later
For details about ArangoDB deployment, see Deployment in the ArangoDB official document.
The RPM-based ArangoDB deployment process is as follows:
- Configure the Yum sources.
  [oe-2209]      # openEuler 22.09 officially released repository
  name=oe2209
  baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
  enabled=1
  gpgcheck=0
  priority=1
  [oe-2209:Epol] # openEuler 22.09: Epol officially released repository
  name=oe2209_epol
  baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
  enabled=1
  gpgcheck=0
  priority=1
- Install arangodb3.
  # yum install arangodb3
- Modify the configurations.
  - The configuration file of the arangodb3 server is /etc/arangodb3/arangod.conf. You need to modify the following configurations:
    - endpoint: IP address of the arangodb3 server.
    - authentication: whether identity authentication is required for accessing the arangodb3 server. Currently, gala-spider does not support identity authentication. Therefore, set authentication to false.
  - The following is an example.
    [server]
    endpoint = tcp://0.0.0.0:8529
    authentication = false
- Start arangodb3.
  # systemctl start arangodb3
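The two settings modified above can be checked with a short script. The following sketch parses arangod.conf-style text with Python's standard configparser module and reports the endpoint and whether authentication is enabled. The function name read_server_settings is hypothetical, and the sketch assumes the file follows the plain [section] and key = value layout shown in the example.

```python
import configparser


def read_server_settings(conf_text):
    """Parse arangod.conf-style text and return (endpoint, authentication_enabled)."""
    # strict=False tolerates repeated keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    endpoint = parser.get("server", "endpoint")
    # Assumption of this sketch: authentication counts as enabled when the key is absent.
    auth_enabled = parser.getboolean("server", "authentication", fallback=True)
    return endpoint, auth_enabled


# Same content as the configuration example in this section.
SAMPLE = """\
[server]
endpoint = tcp://0.0.0.0:8529
authentication = false
"""

endpoint, auth_enabled = read_server_settings(SAMPLE)
if auth_enabled:
    print("authentication is enabled; gala-spider cannot connect")
else:
    print(f"authentication is disabled, endpoint is {endpoint}")
```

To check a real deployment, read the text from /etc/arangodb3/arangod.conf and pass it to the function instead of SAMPLE.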
Modifying gala-spider Configuration Items 
After the dependent software is started, you need to modify some configuration items in the gala-spider configuration file. The following is an example.
Configure the Kafka server address.
kafka:
    server: "localhost:9092"
Configure the Prometheus server address.
prometheus:
    base_url: "http://localhost:9090/"
Configure the IP address of the ArangoDB server.
storage:
    db_conf:
        url: "http://localhost:8529"
Starting the Service
Run systemctl start gala-spider to start the service. Run systemctl status gala-spider to check the startup status. If the following information is displayed, the startup is successful:
[root@openEuler ~]# systemctl status gala-spider
● gala-spider.service - a-ops gala spider service
     Loaded: loaded (/usr/lib/systemd/system/gala-spider.service; enabled; vendor preset: disabled)
     Active: active (running) since Tue 2022-08-30 17:28:38 CST; 1 day 22h ago
   Main PID: 2263793 (spider-storage)
      Tasks: 3 (limit: 98900)
     Memory: 44.2M
     CGroup: /system.slice/gala-spider.service
             └─2263793 /usr/bin/python3 /usr/bin/spider-storage
Output Example
You can query the topology generated by gala-spider on the UI provided by ArangoDB. The procedure is as follows:
- Enter the IP address of the ArangoDB server in the address box of the browser, for example, http://localhost:8529. The ArangoDB UI is displayed. 
- Click DB in the upper right corner of the page to switch to the spider database. 
- On the COLLECTIONS page, you can view the collections of observation object instances and topology relationships stored in different time segments, as shown in the following figure. 
- You can query the stored topology using the AQL statements provided by ArangoDB. For details, see the AQL Documentation. 
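Besides the UI, AQL statements can also be submitted over the HTTP interface of ArangoDB. The following sketch builds a request for the cursor API (POST /_db/{database}/_api/cursor) that returns a few documents from one collection. The collection name example_collection is a placeholder, because the actual collection names depend on what gala-spider has stored; the function names are also hypothetical. No credentials are sent, which matches the authentication = false setting described earlier.

```python
import json
import urllib.request


def build_cursor_request(base_url, db_name, collection, limit=5):
    """Build a POST request that runs a simple AQL query through the cursor API."""
    url = f"{base_url.rstrip('/')}/_db/{db_name}/_api/cursor"
    body = {
        # @@coll is an AQL collection bind parameter, @n a value bind parameter.
        "query": "FOR doc IN @@coll LIMIT @n RETURN doc",
        "bindVars": {"@coll": collection, "n": limit},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_documents(request, timeout=10):
    """Send the request and return the 'result' list of the first batch."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["result"]


# "example_collection" is a placeholder; use a name shown on the COLLECTIONS page.
request = build_cursor_request("http://localhost:8529", "spider", "example_collection")
print(request.full_url)
# documents = fetch_documents(request)  # requires a running ArangoDB server
```

Only the first result batch is read here; larger result sets would need the follow-up cursor requests described in the ArangoDB HTTP API documentation.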
gala-inference 
gala-inference provides the capability of locating root causes of abnormal KPIs. It uses the exception detection result and topology as the input and outputs the root cause locating result to Kafka. The gala-inference component is archived in the gala-spider project.
Installation 
Mount the Yum sources.
[oe-2209]      # openEuler 22.09 officially released repository
name=oe2209
baseurl=http://119.3.219.20:82/openEuler:/22.09/standard_x86_64
enabled=1
gpgcheck=0
priority=1
[oe-2209:Epol] # openEuler 22.09: Epol officially released repository
name=oe2209_epol
baseurl=http://119.3.219.20:82/openEuler:/22.09:/Epol/standard_x86_64/
enabled=1
gpgcheck=0
priority=1
Install gala-inference.
# yum install gala-inference
Configuration
Configuration File Description 
The configuration items in the gala-inference configuration file /etc/gala-inference/gala-inference.yaml are described as follows:
- inference: configuration information about the root cause locating algorithm.
  - tolerated_bias: tolerable time offset for querying the topology at the exception time point, in seconds.
  - topo_depth: maximum depth for topology query.
  - root_topk: top K root cause metrics generated in the root cause locating result.
  - infer_policy: root cause derivation policy, which can be dfs or rw.
  - sample_duration: sampling period of historical metric data, in seconds.
  - evt_valid_duration: valid period of abnormal system metric events during root cause locating, in seconds.
  - evt_aging_duration: aging period of abnormal metric events during root cause locating, in seconds.
- kafka: Kafka configuration information.
  - server: IP address of the Kafka server.
  - metadata_topic: configuration information about the observed metadata messages.
    - topic_id: topic name of the observed metadata messages.
    - group_id: consumer group ID of the observed metadata messages.
  - abnormal_kpi_topic: configuration information about abnormal KPI event messages.
    - topic_id: topic name of the abnormal KPI event messages.
    - group_id: consumer group ID of the abnormal KPI event messages.
  - abnormal_metric_topic: configuration information about abnormal metric event messages.
    - topic_id: topic name of the abnormal metric event messages.
    - group_id: consumer group ID of the abnormal system metric event messages.
    - consumer_to: timeout interval for consuming abnormal system metric event messages, in seconds.
  - inference_topic: configuration information about the output event messages of the root cause locating result.
    - topic_id: topic name of the output event messages of the root cause locating result.
- arangodb: configuration information about the ArangoDB graph database, which is used to query sub-topologies required for root cause locating.
  - url: IP address of the graph database server.
  - db_name: name of the database where the topology is stored.
- log_conf: log configuration information.
  - log_path: log file path.
  - log_level: level of the logs to be printed. The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
  - max_size: log file size, in MB.
  - backup_count: number of backup log files.
- prometheus: Prometheus database configuration information, which is used to obtain historical time series data of metrics.
  - base_url: IP address of the Prometheus server.
  - range_api: API for collecting data in a time range.
  - step: collection time step, which is configured for range_api.
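To show how the prometheus items above fit together, the following sketch combines base_url, range_api, and step into a Prometheus range query URL. The query, start, end, and step parameters belong to the Prometheus HTTP API; the function name build_range_query_url and the choice of metric and time window are assumptions for illustration. How gala-inference builds its own queries internally is not described in this document.

```python
from urllib.parse import urlencode


def build_range_query_url(prometheus_conf, metric, start_ts, end_ts):
    """Join base_url and range_api, then append the range query parameters."""
    base = prometheus_conf["base_url"].rstrip("/")
    path = "/" + prometheus_conf["range_api"].lstrip("/")
    params = urlencode(
        {
            "query": metric,
            "start": start_ts,  # Unix timestamp, in seconds
            "end": end_ts,  # Unix timestamp, in seconds
            "step": prometheus_conf["step"],  # resolution step, in seconds
        }
    )
    return f"{base}{path}?{params}"


# Values taken from the documented gala-inference configuration example.
PROMETHEUS_CONF = {
    "base_url": "http://localhost:9090/",
    "range_api": "/api/v1/query_range",
    "step": 5,
}

# Hypothetical window: the 600 seconds before an example event time.
print(build_range_query_url(PROMETHEUS_CONF, "gala_gopher_sli_rtt_nsec", 1661852760, 1661853360))
```

The resulting URL can be opened with any HTTP client against a running Prometheus server to inspect the raw time series.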
 
Configuration File Example 
inference:
  # Tolerable time offset for querying the topology at the exception time point, in seconds.
  tolerated_bias: 120
  topo_depth: 10
  root_topk: 3
  infer_policy: "dfs"
  # Unit: second
  sample_duration: 600
  # Valid period of abnormal metric events during root cause locating, in seconds.
  evt_valid_duration: 120
  # Aging period of abnormal metric events, in seconds.
  evt_aging_duration: 600
kafka:
  server: "localhost:9092"
  metadata_topic:
    topic_id: "gala_gopher_metadata"
    group_id: "metadata-inference"
  abnormal_kpi_topic:
    topic_id: "gala_anteater_hybrid_model"
    group_id: "abn-kpi-inference"
  abnormal_metric_topic:
    topic_id: "gala_anteater_metric"
    group_id: "abn-metric-inference"
    consumer_to: 1
  inference_topic:
    topic_id: "gala_cause_inference"
arangodb:
  url: "http://localhost:8529"
  db_name: "spider"
log:
  log_path: "/var/log/gala-inference/inference.log"
  # log level: DEBUG/INFO/WARNING/ERROR/CRITICAL
  log_level: INFO
  # unit: MB
  max_size: 10
  backup_count: 10
prometheus:
  base_url: "http://localhost:9090/"
  range_api: "/api/v1/query_range"
  step: 5
Start
- Run the following command to start gala-inference.
  # gala-inference
- Use the systemd service to start gala-inference.
  # systemctl start gala-inference
How to Use 
Dependent Software Deployment 
gala-inference has the same runtime dependencies as gala-spider. In addition, gala-inference indirectly depends on gala-spider and gala-anteater being up and running. Deploy gala-spider and gala-anteater in advance.
Modifying gala-inference Configuration Items
Modify some configuration items in the gala-inference configuration file. The following is an example.
Configure the Kafka server address.
kafka:
  server: "localhost:9092"
Configure the Prometheus server address.
prometheus:
  base_url: "http://localhost:9090/"
Configure the IP address of the ArangoDB server.
arangodb:
  url: "http://localhost:8529"
Starting the Service
Run systemctl start gala-inference to start the service. Run systemctl status gala-inference to check the startup status. If the following information is displayed, the startup is successful:
[root@openEuler ~]# systemctl status gala-inference
● gala-inference.service - a-ops gala inference service
     Loaded: loaded (/usr/lib/systemd/system/gala-inference.service; enabled; vendor preset: disabled)
     Active: active (running) since Tue 2022-08-30 17:55:33 CST; 1 day 22h ago
   Main PID: 2445875 (gala-inference)
      Tasks: 10 (limit: 98900)
     Memory: 48.7M
     CGroup: /system.slice/gala-inference.service
             └─2445875 /usr/bin/python3 /usr/bin/gala-inference
Output Example
When the exception detection module gala-anteater detects a KPI exception, it exports the corresponding abnormal KPI event to Kafka. gala-inference continuously listens for abnormal KPI event messages. When it receives one, root cause locating is triggered and the result is exported to Kafka. You can view the root cause locating result on the Kafka server. The basic procedure is as follows:
- If Kafka is installed using the source code, go to the Kafka installation directory.
  cd /root/kafka_2.13-2.8.0
- Run the command for consuming the topic to obtain the output of root cause locating.
  ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic gala_cause_inference
  Output example:
  {
    "Timestamp": 1661853360000,
    "event_id": "1661853360000_1fd37742xxxx_sli_12154_19",
    "Attributes": {
      "event_id": "1661853360000_1fd37742xxxx_sli_12154_19"
    },
    "Resource": {
      "abnormal_kpi": {
        "metric_id": "gala_gopher_sli_rtt_nsec",
        "entity_id": "1fd37742xxxx_sli_12154_19",
        "timestamp": 1661853360000,
        "metric_labels": {
          "machine_id": "1fd37742xxxx",
          "tgid": "12154",
          "conn_fd": "19"
        }
      },
      "cause_metrics": [
        {
          "metric_id": "gala_gopher_proc_write_bytes",
          "entity_id": "1fd37742xxxx_proc_12154",
          "metric_labels": {
            "__name__": "gala_gopher_proc_write_bytes",
            "cmdline": "/opt/redis/redis-server x.x.x.172:3742",
            "comm": "redis-server",
            "container_id": "5a10635e2c43",
            "hostname": "openEuler",
            "instance": "x.x.x.172:8888",
            "job": "prometheus",
            "machine_id": "1fd37742xxxx",
            "pgid": "12154",
            "ppid": "12126",
            "tgid": "12154"
          },
          "timestamp": 1661853360000,
          "path": [
            {
              "metric_id": "gala_gopher_proc_write_bytes",
              "entity_id": "1fd37742xxxx_proc_12154",
              "metric_labels": {
                "__name__": "gala_gopher_proc_write_bytes",
                "cmdline": "/opt/redis/redis-server x.x.x.172:3742",
                "comm": "redis-server",
                "container_id": "5a10635e2c43",
                "hostname": "openEuler",
                "instance": "x.x.x.172:8888",
                "job": "prometheus",
                "machine_id": "1fd37742xxxx",
                "pgid": "12154",
                "ppid": "12126",
                "tgid": "12154"
              },
              "timestamp": 1661853360000
            },
            {
              "metric_id": "gala_gopher_sli_rtt_nsec",
              "entity_id": "1fd37742xxxx_sli_12154_19",
              "metric_labels": {
                "machine_id": "1fd37742xxxx",
                "tgid": "12154",
                "conn_fd": "19"
              },
              "timestamp": 1661853360000
            }
          ]
        }
      ]
    },
    "SeverityText": "WARN",
    "SeverityNumber": 13,
    "Body": "A cause inferring event for an abnormal event"
  }
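A message with the layout of the output example can be reduced to a short summary once it has been read from Kafka. The following sketch parses one message with Python's standard json module and lists each root cause metric together with its propagation path. The function name summarize_cause_event is hypothetical, and the sample message is a trimmed copy of the output example (metric_labels and timestamps omitted) rather than real output.

```python
import json


def summarize_cause_event(message):
    """Extract the abnormal KPI and the propagation path of each cause metric."""
    event = json.loads(message)
    resource = event["Resource"]
    kpi = resource["abnormal_kpi"]
    causes = []
    for cause in resource.get("cause_metrics", []):
        causes.append(
            {
                "metric_id": cause["metric_id"],
                "entity_id": cause["entity_id"],
                # In the output example, the path runs from the cause metric to the abnormal KPI.
                "path": [node["metric_id"] for node in cause.get("path", [])],
            }
        )
    return {
        "abnormal_kpi": kpi["metric_id"],
        "entity_id": kpi["entity_id"],
        "causes": causes,
    }


# Trimmed copy of the documented output example.
SAMPLE_MESSAGE = json.dumps(
    {
        "Timestamp": 1661853360000,
        "Resource": {
            "abnormal_kpi": {
                "metric_id": "gala_gopher_sli_rtt_nsec",
                "entity_id": "1fd37742xxxx_sli_12154_19",
            },
            "cause_metrics": [
                {
                    "metric_id": "gala_gopher_proc_write_bytes",
                    "entity_id": "1fd37742xxxx_proc_12154",
                    "path": [
                        {"metric_id": "gala_gopher_proc_write_bytes"},
                        {"metric_id": "gala_gopher_sli_rtt_nsec"},
                    ],
                }
            ],
        },
    }
)

summary = summarize_cause_event(SAMPLE_MESSAGE)
print("abnormal KPI:", summary["abnormal_kpi"])
for cause in summary["causes"]:
    print("cause:", " -> ".join(cause["path"]))
```

The same function can be applied to each line printed by kafka-console-consumer.sh, provided that every message is emitted as a single-line JSON document.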

