Using gala-gopher

As a data collection module, gala-gopher provides OS-level monitoring capabilities, supports dynamic probe installation and uninstallation, and integrates third-party probes in a non-intrusive manner to quickly expand the monitoring scope.

This chapter describes how to deploy and use the gala-gopher service.

Installation

Mount the repositories.

basic
[oe-2309]      # openEuler 23.09 officially released repository
name=oe2309
baseurl=http://119.3.219.20:82/openEuler:/23.09/standard_x86_64
enabled=1
gpgcheck=0
priority=1

[oe-2309:Epol] # openEuler 23.09: Epol officially released repository
name=oe2309_epol
baseurl=http://119.3.219.20:82/openEuler:/23.09:/Epol/standard_x86_64/
enabled=1
gpgcheck=0
priority=1

Install gala-gopher.

bash
# yum install gala-gopher
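
You can optionally confirm that the package and its probe binaries are in place (the path follows the default installation prefix used throughout this document):

bash
# rpm -q gala-gopher
# ls /opt/gala-gopher/extend_probes/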

Configuration

Configuration Description

The configuration file of gala-gopher is /opt/gala-gopher/gala-gopher.conf. The configuration items in the file are described as follows (the parts that do not need to be manually configured are not described):

The following configurations can be modified as required:

  • global: global configuration for gala-gopher.
    • log_file_name: name of the gala-gopher log file.
    • log_level: gala-gopher log level (currently not enabled).
    • pin_path: path for storing the map shared by the eBPF probe (keep the default configuration).
  • metric: configuration for metric data output.
    • out_channel: output channel for metrics (web_server, logs, or kafka). If empty, the output channel is disabled.
    • kafka_topic: topic configuration for Kafka output.
  • event: configuration for abnormal event output.
    • out_channel: output channel for events (logs or kafka). If empty, the output channel is disabled.
    • kafka_topic: topic configuration for Kafka output.
    • timeout: reporting interval for the same event.
    • desc_language: language for event descriptions (zh_CN or en_US).
  • meta: configuration for metadata output.
    • out_channel: output channel for metadata (logs or kafka). If empty, the output channel is disabled.
    • kafka_topic: topic configuration for Kafka output.
  • ingress: probe data reporting configuration (currently unused).
    • interval: unused.
  • egress: database reporting configuration (currently unused).
    • interval: unused.
    • time_range: unused.
  • imdb: cache configuration.
    • max_tables_num: maximum number of cache tables. Each meta file in /opt/gala-gopher/meta corresponds to a table.
    • max_records_num: maximum records per cache table. Each probe typically generates at least one record per observation period.
    • max_metrics_num: maximum number of metrics per record.
    • record_timeout: cache table aging time (seconds). Records not updated within this time are deleted.
  • web_server: web_server output channel configuration.
    • port: listening port.
  • rest_api_server:
    • port: listening port for the REST API.
    • ssl_auth: enables HTTPS encryption and authentication for the REST API (on or off). Enable in production.
    • private_key: absolute path to the server's private key file for HTTPS encryption (required if ssl_auth is on).
    • cert_file: absolute path to the server's certificate file for HTTPS encryption (required if ssl_auth is on).
    • ca_file: absolute path to the CA certificate for client authentication (required if ssl_auth is on).
  • kafka: Kafka output channel configuration.
    • kafka_broker: IP address and port of the Kafka server.
    • batch_num_messages: number of messages per batch.
    • compression_codec: message compression type.
    • queue_buffering_max_messages: maximum number of messages in the producer buffer.
    • queue_buffering_max_kbytes: maximum size (KB) of the producer buffer.
    • queue_buffering_max_ms: maximum time (ms) the producer waits for more messages before sending a batch.
  • logs: logs output channel configuration.
    • metric_dir: path for metric data logs.
    • event_dir: path for abnormal event logs.
    • meta_dir: path for metadata logs.
    • debug_dir: path for gala-gopher runtime logs.

Configuration File Example

  • Select the data output channels.

    yaml
    metric =
    {
        out_channel = "web_server";
        kafka_topic = "gala_gopher";
    };
    
    event =
    {
        out_channel = "kafka";
        kafka_topic = "gala_gopher_event";
    };
    
    meta =
    {
        out_channel = "kafka";
        kafka_topic = "gala_gopher_metadata";
    };
  • Configure Kafka and Web Server.

    yaml
    web_server =
    {
        port = 8888;
    };
    
    kafka =
    {
        kafka_broker = "<Kafka server IP address>:9092";
    };
  • Select the probe to be enabled. The following is an example.

    yaml
    probes =
    (
        {
            name = "system_infos";
            param = "-t 5 -w /opt/gala-gopher/task_whitelist.conf -l warn -U 80";
            switch = "on";
        },
    );
    extend_probes =
    (
        {
            name = "tcp";
            command = "/opt/gala-gopher/extend_probes/tcpprobe";
            param = "-l warn -c 1 -P 7";
            switch = "on";
        }
    );

Start

After the configuration is complete, start gala-gopher.

bash
# systemctl start gala-gopher.service

Query the status of the gala-gopher service.

bash
# systemctl status gala-gopher.service

If the output shows that the service is active (running), the service has started successfully. Also check whether the enabled probes have started: each enabled probe runs as a thread of the gala-gopher process. If a probe thread does not exist, check the configuration file and the gala-gopher run log.
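
For example, a minimal check (the run log directory follows the debug_dir setting in the logs section and may differ on your system):

bash
# ps -T -C gala-gopher
# ls /var/log/gala-gopher/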

Note: The root permission is required for deploying and running gala-gopher.

How to Use

Deployment of External Dependent Software

gala-gopher relies on several external components: it outputs metric data to Prometheus, and metadata and abnormal events to Kafka. gala-anteater and gala-spider obtain data from Prometheus and Kafka.

Note: Obtain the installation packages of Kafka and Prometheus from the official websites.

REST Dynamic Configuration Interface

The REST server port is configurable (default: 9999; see the rest_api_server configuration item). The URL format is http://[gala-gopher-node-ip-address]:[port]/[function (collection feature)]. For example, the URL for the flamegraph is http://localhost:9999/flamegraph (the following documentation uses the flamegraph as an example).

Configuring the Probe Monitoring Scope

Probes are disabled by default and can be dynamically enabled and configured via the API. Taking the flamegraph as an example, the REST API can be used to enable oncpu, offcpu, and mem flamegraph capabilities. The monitoring scope can be configured based on four dimensions: process ID, process name, container ID, and pod.

Below is an example of an API that simultaneously enables the oncpu and offcpu collection features for the flamegraph:

sh
curl -X PUT http://localhost:9999/flamegraph --data-urlencode json='
{
    "cmd": {
        "bin": "/opt/gala-gopher/extend_probes/stackprobe",
        "check_cmd": "",
        "probe": [
            "oncpu",
            "offcpu"
        ]
    },
    "snoopers": {
        "proc_id": [
            101,
            102
        ],
        "proc_name": [
            {
                "comm": "app1",
                "cmdline": "",
                "debugging_dir": ""
            },
            {
                "comm": "app2",
                "cmdline": "",
                "debugging_dir": ""
            }
        ],
        "pod_id": [
            "pod1",
            "pod2"
        ],
        "container_id": [
            "container1",
            "container2"
        ]
    }
}'

A full description of the collection features is provided below:

| Collection Feature | Description | Sub-item Scope | Monitoring Targets | Startup File | Startup Condition |
| --- | --- | --- | --- | --- | --- |
| flamegraph | Online performance flamegraph observation | oncpu, offcpu, mem | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/stackprobe | NA |
| l7 | Application layer 7 protocol observation | l7_bytes_metrics, l7_rpc_metrics, l7_rpc_trace | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/l7probe | NA |
| tcp | TCP exception and state observation | tcp_abnormal, tcp_rtt, tcp_windows, tcp_rate, tcp_srtt, tcp_sockbuf, tcp_stats, tcp_delay | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/tcpprobe | NA |
| socket | Socket (TCP/UDP) exception observation | tcp_socket, udp_socket | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/endpoint | NA |
| io | Block layer I/O observation | io_trace, io_err, io_count, page_cache | NA | $gala-gopher-dir/ioprobe | NA |
| proc | Process system calls, I/O, DNS, VFS observation | base_metrics, proc_syscall, proc_fs, proc_io, proc_dns, proc_pagecache | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/taskprobe | NA |
| jvm | JVM layer GC, threads, memory, cache observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/jvmprobe | NA |
| ksli | Redis performance SLI (access latency) observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/ksliprobe | NA |
| postgre_sli | PG DB performance SLI (access latency) observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/pgsliprobe | NA |
| opengauss_sli | openGauss access throughput observation | NA | [ip, port, dbname, user, password] | $gala-gopher-dir/pg_stat_probe.py | NA |
| dnsmasq | DNS session observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/rabbitmq_probe.sh | NA |
| lvs | LVS session observation | NA | NA | $gala-gopher-dir/trace_lvs | lsmod \| grep ip_vs \| wc -l |
| nginx | Nginx L4/L7 layer session observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/nginx_probe | NA |
| haproxy | Haproxy L4/L7 layer session observation | NA | proc_id, proc_name, pod_id, container_id | $gala-gopher-dir/trace_haproxy | NA |
| kafka | Kafka producer/consumer topic observation | NA | dev, port | $gala-gopher-dir/kafkaprobe | NA |
| baseinfo | System basic information | cpu, mem, nic, disk, net, fs, proc, host | proc_id, proc_name, pod_id, container_id | system_infos | NA |
| virt | Virtualization management information | NA | NA | virtualized_infos | NA |
| tprofiling | Thread-level performance profiling observation | oncpu, syscall_file, syscall_net, syscall_lock, syscall_sched | proc_id, proc_name | | |
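
The same PUT interface enables any other collection feature in the table. Below is a minimal sketch (sub-item names taken from the tcp row above, snooper format following the flamegraph example) that turns on TCP exception and RTT observation for two process IDs:

sh
curl -X PUT http://localhost:9999/tcp --data-urlencode json='
{
    "cmd": {
        "probe": [
            "tcp_abnormal",
            "tcp_rtt"
        ]
    },
    "snoopers": {
        "proc_id": [
            101,
            102
        ]
    },
    "state": "running"
}'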

Configuring Probe Runtime Parameters

Probes require additional parameter settings during runtime, such as configuring the sampling period and reporting period for flamegraphs.

sh
curl -X PUT http://localhost:9999/flamegraph --data-urlencode json='
{
    "params": {
        "report_period": 180,
        "sample_period": 180,
        "metrics_type": [
            "raw",
            "telemetry"
        ]
    }
}'

Detailed runtime parameters are as follows:

| Parameter | Description | Default & Range | Unit | Supported Monitoring Scope | Supported by gala-gopher |
| --- | --- | --- | --- | --- | --- |
| sample_period | Sampling period | 5000, [100~10000] | ms | io, tcp | Y |
| report_period | Reporting period | 60, [5~600] | s | ALL | Y |
| latency_thr | Latency reporting threshold | 0, [10~100000] | ms | tcp, io, proc, ksli | Y |
| offline_thr | Process offline reporting threshold | 0, [10~100000] | ms | proc | Y |
| drops_thr | Packet loss reporting threshold | 0, [10~100000] | package | tcp, nic | Y |
| res_lower_thr | Resource percentage lower limit | 0%, [0%~100%] | percent | ALL | Y |
| res_upper_thr | Resource percentage upper limit | 0%, [0%~100%] | percent | ALL | Y |
| report_event | Report abnormal events | 0, [0, 1] | NA | ALL | Y |
| metrics_type | Report telemetry metrics | raw, [raw, telemetry] | NA | ALL | N |
| env | Working environment type | node, [node, container, kubenet] | NA | ALL | N |
| report_source_port | Report source port | 0, [0, 1] | NA | tcp | Y |
| l7_protocol | Layer 7 protocol scope | http, [http, pgsql, mysql, redis, kafka, mongo, rocketmq, dns] | NA | l7 | Y |
| support_ssl | Support SSL encrypted protocol observation | 0, [0, 1] | NA | l7 | Y |
| multi_instance | Output separate flamegraphs for each process | 0, [0, 1] | NA | flamegraph | Y |
| native_stack | Display native language stack (for Java processes) | 0, [0, 1] | NA | flamegraph | Y |
| cluster_ip_backend | Perform Cluster IP backend conversion | 0, [0, 1] | NA | tcp, l7 | Y |
| pyroscope_server | Set flamegraph UI server address | localhost:4040 | NA | flamegraph | Y |
| svg_period | Flamegraph SVG file generation period | 180, [30, 600] | s | flamegraph | Y |
| perf_sample_period | Period for collecting stack info in oncpu flamegraph | 10, [10, 1000] | ms | flamegraph | Y |
| svg_dir | Directory for storing flamegraph SVG files | "/var/log/gala-gopher/stacktrace" | NA | flamegraph | Y |
| flame_dir | Directory for storing raw stack info in flamegraphs | "/var/log/gala-gopher/flamegraph" | NA | flamegraph | Y |
| dev_name | Observed network card/disk device name | "" | NA | io, kafka, ksli, postgre_sli, baseinfo, tcp | Y |
| continuous_sampling | Enable continuous sampling | 0, [0, 1] | NA | ksli | Y |
| elf_path | Path to the executable file to observe | "" | NA | nginx, haproxy, dnsmasq | Y |
| kafka_port | Kafka port number to observe | 9092, [1, 65535] | NA | kafka | Y |
| cadvisor_port | Port number for starting cadvisor | 8080, [1, 65535] | NA | cadvisor | Y |
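
These parameters are delivered through the same PUT interface. Below is a minimal sketch (values chosen for illustration, within the ranges listed above) that adjusts the sampling and reporting periods of the tcp collection feature:

sh
curl -X PUT http://localhost:9999/tcp --data-urlencode json='
{
    "params": {
        "sample_period": 1000,
        "report_period": 30
    }
}'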

Starting and Stopping Probes

sh
curl -X PUT http://localhost:9999/flamegraph --data-urlencode json='
{
    "state": "running" // optional: running, stopped
}'
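
A running probe is stopped by putting the stopped state, mirroring the request above:

sh
curl -X PUT http://localhost:9999/flamegraph --data-urlencode json='
{
    "state": "stopped"
}'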

Constraints and Limitations

  1. The interface is stateless. The settings uploaded each time represent the final runtime configuration for the probe, including state, parameters, and monitoring scope.
  2. Monitoring targets can be combined arbitrarily, and the monitoring scope is the union of all specified targets.
  3. The startup file must be valid and accessible.
  4. Collection features can be enabled partially or fully as needed, but disabling a feature requires disabling it entirely.
  5. The monitoring target for opengauss is a DB instance (IP/Port/dbname/user/password).
  6. The interface can receive a maximum of 2048 characters per request.

Querying Probe Configurations and Status

sh
curl -X GET http://localhost:9999/flamegraph
{
    "cmd": {
        "bin": "/opt/gala-gopher/extend_probes/stackprobe",
        "check_cmd": ""
        "probe": [
            "oncpu",
            "offcpu"
        ]
    },
    "snoopers": {
        "proc_id": [
            101,
            102
        ],
        "proc_name": [
            {
                "comm": "app1",
                "cmdline": "",
                "debugging_dir": ""
            },
            {
                "comm": "app2",
                "cmdline": "",
                "debugging_dir": ""
            }
        ],
        "pod_id": [
            "pod1",
            "pod2"
        ],
        "container_id": [
            "container1",
            "container2"
        ]
    },
    "params": {
        "report_period": 180,
        "sample_period": 180,
        "metrics_type": [
            "raw",
            "telemetry"
        ]
    },
    "state": "running"
}

Introduction to stackprobe

A performance flamegraph tool designed for cloud-native environments.

Features

  • Supports observation of applications written in C/C++, Go, Rust, and Java.
  • Call stack supports container and process granularity: For processes within containers, the workload Pod name and container name are marked with [Pod] and [Con] prefixes at the bottom of the call stack. Process names are prefixed with [<pid>], while threads and functions (methods) have no prefix.
  • Supports generating SVG format flamegraphs locally or uploading call stack data to middleware.
  • Supports generating/uploading flamegraphs for multiple instances based on process granularity.
  • For Java processes, flamegraphs can simultaneously display native methods and Java methods.
  • Supports multiple types of flamegraphs, including oncpu, offcpu, and mem.
  • Supports custom sampling periods.

Usage Instructions

Basic startup command example: Start the performance flamegraph with default parameters.

sh
curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": {"probe": ["oncpu"] }, "snoopers": {"proc_name": [{ "comm": "cadvisor"}] }, "state": "running"}'

Advanced startup command example: Start the performance flamegraph with custom parameters. For a complete list of configurable parameters, refer to Configuring Probe Runtime Parameters.

sh
curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": {  "check_cmd": "",  "probe": ["oncpu", "offcpu", "mem"] }, "snoopers": {  "proc_name": [{   "comm": "cadvisor",   "cmdline": "",   "debugging_dir": ""  }, {   "comm": "java",   "cmdline": "",   "debugging_dir": ""  }] }, "params": {  "perf_sample_period": 100,  "svg_period": 300,  "svg_dir": "/var/log/gala-gopher/stacktrace",  "flame_dir": "/var/log/gala-gopher/flamegraph",  "pyroscope_server": "localhost:4040",  "multi_instance": 1,  "native_stack": 0 }, "state": "running"}'

Key configuration options explained:

  • Enabling flamegraph types:

    Set via the probe parameter. Values include oncpu, offcpu, and mem, representing CPU usage time, blocked time, and memory allocation statistics, respectively.

    Example:

    "probe": ["oncpu", "offcpu", "mem"]

  • Setting the period for generating local SVG flamegraphs:

    Configured via the svg_period parameter, in seconds. Default is 180, with an optional range of [30, 600].

    Example:

    "svg_period": 300

  • Enabling/disabling stack information upload to Pyroscope:

    Set via the pyroscope_server parameter. The value must include the address and port. If empty or incorrectly formatted, the probe will not attempt to upload stack information. The upload period is 30 seconds.

    Example:

    "pyroscope_server": "localhost:4040"

  • Setting the call stack sampling period:

    Configured via the perf_sample_period parameter, in milliseconds. Default is 10, with an optional range of [10, 1000]. This parameter only applies to oncpu flamegraphs.

    Example:

    "perf_sample_period": 100

  • Enabling/disabling multi-instance flamegraph generation:

    Set via the multi_instance parameter, with values 0 or 1. Default is 0. A value of 0 merges flamegraphs for all processes, while 1 generates separate flamegraphs for each process.

    Example:

    "multi_instance": 1

  • Enabling/disabling native call stack collection:

    Set via the native_stack parameter, with values 0 or 1. Default is 0. This parameter only applies to Java processes. A value of 0 disables collection of the JVM's native call stack, while 1 enables it.

    Example:

    "native_stack": 1

    When visualized, a flame graph generated with "native_stack": 1 additionally shows the JVM native call stack alongside the Java methods, whereas one generated with "native_stack": 0 shows the Java methods only.

Implementation Plan

1. User-Space Program Logic

The program periodically (every 30 seconds) converts kernel-reported stack information from addresses to symbols using the symbol table. It then uses the flamegraph plugin or pyroscope to generate a flame graph from the symbolized call stack.

The approach to obtaining the symbol table differs based on the code segment type.

  • Kernel Symbol Table: Access /proc/kallsyms.

  • Native Language Symbol Table: Query the process virtual memory mapping file (/proc/{pid}/maps) to retrieve address mappings for each code segment in the process memory. The libelf library is then used to load the symbol table of the corresponding module for each segment.

  • Java Language Symbol Table:

    Since Java methods are not statically mapped to the process virtual address space, alternative methods are used to obtain the symbolized Java call stack.

Method 1: Perf Observation

A JVM agent dynamic library is loaded into the Java process to monitor JVM method compilation and loading events. This allows real-time recording of memory address-to-Java symbol mappings, generating the Java process symbol table. This method requires the Java process to be launched with the -XX:+PreserveFramePointer option. Its key advantage is that the flame graph can display the JVM call stack, and the resulting Java flame graph can be merged with those of other processes for unified visualization.
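
For example, the observed Java application would be launched with this option (the jar name here is only a placeholder):

bash
# java -XX:+PreserveFramePointer -jar app.jar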

Method 2: JFR Observation

The JVM built-in profiler, Java Flight Recorder (JFR), is dynamically enabled to monitor various events and metrics of the Java application. This is accomplished by loading a Java agent into the Java process, which internally calls the JFR API. This method offers the advantage of more precise and comprehensive collection of Java method call stacks.

Both Java performance analysis methods can be loaded in real time (without restarting the Java process) and feature low overhead. When stackprobe startup parameters are configured as "multi_instance": 1 and "native_stack": 0, it uses Method 2 to generate the Java process flame graph; otherwise, it defaults to Method 1.

2. Kernel-Space Program Logic

The kernel-space functionality is implemented using eBPF. Different flame graph types correspond to distinct eBPF programs. These programs periodically or through event triggers traverse the current user-space and kernel-space call stacks, reporting the results to user space.

2.1 On-CPU Flame Graph

A sampling eBPF program is attached to perf software event PERF_COUNT_SW_CPU_CLOCK to periodically sample the call stack.

2.2 Off-CPU Flame Graph

A sampling eBPF program is attached to process scheduling tracepoint sched_switch. This program records the time and process ID when a process is scheduled out and samples the call stack when the process is scheduled back in.

2.3 Memory Flame Graph

A sampling eBPF program is attached to page fault tracepoint page_fault_user. The call stack is sampled whenever this event is triggered.

3. Java Language Support

  • stackprobe main process:

    1. Receives an IPC message to identify the Java process to be observed.
    2. Utilizes the Java agent loading module to inject the JVM agent program into the target Java process: jvm_agent.so (for Method 1) or JstackProbeAgent.jar (for Method 2).
    3. For Method 1, the main process loads the java-symbols.bin file of the corresponding Java process to facilitate address-to-symbol conversion. For Method 2, it loads the stacks-{flame_type}.txt file of the corresponding Java process, which can be directly used to generate flame graphs.
  • Java agent loading module:

    1. Detects a new Java process and copies the JVM agent program to /proc/<pid>/root/tmp in the process space (to ensure visibility to the JVM inside the container during attachment).
    2. Adjusts the ownership of the directory and JVM agent program to match the observed Java process.
    3. Launches the jvm_attach subprocess and passes the relevant parameters of the observed Java process.
  • JVM agent program:

    • jvm_agent.so: Registers JVMTI callback functions.

      When the JVM loads a Java method or dynamically compiles a native method, it triggers the callback function. The callback records the Java class name, method name, and corresponding memory address in /proc/<pid>/root/tmp/java-data-<pid>/java-symbols.bin within the observed Java process space.

    • JstackProbeAgent.jar: Invokes the JFR API.

      Activates JFR for 30 seconds and transforms the JFR statistics into a stack format suitable for flame graphs. The output is saved to /proc/<pid>/root/tmp/java-data-<pid>/stacks-<flame_type>.txt in the observed Java process space. For more information, refer to JstackProbe Introduction.

  • jvm_attach: Dynamically loads the JVM agent program into the JVM of the observed process (based on sun.tools.attach.LinuxVirtualMachine from the JDK source code and the jattach tool).

    1. Configures its own namespace (the JVM requires the attaching process and the observed process to share the same namespace for agent loading).
    2. Verifies if the JVM attach listener is active (by checking for the existence of the UNIX socket file /proc/<pid>/root/tmp/.java_pid<pid>).
    3. If inactive, creates /proc/<pid>/cwd/.attach_pid<pid> and sends a SIGQUIT signal to the JVM.
    4. Establishes a connection to the UNIX socket.
    5. Interprets the response; a value of 0 indicates successful attachment.


Precautions

  • To achieve the best observation results for Java applications, configure the stackprobe startup options to "multi_instance": 1 and "native_stack": 0 to enable JFR observation (JDK8u262+). Otherwise, stackprobe will use perf to generate Java flame graphs. When using perf, ensure that the JVM option -XX:+PreserveFramePointer is enabled (JDK8 or later).

Constraints

  • Supports observation of Java applications based on the HotSpot JVM.

Introduction to tprofiling

tprofiling, a thread-level application performance diagnostic tool provided by gala-gopher, leverages eBPF technology. It monitors key system performance events at the thread level, associating them with detailed event content. This enables real-time recording of thread states and key activities, helping users quickly pinpoint application performance bottlenecks.

Features

From the OS perspective, a running application comprises multiple processes, each containing multiple running threads. tprofiling monitors and records key activities (referred to as events) performed by these threads. The tool then presents these events on a timeline in the front-end interface, providing an intuitive view of what each thread is doing at any given moment—whether it is executing on the CPU or blocked by file or network I/O operations. When performance issues arise, analyzing the sequence of key performance events for a given thread enables rapid problem isolation and localization.

Currently, with its implemented event monitoring capabilities, tprofiling can identify application performance issues such as:

  • File I/O latency and blocking
  • Network I/O latency and blocking
  • Lock contention
  • Deadlocks

As more event types are added and refined, tprofiling will cover a broader range of application performance problems.

Event Observation Scope

tprofiling currently supports two main categories of system performance events: syscall events and on-CPU events.

Syscall Events

Application performance often suffers from system resource bottlenecks like excessive CPU usage or I/O wait times. Applications typically access these resources through syscalls. Observing key syscall events helps identify time-consuming or blocking resource access operations.

The syscall events currently observed by tprofiling are detailed in the Supported Syscall Events section. These events fall into categories such as file operations, network operations, lock operations, and scheduling operations. Examples of observed syscall events include:

  • File Operations
    • read/write: Reading from or writing to disk files or network connections; these operations can be time-consuming or blocking.
    • sync/fsync: Synchronizing file data to disk, which blocks the thread until completion.
  • Network Operations
    • send/recv: Reading from or writing to network connections; these operations can be time-consuming or blocking.
  • Lock Operations
    • futex: A syscall related to user-mode lock implementations. A futex call often indicates lock contention, potentially causing threads to block.
  • Scheduling Operations: These syscall events can change a thread's state, such as yielding the CPU, sleeping, or waiting for other threads.
    • nanosleep: The thread enters a sleep state.
    • epoll_wait: The thread waits for I/O events, blocking until an event arrives.

on-CPU Events

A thread's running state can be categorized as either on-CPU (executing on a CPU core) or off-CPU (not executing). Observing on-CPU events helps identify threads performing time-consuming CPU-bound operations.

Event Content

Thread profiling events include the following information:

  • Event Source: This includes the thread ID, thread name, process ID, process name, container ID, container name, host ID, and host name associated with the event.

    • thread.pid: The thread ID.
    • thread.comm: The thread name.
    • thread.tgid: The process ID.
    • proc.name: The process name.
    • container.id: The container ID.
    • container.name: The container name.
    • host.id: The host ID.
    • host.name: The host name.
  • Event Attributes: These include common attributes and extended attributes.

    • Common Attributes: These include the event name, event type, start time, end time, and duration.

      • event.name: The event name.
      • event.type: The event type, which can be oncpu, file, net, lock, or sched.
      • start_time: The event start time, which is the start time of the first event in an aggregated event. See Aggregated Events for more information.
      • end_time: The event end time, which is the end time of the last event in an aggregated event.
      • duration: The event duration, calculated as (end_time - start_time).
      • count: The number of aggregated events.
    • Extended Attributes: These provide more detailed information specific to each syscall event. For example, read and write events for files or network connections include the file path, network connection details, and function call stack.

      • func.stack: The function call stack.
      • file.path: The file path for file-related events.
      • sock.conn: The TCP connection information for network-related events.
      • futex.op: The futex operation type, which can be wait or wake.

    Refer to the Supported Syscall Events section for details on the extended attributes supported by each event type.

Event Output

As an eBPF probe extension provided by gala-gopher, tprofiling sends generated system events to gala-gopher for processing. gala-gopher then outputs these events in the OpenTelemetry format and publishes them as JSON messages to a Kafka queue. Front-end applications can consume these tprofiling events by subscribing to the Kafka topic.

Here's an example of a thread profiling event output:

json
{
    "Timestamp": 1661088145000,
    "SeverityText": "INFO",
    "SeverityNumber": 9,
    "Body": "",
    "Resource": {
        "host.id": "",
        "host.name": "",
        "thread.pid": 10,
        "thread.tgid": 10,
        "thread.comm": "java",
        "proc.name": "xxx.jar",
        "container.id": "",
        "container.name": "",
    },
    "Attributes": {
        values: [
            {
                // common info
                "event.name": "read",
                "event.type": "file",
                "start_time": 1661088145000,
                "end_time": 1661088146000,
                "duration": 0.1,
                "count": 1,
                // extend info
                "func.stack": "read;",
                "file.path": "/test.txt"
            },
            {
                "event.name": "oncpu",
                "event.type": "oncpu",
                "start_time": 1661088146000,
                "end_time": 1661088147000,
                "duration": 0.1,
                "count": 1,
            }
        ]
    }
}

Key fields:

  • Timestamp: The timestamp when the event was reported.
  • Resource: Information about the event source.
  • Attributes: Event attribute information, containing a values list. Each item in the list represents a tprofiling event from the same source and includes the event's attributes.
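
Because these events go through the Kafka output channel, they can be inspected with a console consumer. The sketch below assumes the broker address and the event topic (gala_gopher_event) from the configuration example earlier; the actual topic depends on your event.kafka_topic setting:

bash
# ./bin/kafka-console-consumer.sh --bootstrap-server x.x.x.165:9092 --topic gala_gopher_event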

Quick Start

Installation

tprofiling is an eBPF probe extension for gala-gopher, so you must first install gala-gopher before enabling tprofiling.

gala-ops provides a demo UI for tprofiling based on Kafka, Logstash, Elasticsearch, and Grafana. You can use the gala-ops deployment tools for quick setup.

Architecture

Software components:

  • Kafka: An open-source message queue that receives and stores tprofiling events collected by gala-gopher.
  • Logstash: A real-time, open-source log collection engine that consumes tprofiling events from Kafka, processes them (filtering, transformation, etc.), and sends them to Elasticsearch.
  • Elasticsearch: An open, distributed search and analytics engine that stores the processed tprofiling events for querying and visualization in Grafana.
  • Grafana: An open-source visualization tool to query and visualize the collected tprofiling events. Users interact with tprofiling through the Grafana UI to analyze application performance.

Deploying the tprofiling Probe

First, install gala-gopher as described in the gala-gopher documentation. Because tprofiling events are sent to Kafka, configure the Kafka service address during deployment.

After installing and running gala-gopher, start the tprofiling probe using gala-gopher's HTTP-based dynamic configuration API:

sh
curl -X PUT http://<gopher-node-ip>:9999/tprofiling -d json='{"cmd": {"probe": ["oncpu", "syscall_file", "syscall_net", "syscall_sched", "syscall_lock"]}, "snoopers": {"proc_name": [{"comm": "java"}]}, "state": "running"}'

Configuration parameters:

  • <gopher-node-ip>: The IP address of the node where gala-gopher is deployed.
  • probe: Under cmd, the probe configuration specifies the system events that the tprofiling probe monitors. oncpu, syscall_file, syscall_net, syscall_sched, and syscall_lock correspond to on-CPU events and file, network, scheduling, and lock syscall events, respectively. You can enable only the desired tprofiling event types.
  • proc_name: Under snoopers, the proc_name configuration filters the processes to monitor by process name. You can also filter by process ID using the proc_id configuration. See REST Dynamic Configuration Interface for details.

To stop the tprofiling probe, run:

sh
curl -X PUT http://<gopher-node-ip>:9999/tprofiling -d json='{"state": "stopped"}'

Deploying the Front-End Software

The tprofiling UI requires Kafka, Logstash, Elasticsearch, and Grafana. Install these components on a management node. You can use the gala-ops deployment tools for quick installation; see the Online Deployment Documentation.

On the management node, obtain the deployment script from the Online Deployment Documentation and run the following command to install Kafka, Logstash, and Elasticsearch with one command:

sh
sh deploy.sh middleware -K <deployment_node_management_IP_address> -E <deployment_node_management_IP_address> -A -p

Run the following command to install Grafana:

sh
sh deploy.sh grafana -P <Prometheus_server_address> -E <Elasticsearch_server_address>

Usage

After completing the deployment, browse to http://[deployment_node_management_IP_address]:3000 and log in to Grafana. The default username and password are both admin.

After logging in, find the ThreadProfiling dashboard.


Click to enter the tprofiling UI and explore its features.


Use Cases

Case 1: Deadlock Detection

(Figure: thread profiling of the deadlock demo process)

The above diagram shows the thread profiling results of a deadlock demo process. The pie chart shows that lock events (in gray) consume a significant portion of the execution time. The lower section displays the thread profiling results for the entire process, with the vertical axis representing the sequence of profiling events for different threads. The java main thread remains blocked. The LockThd1 and LockThd2 service threads execute oncpu and file events, followed by simultaneous, long-duration lock events. Hovering over a lock event reveals that it triggers a futex syscall lasting 60 seconds.


This suggests potential issues with LockThd1 and LockThd2. We can examine their thread profiling results in the thread view.


This view displays the profiling results for each thread, with the vertical axis showing the sequence of events. LockThd1 and LockThd2 normally execute oncpu events along with periodic file and lock events. However, around 10:17:00, they both execute a long futex event without any intervening oncpu events, indicating a blocked state. futex is a syscall related to user-space lock implementation, and its invocation often signals lock contention and potential blocking.

Based on this analysis, a deadlock likely exists between LockThd1 and LockThd2.

Case 2: Lock Contention Detection

(Figure: thread profiling of the lock contention demo process)

The above diagram shows the thread profiling results for a lock contention demo process. The process primarily executes lock, net, and oncpu events, involving three service threads. Between 11:05:45 and 11:06:45, the event execution times for all three threads increase significantly, indicating a potential performance problem. We can examine each thread's profiling results in the thread view, focusing on this period.


By examining the event sequence for each thread, we can understand their activities:

  • Thread CompeteThd1: Periodically triggers short oncpu events, performing a calculation task. However, around 11:05:45, it begins triggering long oncpu events, indicating a time-consuming calculation.


  • Thread CompeteThd2: Periodically triggers short net events. Clicking on an event reveals that the thread is sending network messages via the write syscall, along with the TCP connection details. Similarly, around 11:05:45, it starts executing long futex events and becomes blocked, increasing the interval between write events.


  • Thread tcp-server: A TCP server that continuously reads client requests via the read syscall. Starting around 11:05:45, the read event execution time increases, indicating that it is waiting to receive network requests.


Based on this analysis, whenever CompeteThd1 performs a long oncpu operation, CompeteThd2 calls futex and enters a blocked state. Once CompeteThd1 completes the oncpu operation, CompeteThd2 acquires the CPU and performs the network write operation. This strongly suggests lock contention between CompeteThd1 and CompeteThd2. Because CompeteThd2 is waiting for a lock and cannot send network requests, the tcp-server thread spends most of its time waiting for read requests.

Topics

Supported System Call Events

When selecting system call events for monitoring, consider these principles:

  1. Choose potentially time-consuming or blocking events, such as file, network, or lock operations, as they involve system resource access.
  2. Choose events that affect a thread's running state.

| Event/Syscall Name | Description | Default Type | Extended Content |
| --- | --- | --- | --- |
| read | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| write | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| readv | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| writev | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| preadv | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| pwritev | Reads/writes to disk files or the network; may be time-consuming or blocking. | file | file.path, sock.conn, func.stack |
| sync | Synchronously flushes files to disk; blocks the thread until completion. | file | func.stack |
| fsync | Synchronously flushes files to disk; blocks the thread until completion. | file | file.path, sock.conn, func.stack |
| fdatasync | Synchronously flushes files to disk; blocks the thread until completion. | file | file.path, sock.conn, func.stack |
| sched_yield | Thread voluntarily relinquishes the CPU for rescheduling. | sched | func.stack |
| nanosleep | Thread enters a sleep state. | sched | func.stack |
| clock_nanosleep | Thread enters a sleep state. | sched | func.stack |
| wait4 | Thread blocks. | sched | func.stack |
| waitpid | Thread blocks. | sched | func.stack |
| select | Thread blocks and waits for an event. | sched | func.stack |
| pselect6 | Thread blocks and waits for an event. | sched | func.stack |
| poll | Thread blocks and waits for an event. | sched | func.stack |
| ppoll | Thread blocks and waits for an event. | sched | func.stack |
| epoll_wait | Thread blocks and waits for an event. | sched | func.stack |
| sendto | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| recvfrom | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| sendmsg | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| recvmsg | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| sendmmsg | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| recvmmsg | Reads/writes to the network; may be time-consuming or blocking. | net | sock.conn, func.stack |
| futex | Often indicates lock contention; the thread may block. | lock | futex.op, func.stack |

Aggregated Events

tprofiling currently supports two main categories of system performance events: system call events and oncpu events. In certain scenarios, oncpu events and some system call events (like read and write) can trigger frequently, generating a large volume of system events. This can negatively impact both the performance of the application being observed and the tprofiling probe itself.

To improve performance, tprofiling aggregates multiple system events with the same name from the same thread within a one-second interval into a single reported event. Therefore, a tprofiling event is actually an aggregated event containing one or more identical system events. Some attribute meanings differ between aggregated events and real system events:

  • start_time: The start time of the first system event in the aggregation.
  • end_time: Calculated as start_time + duration.
  • duration: The sum of the actual execution times of all system events in the aggregation.
  • count: The number of system events aggregated. When count is 1, the aggregated event is equivalent to a single system event.
  • Extended event attributes: The extended attributes of the first system event in the aggregation.
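
For example, if a thread issues 50 read calls within one second and each call takes 2 ms, they are reported as a single read event with count = 50, duration = 100 ms, and end_time = start_time + 100 ms.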

Introduction to L7Probe

Purpose: L7 traffic observation, covering common protocols like HTTP1.X, PG, MySQL, Redis, Kafka, HTTP2.0, MongoDB, and RocketMQ. Supports observation of encrypted streams.

Scope: Node, container, and Kubernetes pod environments.

Code Framework Design

text
L7Probe
    | --- included  // Public header files
        | --- connect.h   // L7 connect object definition
        | --- pod.h   // pod/container object definition
        | --- conn_tracker.h   // L7 protocol tracking object definition
    | --- protocol  // L7 protocol parsing
        | --- http   // HTTP1.X L7 message structure definition and parsing
        | --- mysql // mysql L7 message structure definition and parsing
        | --- pgsql // pgsql L7 message structure definition and parsing
    | --- bpf  // Kernel bpf code
        | --- L7.h   // BPF program parses L7 protocol types
        | --- kern_sock.bpf.c   // Kernel socket layer observation
        | --- libssl.bpf.c   // OpenSSL layer observation
        | --- gossl.bpf.c   // Go SSL layer observation
        | --- cgroup.bpf.c   // Pod lifecycle observation
    | --- pod_mng.c   // pod/container instance management (detects pod/container lifecycle)
    | --- conn_mng.c   // L7 Connect instance management (handles BPF observation events, such as Open/Close events, Stats statistics)
    | --- conn_tracker.c   // L7 traffic tracking (tracks data from BPF observation, such as data generated by send/write, read/recv system events)
    | --- bpf_mng.c   // BPF program lifecycle management (dynamically opens, loads, attaches, and unloads BPF programs, including uprobe BPF programs)
    | --- session_conn.c   // Manages JSSE sessions (records the mapping between JSSE sessions and socket connections, and reports JSSE connection information)
    | --- L7Probe.c   // Main probe program

Probe Output

| Metric Name | Table Name | Metric Type | Unit | Metric Description |
| --- | --- | --- | --- | --- |
| tgid | NA | Key | NA | Process ID of the L7 session. |
| client_ip | NA | Key | NA | Client IP address of the L7 session. |
| server_ip | NA | Key | NA | Server IP address of the L7 session. Note: In Kubernetes, Cluster IP addresses can be translated to Backend IP addresses. |
| server_port | NA | Key | NA | Server port of the L7 session. Note: In Kubernetes, Cluster Ports can be translated to Backend Ports. |
| l4_role | NA | Key | NA | Role of the L4 protocol (TCP Client/Server or UDP). |
| l7_role | NA | Key | NA | Role of the L7 protocol (Client or Server). |
| protocol | NA | Key | NA | Name of the L7 protocol (HTTP/HTTP2/MySQL...). |
| ssl | NA | Label | NA | Indicates whether the L7 session uses SSL encryption. |
| bytes_sent | l7_link | Gauge | NA | Number of bytes sent by the L7 session. |
| bytes_recv | l7_link | Gauge | NA | Number of bytes received by the L7 session. |
| segs_sent | l7_link | Gauge | NA | Number of segments sent by the L7 session. |
| segs_recv | l7_link | Gauge | NA | Number of segments received by the L7 session. |
| throughput_req | l7_rpc | Gauge | QPS | Request throughput of the L7 session. |
| throughput_resp | l7_rpc | Gauge | QPS | Response throughput of the L7 session. |
| req_count | l7_rpc | Gauge | NA | Request count of the L7 session. |
| resp_count | l7_rpc | Gauge | NA | Response count of the L7 session. |
| latency_avg | l7_rpc | Gauge | ns | Average latency of the L7 session. |
| latency | l7_rpc | Histogram | ns | Latency histogram of the L7 session. |
| latency_sum | l7_rpc | Gauge | ns | Total latency of the L7 session. |
| err_ratio | l7_rpc | Gauge | % | Error rate of the L7 session. |
| err_count | l7_rpc | Gauge | NA | Error count of the L7 session. |
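
Once the l7 collection feature is enabled, these metrics are exposed through the web_server output channel like any other table. A rough check is sketched below; the port follows the web_server configuration example earlier, and the metrics path and gala_gopher_l7 prefix are assumptions based on the naming pattern shown in the Output Data section at the end of this chapter:

bash
# curl -s http://localhost:8888/metrics | grep gala_gopher_l7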

Dynamic Control

Controlling the Scope of Pod Observation

  1. REST request sent to gala-gopher.
  2. gala-gopher forwards the request to L7Probe.
  3. L7Probe identifies relevant containers based on the Pod information.
  4. L7Probe retrieves the CGroup ID (cpuacct_cgrp_id) of each container and writes it to the object module (using the cgrp_add API).
  5. During socket system event processing, the CGroup (cpuacct_cgrp_id) of the process is obtained, referencing the Linux kernel code (task_cgroup).
  6. Filtering occurs during observation via the object module (using the is_cgrp_exist API).

Controlling Observation Capabilities

  1. REST request sent to gala-gopher.
  2. gala-gopher forwards the request to L7Probe.
  3. L7Probe dynamically enables or disables BPF-based observation features (including throughput, latency, tracing, and protocol type detection) based on the request parameters.
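
Both controls ride on the REST interface described earlier. Below is a minimal sketch (sub-item names from the l7 row of the collection feature table; the pod ID is a placeholder) that enables byte and RPC metrics for a single pod:

sh
curl -X PUT http://localhost:9999/l7 --data-urlencode json='
{
    "cmd": {
        "probe": [
            "l7_bytes_metrics",
            "l7_rpc_metrics"
        ]
    },
    "snoopers": {
        "pod_id": [
            "pod1"
        ]
    },
    "state": "running"
}'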

Observation Points

Kernel Socket System Calls

TCP-related system calls:

c
// int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
// int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
// int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags);
// ssize_t write(int fd, const void *buf, size_t count);
// ssize_t send(int sockfd, const void *buf, size_t len, int flags);
// ssize_t read(int fd, void *buf, size_t count);
// ssize_t recv(int sockfd, void *buf, size_t len, int flags);
// ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
// ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

TCP and UDP-related system calls:

c
// ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen);
// ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags, struct sockaddr *src_addr, socklen_t *addrlen);
// ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
// ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
// int close(int fd);

Important notes:

  1. read/write and readv/writev can be confused with regular file I/O. The kernel function security_socket_sendmsg is observed to determine if a file descriptor (FD) refers to a socket operation.
  2. sendto/recvfrom and sendmsg/recvmsg are used by both TCP and UDP. Refer to the manuals below.
  3. sendmmsg/recvmmsg and sendfile are not currently supported.

sendto manual: If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size.

sendto determines that the protocol is TCP if the dest_addr parameter is NULL; otherwise, it is UDP.

recvfrom manual: The recvfrom() and recvmsg() calls are used to receive messages from a socket, and may be used to receive data on a socket whether or not it is connection-oriented.

recvfrom determines that the protocol is TCP if the src_addr parameter is NULL; otherwise, it is UDP.

sendmsg manual: The sendmsg() function shall send a message through a connection-mode or connectionless-mode socket. If the socket is a connectionless-mode socket, the message shall be sent to the address specified by msghdr if no pre-specified peer address has been set. If a peer address has been pre-specified, either the message shall be sent to the address specified in msghdr (overriding the pre-specified peer address), or the function shall return -1 and set errno to [EISCONN]. If the socket is connection-mode, the destination address in msghdr shall be ignored.

sendmsg determines that the protocol is TCP if msghdr->msg_name is NULL; otherwise, it is UDP.

recvmsg manual: The recvmsg() function shall receive a message from a connection-mode or connectionless-mode socket. It is normally used with connectionless-mode sockets because it permits the application to retrieve the source address of received data.

recvmsg determines that the protocol is TCP if msghdr->msg_name is NULL; otherwise, it is UDP.

libSSL API

SSL_write

SSL_read

Go SSL API

JSSE API

sun/security/ssl/SSLSocketImpl$AppInputStream

sun/security/ssl/SSLSocketImpl$AppOutputStream

JSSE Observation Scheme

Loading the JSSEProbe

The l7_load_jsse_agent function in main loads the JSSEProbe.

It polls processes in the whitelist (g_proc_obj_map_fd). If a process is a Java process, it uses jvm_attach to load JSSEProbeAgent.jar into it. After loading, the Java process outputs observation information to /tmp/java-data-<pid>/jsse-metrics.txt at specific points (see JSSE API).

Processing JSSEProbe Messages

The l7_jsse_msg_handler thread handles JSSEProbe messages.

It polls processes in the whitelist (g_proc_obj_map_fd). If a process has a jsse-metrics output file, it reads the file line by line, then parses, converts, and reports JSSE read/write information.

1. Parsing JSSE Read/Write Information

The jsse-metrics.txt output format is:

text
|jsse_msg|662220|Session(1688648699909|TLS_AES_256_GCM_SHA384)|1688648699989|Write|127.0.0.1|58302|This is test message|

It parses the process ID, session ID, time, read/write operation, IP address, port, and payload.

The parsed information is stored in session_data_args_s.

2. Converting JSSE Read/Write Information

It converts the information in session_data_args_s into sock_conn and conn_data.

This conversion queries two hash maps:

session_head: Records the mapping between the JSSE session ID and the socket connection ID. If the process ID and 4-tuple information match, the session and socket connection are linked.

file_conn_head: Records the last session ID of the Java process, in case L7Probe doesn't start reading from the beginning of a request and can't find the session ID.

3. Reporting JSSE Read/Write Information

It reports sock_conn and conn_data to the map.

sliprobe Introduction

sliprobe uses eBPF to collect and report container-level service-level indicator (SLI) metrics periodically.

Features

  • Collects the total latency and statistical histogram of CPU scheduling events per container. Monitored events include scheduling wait, active sleep, lock/IO blocking, scheduling delay, and long system calls.
  • Collects the total latency and statistical histogram of memory allocation events per container. Monitored events include memory reclamation, swapping, and memory compaction.
  • Collects the total latency and statistical histogram of BIO layer I/O operations per container.

Usage Instructions

Example command to start sliprobe: Specifies a reporting period of 15 seconds and observes SLI metrics for containers abcd12345678 and abcd87654321.

shell
curl -X PUT http://localhost:9999/sli -d json='{"params":{"report_period":15}, "snoopers":{"container_id":["abcd12345678", "abcd87654321"]}, "state":"running"}'

Code Logic

Overview

  1. The user-space application receives a list of containers to monitor and stores the inode of each container's cpuacct subsystem directory in an eBPF map, sharing it with the kernel.
  2. The kernel traces relevant kernel events using eBPF kprobes/tracepoints, determines if the event belongs to a monitored container, and records the event type and timestamp. It aggregates and reports SLI metrics for processes in the same cgroup at regular intervals.
  3. The user-space application receives and prints the SLI metrics reported by the kernel.

How SLI Metrics Are Calculated

CPU SLI
  1. cpu_wait

    At the sched_stat_wait tracepoint, get the delay value (second parameter).

  2. cpu_sleep

    At the sched_stat_sleep tracepoint, get the delay value (second parameter).

  3. cpu_iowait

    At the sched_stat_blocked tracepoint, if the current process is in_iowait, get the delay value (second parameter).

  4. cpu_block

    At the sched_stat_blocked tracepoint, if the current process is not in_iowait, get the delay value (second parameter).

  5. cpu_rundelay

    At the sched_switch tracepoint, get the run_delay value of the next scheduled process (next->sched_info.run_delay) from the third parameter next and store it in task_sched_map. Calculate the difference in run_delay between two scheduling events of the same process.

  6. cpu_longsys

    At the sched_switch tracepoint, get the task structure of the next scheduled process from the third parameter next. Obtain the number of context switches (nvcsw+nivcsw) and user-space execution time (utime) from the task structure. If the number of context switches and user-space execution time remain the same between two scheduling events of the same process, the process is assumed to be executing a long system call. Accumulate the time the process spends in kernel mode.

MEM SLI
  1. mem_reclaim

    Calculate the difference between the return and entry timestamps of the mem_cgroup_handle_over_high function.

    Calculate the difference between the timestamps of the mm_vmscan_memcg_reclaim_end and mm_vmscan_memcg_reclaim_begin tracepoints.

  2. mem_swapin

    Calculate the difference between the return and entry timestamps of the do_swap_page function.

  3. mem_compact

    Calculate the difference between the return and entry timestamps of the try_to_compact_pages function.

IO SLI
  1. bio_latency

    Calculate the timestamp difference between entering the bio_endio function and triggering the block_bio_queue tracepoint.

    Calculate the timestamp difference between entering the bio_endio function and exiting the generic_make_request_checks function.

Output Data

  • Metric

    Prometheus Server has a built-in expression browser UI. You can use PromQL statements to query metric data. For details, see Using the expression browser in the official document. The following is an example.

    If the specified metric is gala_gopher_tcp_link_rcv_rtt, the metric data displayed on the UI is as follows:

    basic
    gala_gopher_tcp_link_rcv_rtt{client_ip="x.x.x.165",client_port="1234",hostname="openEuler",instance="x.x.x.172:8888",job="prometheus",machine_id="1fd3774xx",protocol="2",role="0",server_ip="x.x.x.172",server_port="3742",tgid="1516"} 1
  • Metadata

    You can directly consume data from the Kafka topic gala_gopher_metadata. The following is an example.

    bash
    # Input request
    ./bin/kafka-console-consumer.sh --bootstrap-server x.x.x.165:9092 --topic gala_gopher_metadata
    # Output data
    {"timestamp": 1655888408000, "meta_name": "thread", "entity_name": "thread", "version": "1.0.0", "keys": ["machine_id", "pid"], "labels": ["hostname", "tgid", "comm", "major", "minor"], "metrics": ["fork_count", "task_io_wait_time_us", "task_io_count", "task_io_time_us", "task_hang_count"]}
  • Abnormal events

    You can directly consume data from the Kafka topic gala_gopher_event. The following is an example.

    bash
    # Input request
    ./bin/kafka-console-consumer.sh --bootstrap-server x.x.x.165:9092 --topic gala_gopher_event
    # Output data
    {"timestamp": 1655888408000, "meta_name": "thread", "entity_name": "thread", "version": "1.0.0", "keys": ["machine_id", "pid"], "labels": ["hostname", "tgid", "comm", "major", "minor"], "metrics": ["fork_count", "task_io_wait_time_us", "task_io_count", "task_io_time_us", "task_hang_count"]}