      sysmonitor

      Introduction

      The system monitor (sysmonitor) daemon monitors exceptions that occur while the OS is running and records them in the system log file /var/log/sysmonitor.log. sysmonitor runs as a service. You can run the systemctl start|stop|restart|reload sysmonitor command to start, stop, restart, or reload the service. You are advised to deploy sysmonitor to help locate system exceptions.

      Precautions

      • Only one sysmonitor instance can run at a time.
      • Ensure that all configuration files are valid. Otherwise, the monitoring service may be abnormal.
      • The root privilege is required for sysmonitor service operations, configuration file modification, and log query. The root user has the highest permission in the system. When performing operations as the root user, follow the operation guide to avoid system management and security risks caused by improper operations.

      Configuration Overview

      The sysmonitor configuration file /etc/sysconfig/sysmonitor defines the monitoring period of each monitoring item and specifies whether the item is enabled. Spaces are not allowed between the configuration item, the equal sign (=), and the configuration value, for example, PROCESS_MONITOR="on".

      Configuration description

      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | PROCESS_MONITOR | Whether to enable key process monitoring. The value can be on or off. | No | on |
      | PROCESS_MONITOR_PERIOD | Monitoring period on key processes, in seconds. | No | 3 |
      | PROCESS_RECALL_PERIOD | Interval for attempting to restart a key process after the process fails to be recovered, in minutes. The value can be an integer ranging from 1 to 1440. | No | 1 |
      | PROCESS_RESTART_TIMEOUT | Timeout interval for recovering a key process service from an exception, in seconds. The value can be an integer ranging from 30 to 300. | No | 90 |
      | PROCESS_ALARM_SUPRESS_NUM | Number of alarm suppression times when the key process monitoring configuration uses the alarm command to report alarms. The value is a positive integer. | No | 5 |
      | FILESYSTEM_MONITOR | Whether to enable ext3 and ext4 file system monitoring. The value can be on or off. | No | on |
      | DISK_MONITOR | Whether to enable drive partition monitoring. The value can be on or off. | No | on |
      | DISK_MONITOR_PERIOD | Drive monitoring period, in seconds. | No | 60 |
      | INODE_MONITOR | Whether to enable drive inode monitoring. The value can be on or off. | No | on |
      | INODE_MONITOR_PERIOD | Drive inode monitoring period, in seconds. | No | 60 |
      | NETCARD_MONITOR | Whether to enable NIC monitoring. The value can be on or off. | No | on |
      | FILE_MONITOR | Whether to enable file monitoring. The value can be on or off. | No | on |
      | CPU_MONITOR | Whether to enable CPU monitoring. The value can be on or off. | No | on |
      | MEM_MONITOR | Whether to enable memory monitoring. The value can be on or off. | No | on |
      | PSCNT_MONITOR | Whether to enable process count monitoring. The value can be on or off. | No | on |
      | FDCNT_MONITOR | Whether to enable file descriptor (FD) count monitoring. The value can be on or off. | No | on |
      | CUSTOM_DAEMON_MONITOR | Whether to enable custom daemon item monitoring. The value can be on or off. | No | on |
      | CUSTOM_PERIODIC_MONITOR | Whether to enable custom periodic item monitoring. The value can be on or off. | No | on |
      | IO_DELAY_MONITOR | Whether to enable local drive I/O latency monitoring. The value can be on or off. | No | off |
      | PROCESS_FD_NUM_MONITOR | Whether to enable process FD count monitoring. The value can be on or off. | No | on |
      | PROCESS_MONITOR_DELAY | Whether to wait until all monitoring items are normal when sysmonitor is started. The value can be on or off. | No | on |
      | NET_RATE_LIMIT_BURST | NIC route information printing rate, that is, the number of logs printed per second. Valid range: 0 to 100. | No | 5 |
      | FD_MONITOR_LOG_PATH | FD monitoring log file. | No | /var/log/sysmonitor.log |
      | ZOMBIE_MONITOR | Whether to monitor zombie processes. | No | off |
      | CHECK_THREAD_MONITOR | Whether to enable internal thread self-healing. The value can be on or off. | No | on |
      | CHECK_THREAD_FAILURE_NUM | Number of internal thread self-healing checks in a period. Valid range: 2 to 10. | No | 3 |

      • After modifying the /etc/sysconfig/sysmonitor configuration file, restart the sysmonitor service for the configurations to take effect.
      • If an item is not configured in the configuration file, it is enabled by default.
      • After the internal thread self-healing function is enabled, if a sub-thread of the monitoring item is suspended and the number of checks in a period exceeds the configured value, the sysmonitor service is restarted for restoration. The configuration is reloaded. The configured key process monitoring and customized monitoring are restarted. If this function affects user experience, you can disable it.
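
      The format rule (no spaces around the equal sign) and the value ranges in the table above can be checked before the service is restarted. The following Python sketch is illustrative only and is not part of sysmonitor: the helper name check_line and the RANGES table are hypothetical, and accepting unquoted values and # comment lines is an assumption.

```python
import re

# Integer ranges copied from the configuration table above.
RANGES = {
    "PROCESS_RECALL_PERIOD": (1, 1440),
    "PROCESS_RESTART_TIMEOUT": (30, 300),
    "NET_RATE_LIMIT_BURST": (0, 100),
    "CHECK_THREAD_FAILURE_NUM": (2, 10),
}

# KEY="value" or KEY=value (assumption), with no spaces around the equal sign.
LINE_RE = re.compile(r'^([A-Z_]+)=("?)([^"\s]*)\2$')


def check_line(line):
    """Return None if the line looks valid, else a short error string (hypothetical helper)."""
    line = line.rstrip("\n")
    if not line.strip() or line.lstrip().startswith("#"):
        return None  # assumption: blank lines and comments are ignored
    match = LINE_RE.match(line)
    if match is None:
        return "expected KEY=\"value\" with no spaces around '='"
    key, _, value = match.groups()
    if key.endswith("_MONITOR") and value not in ("on", "off"):
        return f"{key} must be on or off"
    if key in RANGES:
        low, high = RANGES[key]
        if not value.isdigit() or not low <= int(value) <= high:
            return f"{key} must be an integer from {low} to {high}"
    return None


for sample in ('PROCESS_MONITOR="on"', 'PROCESS_MONITOR = "on"', 'PROCESS_RESTART_TIMEOUT="20"'):
    print(sample, "->", check_line(sample) or "ok")
```

      A check like this only catches format and range mistakes; it does not replace restarting the service and reading /var/log/sysmonitor.log.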

      Command Reference

      • Start sysmonitor.
      systemctl start sysmonitor
      
      • Stop sysmonitor.
      systemctl stop sysmonitor
      
      • Restart sysmonitor.
      systemctl restart sysmonitor
      
      • Reload sysmonitor for the modified configurations to take effect.
      systemctl reload sysmonitor
      

      Monitoring Logs

      By default, logs are split and dumped to prevent the sysmonitor.log file from growing too large. Logs are dumped to a drive directory so that a certain number of logs can be retained.

      The configuration file is /etc/rsyslog.d/sysmonitor.conf. Because installing sysmonitor adds this rsyslog configuration file, you need to restart the rsyslog service after installing sysmonitor for the first time so that the sysmonitor log configuration takes effect.

      $template sysmonitorformat,"%TIMESTAMP:::date-rfc3339%|%syslogseverity-text%|%msg%\n"
      
      $outchannel sysmonitor, /var/log/sysmonitor.log, 2097152, /usr/libexec/sysmonitor/sysmonitor_log_dump.sh
      if ($programname == 'sysmonitor' and $syslogseverity <= 6) then {
      :omfile:$sysmonitor;sysmonitorformat
      stop
      }
      
      if ($msg contains 'Time has been changed') then {
      :omfile:$sysmonitor;sysmonitorformat
      stop
      }
      
      if ($programname == 'sysmonitor' and $syslogseverity > 6) then {
      /dev/null
      stop
      }
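
      The sysmonitorformat template above writes each record as timestamp|severity|message. As a rough illustration of how such a line could be read back by a script, the following Python sketch splits one record into its three fields. The function name parse_record and the sample line (including its timestamp) are hypothetical and are not taken from a real log.

```python
from datetime import datetime


def parse_record(line):
    """Split one log line written with the sysmonitorformat template (hypothetical helper)."""
    # Split on the first two '|' only, because the message itself may contain '|'.
    timestamp, severity, message = line.rstrip("\n").split("|", 2)
    return datetime.fromisoformat(timestamp), severity, message


# Hypothetical sample line; the timestamp is made up for illustration.
sample = "2024-01-01T08:00:00.123456+08:00|info|sysmonitor[127]: irqbalance is recovered"
when, severity, message = parse_record(sample)
print(when.year, severity, message)
```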
      

      ext3/ext4 Filesystem Monitoring

      Introduction

      A fault in the filesystem may trigger I/O operation errors, which further cause OS faults. File system fault detection can detect the faults in real time so that system administrators or users can rectify them in a timely manner.

      Configuration File Description

      None

      Exception Logs

      For a file system to which the errors=remount-ro mounting option is added, if the ext3 or ext4 file system is faulty, the following exception information is recorded in the sysmonitor.log file:

      info|sysmonitor[127]: loop0 filesystem error. Remount filesystem read-only.
      

      In other exception scenarios, if the ext3 or ext4 file system is faulty, the following exception information is recorded in the sysmonitor.log file:

      info|sysmonitor[127]: fs_monitor_ext3_4: loop0 filesystem error. flag is 1879113728.
      

      Key Process Monitoring

      Introduction

      Key processes in the system are periodically monitored. When a key process exits abnormally, sysmonitor automatically attempts to recover the key process. If the recovery fails, alarms can be reported. The system administrator can be promptly notified of the abnormal process exit event and whether the process is restarted. Fault locating personnel can locate the time when the process exits abnormally from logs.

      Configuration File Description

      The configuration file directory is /etc/sysmonitor/process. Each process or module corresponds to a configuration file.

      USER=root
      NAME=irqbalance
      RECOVER_COMMAND=systemctl restart irqbalance
      MONITOR_COMMAND=systemctl status irqbalance
      STOP_COMMAND=systemctl stop irqbalance
      

      The configuration items are as follows:

      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | NAME | Process or module name. | Yes | None |
      | RECOVER_COMMAND | Recovery command. | No | None |
      | MONITOR_COMMAND | Monitoring command. If the command returns 0, the process is normal. If the command returns a value greater than 0, the process is abnormal. | No | pgrep -f $(which xxx), where xxx is the process name configured in the NAME field |
      | STOP_COMMAND | Stopping command. | No | None |
      | USER | User name. User for executing the monitoring, recovery, and stopping commands or scripts. | No | If this item is left blank, the root user is used by default. |
      | CHECK_AS_PARAM | Parameter passing switch. If this item is on, the return value of MONITOR_COMMAND is transferred to the RECOVER_COMMAND command or script as an input parameter. If this item is set to off or other values, the function is disabled. | No | None |
      | MONITOR_MODE | Monitoring mode. The value can be parallel or serial. | No | serial |
      | MONITOR_PERIOD | Parallel monitoring period. This item does not take effect when the monitoring mode is serial. | No | 3 |
      | USE_CMD_ALARM | Alarm mode. If this parameter is set to on or ON, alarms are reported using the alarm reporting command. | No | None |
      | ALARM_COMMAND | Alarm reporting command. | No | None |
      | ALARM_RECOVER_COMMAND | Alarm recovery command. | No | None |

      • After modifying the configuration file for monitoring key processes, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • The recovery command and monitoring command must not block. Otherwise, the monitoring thread of the key process becomes abnormal.
      • When the recovery command is executed for more than 90 seconds, the stopping command is executed to stop the process.
      • If the recovery command is empty or not configured, the monitoring command does not attempt to recover the key process when detecting that the key process is abnormal.
      • If a key process is abnormal and fails to be started for three consecutive times, the process is started based on the period specified by PROCESS_RECALL_PERIOD in the global configuration file.
      • If the monitored process is not a daemon process, MONITOR_COMMAND is mandatory.
      • If the configured key service does not exist in the current system, the monitoring does not take effect and the corresponding information is printed in the log. If a fatal error occurs in other configuration items, the default configuration is used and no error is reported.
      • The permission on the configuration file is 600. You are advised to set the monitoring item to the service type of systemd (for example, MONITOR_COMMAND=systemctl status irqbalance). If a process is monitored, ensure that the NAME field is an absolute path.
      • The restart, reload, and stop of sysmonitor do not affect the monitored processes or services.
      • If USE_CMD_ALARM is set to on, you must ensure the validity of ALARM_COMMAND and ALARM_RECOVER_COMMAND. If ALARM_COMMAND or ALARM_RECOVER_COMMAND is empty or not configured, no alarm is reported.
      • The security of user-defined commands, such as the monitoring, recovery, stopping, alarm reporting, and alarm recovery commands, is ensured by users. Commands are executed by the user root. You are advised to set the script command permission to be used only by the user root to prevent privilege escalation for common users.
      • The length of the monitoring command cannot exceed 200 characters. Otherwise, the process monitoring fails to be added.
      • When the recovery command is set to a systemd service restart command (for example, RECOVER_COMMAND=systemctl restart irqbalance), check whether the recovery command conflicts with the open source systemd service recovery mechanism. Otherwise, the behavior of key processes may be affected after exceptions occur.
      • The processes started by the sysmonitor service are in the same cgroup as the sysmonitor service, and resources cannot be restricted separately. Therefore, you are advised to use the open source systemd mechanism to recover the processes.
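
      Several of the rules above (NAME is mandatory, the monitoring command is limited to 200 characters, alarm commands are needed when USE_CMD_ALARM is on) can be checked mechanically before reloading sysmonitor. The following Python sketch is an illustration under assumptions, not sysmonitor's own parser: the function name check_process_config is hypothetical, and treating # lines as comments is an assumption.

```python
KNOWN_KEYS = {
    "USER", "NAME", "RECOVER_COMMAND", "MONITOR_COMMAND", "STOP_COMMAND",
    "CHECK_AS_PARAM", "MONITOR_MODE", "MONITOR_PERIOD",
    "USE_CMD_ALARM", "ALARM_COMMAND", "ALARM_RECOVER_COMMAND",
}


def check_process_config(text):
    """Return a list of problems found in one key process configuration file."""
    items = {}
    problems = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        key, sep, value = line.partition("=")
        if not sep or key not in KNOWN_KEYS:
            problems.append(f"unrecognized line: {line}")
            continue
        items[key] = value
    if not items.get("NAME"):
        problems.append("NAME is mandatory")
    if len(items.get("MONITOR_COMMAND", "")) > 200:
        problems.append("MONITOR_COMMAND is longer than 200 characters")
    if items.get("MONITOR_MODE", "serial") not in ("serial", "parallel"):
        problems.append("MONITOR_MODE must be serial or parallel")
    if items.get("USE_CMD_ALARM", "").lower() == "on":
        for key in ("ALARM_COMMAND", "ALARM_RECOVER_COMMAND"):
            if not items.get(key):
                problems.append(f"{key} is empty, so no alarm is reported")
    return problems


sample = (
    "USER=root\n"
    "NAME=irqbalance\n"
    "RECOVER_COMMAND=systemctl restart irqbalance\n"
    "MONITOR_COMMAND=systemctl status irqbalance\n"
    "STOP_COMMAND=systemctl stop irqbalance\n"
)
print(check_process_config(sample))  # []
```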

      Exception Logs

      • RECOVER_COMMAND configured

        If a process or module exception is detected, the following exception information is recorded in the /var/log/sysmonitor.log file:

        info|sysmonitor[127]: irqbalance is abnormal, check cmd return 1, use "systemctl restart irqbalance" to recover
        

        If the process or module recovers, the following information is recorded in the /var/log/sysmonitor.log file:

        info|sysmonitor[127]: irqbalance is recovered
        
      • RECOVER_COMMAND not configured

        If a process or module exception is detected, the following exception information is recorded in the /var/log/sysmonitor.log file:

        info|sysmonitor[127]: irqbalance is abnormal, check cmd return 1, recover cmd is null, will not recover
        

        If the process or module recovers, the following information is recorded in the /var/log/sysmonitor.log file:

        info|sysmonitor[127]: irqbalance is recovered
        

      File Monitoring

      Introduction

      If key system files are deleted accidentally, the system may run abnormally or even break down. File monitoring reports the deletion of key files and the addition of malicious files in a timely manner, so that administrators and users can rectify faults promptly.

      Configuration File Description

      The configuration file is /etc/sysmonitor/file. Each monitoring configuration item occupies a line. A monitoring configuration item contains the file (directory) and event to be monitored. The file (directory) to be monitored is an absolute path. The file (directory) to be monitored and the event to be monitored are separated by one or more spaces.

      File monitoring configuration items can also be added to files in the /etc/sysmonitor/file.d directory. The configuration method is the same as that of the /etc/sysmonitor/file file.

      • Due to the log length limit, it is recommended that the absolute path of a file or directory be less than 223 characters. Otherwise, the printed logs may be incomplete.

      • Ensure that the path of the monitored file is correct. If the configured file does not exist or the path is incorrect, the file cannot be monitored.

      • Due to the path length limit of the system, the absolute path of the monitored file or directory must be less than 4096 characters.

      • Directories and regular files can be monitored. /proc, /proc/*, /dev, /dev/*, /sys, /sys/*, pipe files, or socket files cannot be monitored.

      • Only deletion events can be monitored in /var/log and /var/log/*.

      • If multiple identical paths exist in the configuration file, the first valid configuration takes effect. In the log file, you can see messages indicating that the identical paths are ignored.

      • Soft links cannot be monitored. When a hard link file deletion event is configured, the event is printed only after the file and all its hard links are deleted.

      • When a monitored event occurs after the file monitoring is successfully added, the monitoring log records the absolute path of the configured file.

      • Currently, directories cannot be monitored recursively. The configured directory is monitored but not its subdirectories.

      • The events to be monitored are configured using bitmaps as follows.

        -------------------------------
        | 11~32   | 10   | 9   |  1~8 | 
        -------------------------------
      

      Each bit in the event bitmap represents an event. If bit n is set to 1, the event corresponding to bit n is monitored. The hexadecimal number corresponding to the monitoring bitmap is the event monitoring item written to the configuration file.

      | Bit | Description | Mandatory |
      |-----|-------------|-----------|
      | 1~8 | Reserved | No |
      | 9 | File or directory addition event | Yes |
      | 10 | File or directory deletion event | Yes |
      | 11~32 | Reserved | No |

      • After modifying the file monitoring configuration file, run systemctl reload sysmonitor. The new configuration takes effect within 60 seconds.
      • Strictly follow the preceding rules to configure events to be monitored. If the configuration is incorrect, the events cannot be monitored. If an event to be monitored in the configuration item is empty, only the deletion event is monitored by default, that is, 0x200.
      • After a file or directory is deleted, the deletion event is reported only when all processes that open the file stop.
      • If a monitored file is modified by vi or sed, "File XXX may have been changed" is recorded in the monitoring log.
      • Currently, file addition and deletion events can be monitored, that is, the ninth and tenth bits take effect. Other bits are reserved and do not take effect. If a reserved bit is configured, the monitoring log displays a message indicating that the event monitoring is incorrectly configured.

      Example

      Monitor the subdirectory addition and deletion events in /home. The lower 12-bit bitmap is 001100000000. The configuration is as follows:

      /home 0x300
      

      Monitor the file deletion events of /etc/ssh/sshd_config. The lower 12-bit bitmap is 001000000000. The configuration is as follows:

      /etc/ssh/sshd_config 0x200
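
      The hexadecimal values in these examples follow directly from the bitmap: bit 9 (addition) is 0x100 and bit 10 (deletion) is 0x200. The following Python sketch shows the arithmetic. The helper name event_mask is hypothetical and is only used for illustration.

```python
# Bit positions are 1-based, as in the event bitmap described above.
ADD_BIT = 9      # file or directory addition
DELETE_BIT = 10  # file or directory deletion


def event_mask(add=False, delete=False):
    """Return the hexadecimal event string for the chosen events (hypothetical helper)."""
    mask = 0
    if add:
        mask |= 1 << (ADD_BIT - 1)
    if delete:
        mask |= 1 << (DELETE_BIT - 1)
    return hex(mask)


print("/home", event_mask(add=True, delete=True))       # /home 0x300
print("/etc/ssh/sshd_config", event_mask(delete=True))  # /etc/ssh/sshd_config 0x200
```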
      

      Exception Logs

      If a configured event occurs to the monitored file, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: 1 events queued
      info|sysmonitor[127]: 1th events handled
      info|sysmonitor[127]: Subfile "111" under "/home" was added.
      

      Drive Partition Monitoring

      Introduction

      The system periodically monitors the drive partitions mounted to the system. When the drive partition usage is greater than or equal to the configured alarm threshold, the system records a drive space alarm. When the drive partition usage falls below the configured alarm recovery threshold, a drive space recovery alarm is recorded.

      Configuration File Description

      The configuration file is /etc/sysmonitor/disk.

      DISK="/var/log"  ALARM="90" RESUME="80"
      DISK="/" ALARM="95" RESUME="85"
      
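      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | DISK | Mount directory | Yes | None |
      | ALARM | Integer indicating the drive space alarm threshold | No | 90 |
      | RESUME | Integer indicating the drive space alarm recovery threshold | No | 80 |
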
      • After modifying the configuration file for drive space monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • If a mount directory is configured repeatedly, the last configuration item takes effect.
      • The value of ALARM must be greater than that of RESUME.
      • Only the mount point or the drive partition of the mount point can be monitored.
      • When the CPU usage and I/O usage are high, the df command execution may time out. As a result, the drive usage cannot be obtained.
      • If a drive partition is mounted to multiple mount points, an alarm is reported for each mount point.
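
      The alarm and recovery thresholds described in this section form a simple hysteresis: an alarm is raised when usage reaches ALARM and is cleared only after usage falls below RESUME. The following Python sketch illustrates that logic with the default thresholds. It is not sysmonitor's implementation, and the function name next_state is hypothetical.

```python
def next_state(in_alarm, used, alarm=90, resume=80):
    """Return (in_alarm, event) for one usage sample, in percent.

    event is "alarm", "recovered", or None when the state does not change.
    """
    if not in_alarm and used >= alarm:
        return True, "alarm"
    if in_alarm and used < resume:
        return False, "recovered"
    return in_alarm, None


state = False
events = []
for used in (70, 90, 85, 79, 60):
    state, event = next_state(state, used)
    events.append(event)
print(events)  # [None, 'alarm', None, 'recovered', None]
```

      Because RESUME is lower than ALARM, usage that hovers around the alarm threshold (85% in the example) does not produce repeated alarm and recovery records.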

      Exception Logs

      If a drive space alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      warning|sysmonitor[127]: report disk alarm, /var/log used:90% alarm:90%
      info|sysmonitor[127]: report disk recovered, /var/log used:4% resume:10%
      

      NIC Status Monitoring

      Introduction

      During system running, the NIC status or IP address may change due to human factors or exceptions. You can monitor the NIC status and IP address changes to detect exceptions in a timely manner and locate exception causes.

      Configuration File Description

      The configuration file is /etc/sysmonitor/network.

      #dev event
      eth1 UP
      

      The following table describes the configuration items.

      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | dev | NIC name | Yes | None |
      | event | Event to be monitored. The value can be UP (the NIC is up), DOWN (the NIC is down), NEWADDR (an IP address is added), or DELADDR (an IP address is deleted). | No | If this item is empty, UP, DOWN, NEWADDR, and DELADDR are monitored. |

      • After modifying the configuration file for NIC monitoring, run systemctl reload sysmonitor for the new configuration to take effect.
      • The UP and DOWN status of virtual NICs cannot be monitored.
      • Ensure that each line in the NIC monitoring configuration file contains less than 4096 characters. Otherwise, a configuration error message will be recorded in the monitoring log.
      • By default, all events of all NICs are monitored. That is, if no NIC monitoring is configured, the UP, DOWN, NEWADDR, and DELADDR events of all NICs are monitored.
      • If a NIC is configured but no event is configured, all events of the NIC are monitored by default.
      • The events of route addition can be recorded five times per second. You can change the number of times by setting NET_RATE_LIMIT_BURST in /etc/sysconfig/sysmonitor.
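
      The defaulting rules above (a NIC without an event is monitored for all events) can be sketched as a small parser. This Python example is illustrative only: the function name parse_network_config is hypothetical, and merging several lines for the same NIC is an assumption that the text above does not confirm.

```python
ALL_EVENTS = ("UP", "DOWN", "NEWADDR", "DELADDR")


def parse_network_config(text):
    """Map each NIC name to the set of events monitored for it (illustrative parser)."""
    monitored = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "#dev event"
        fields = line.split()
        if len(fields) > 2:
            raise ValueError(f"expected 'dev [event]', got: {line}")
        dev = fields[0]
        # A NIC configured without an event is monitored for all events.
        events = fields[1:] or ALL_EVENTS
        for event in events:
            if event not in ALL_EVENTS:
                raise ValueError(f"unknown event {event!r} for {dev}")
        # Assumption: several lines for the same NIC add up.
        monitored.setdefault(dev, set()).update(events)
    return monitored


config = "#dev event\neth1 UP\neth0\n"
print(parse_network_config(config))
```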

      Exception Logs

      If a NIC event is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: lo: ip[::1] prefixlen[128] is added, comm: (ostnamed)[1046], parent comm: systemd[1]
      info|sysmonitor[127]: lo: device is up, comm: (ostnamed)[1046], parent comm: systemd[1]
      

      If a route event is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[881]: Fib4 replace table=255 192.168.122.255/32, comm: daemon-init[1724], parent comm: systemd[1]
      info|sysmonitor[881]: Fib4 replace table=254 192.168.122.0/24, comm: daemon-init[1724], parent comm: systemd[1]
      info|sysmonitor[881]: Fib4 replace table=255 192.168.122.0/32, comm: daemon-init[1724], parent comm: systemd[1]
      info|sysmonitor[881]: Fib6 replace fe80::5054:ff:fef6:b73e/128, comm: kworker/1:3[209], parent comm: kthreadd[2]
      

      CPU Monitoring

      Introduction

      The system monitors the global CPU usage or the CPU usage in a specified domain. When the CPU usage exceeds the configured alarm threshold, the system runs the configured log collection command.

      Configuration File Description

      The configuration file is /etc/sysmonitor/cpu.

      When the global CPU usage of the system is monitored, an example of the configuration file is as follows:

      # cpu usage alarm percent
      ALARM="90"
      
      # cpu usage alarm resume percent
      RESUME="80"
      
      # monitor period (second)
      MONITOR_PERIOD="60"
      
      # stat period (second)
      STAT_PERIOD="300"
      
      # command executed when cpu usage exceeds alarm percent
      REPORT_COMMAND=""
      

      When the CPU usage of a specific domain is monitored, an example of the configuration file is as follows:

      # monitor period (second)
      MONITOR_PERIOD="60"
      
      # stat period (second)
      STAT_PERIOD="300"
      
      DOMAIN="0,1"  ALARM="90" RESUME="80"
      DOMAIN="2,3"  ALARM="50" RESUME="40"
      
      # command executed when cpu usage exceeds alarm percent
      REPORT_COMMAND=""
      
      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | ALARM | Number greater than 0, indicating the CPU usage alarm threshold | No | 90 |
      | RESUME | Number greater than or equal to 0, indicating the CPU usage alarm recovery threshold | No | 80 |
      | MONITOR_PERIOD | Monitoring period, in seconds. The value is greater than 0. | No | 60 |
      | STAT_PERIOD | Statistical period, in seconds. The value is greater than 0. | No | 300 |
      | DOMAIN | CPU IDs in the domain, represented by decimal numbers. CPU IDs can be enumerated and separated by commas, for example, 1,2,3. CPU IDs can be specified as a range in the format X-Y, for example, 0-2. The two representations can be used together, for example, 0,1,2-3 or 0-1,2-3. Spaces or other characters are not allowed. Each monitoring domain has an independent configuration item. Each configuration item supports a maximum of 256 CPUs. A CPU ID must be unique in a domain and across domains. | No | None |
      | REPORT_COMMAND | Command for collecting logs after the CPU usage exceeds the alarm threshold | No | None |

      • After modifying the configuration file for CPU monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • The value of ALARM must be greater than that of RESUME.
      • After the CPU domain monitoring is configured, the global average CPU usage of the system is not monitored, and the separately configured ALARM and RESUME values do not take effect.
      • If the configuration of a monitoring domain is invalid, CPU monitoring is not performed at all.
      • All CPUs configured in DOMAIN must be online. Otherwise, the domain cannot be monitored.
      • The command of REPORT_COMMAND cannot contain insecure characters such as &, ;, and >, and the total length cannot exceed 159 characters. Otherwise, the command cannot be executed.
      • Ensure the security and validity of REPORT_COMMAND. sysmonitor is responsible only for running the command as the root user.
      • REPORT_COMMAND must not block. If the command runs for more than 60 seconds, sysmonitor forcibly stops it.
      • Even if the CPU usage of multiple domains exceeds the threshold in a monitoring period, REPORT_COMMAND is executed only once.
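
      The DOMAIN syntax described above (comma-separated IDs and X-Y ranges, no spaces, at most 256 CPUs, no repeated IDs) can be sketched as follows. This Python example illustrates the rules and is not sysmonitor's parser. The function name parse_domain is hypothetical, and the sketch checks uniqueness only within one domain, not across domains.

```python
import re

# Digits, commas, and X-Y ranges only; spaces and other characters are rejected.
DOMAIN_RE = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def parse_domain(value, max_cpus=256):
    """Expand a DOMAIN value such as "0,1,2-3" into a list of CPU IDs."""
    if DOMAIN_RE.fullmatch(value) is None:
        raise ValueError("only digits, commas, and X-Y ranges are allowed")
    cpus = []
    for part in value.split(","):
        first, _, last = part.partition("-")
        start, end = int(first), int(last or first)
        if start > end or end - start + 1 > max_cpus:
            raise ValueError(f"invalid range {part}")
        cpus.extend(range(start, end + 1))
    if len(set(cpus)) != len(cpus):
        raise ValueError("a CPU ID must be unique in a domain")
    if len(cpus) > max_cpus:
        raise ValueError(f"a domain supports at most {max_cpus} CPUs")
    return cpus


print(parse_domain("0,1,2-3"))  # [0, 1, 2, 3]
```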

      Exception Logs

      If a global CPU usage alarm is detected or cleared and the log collection command is configured, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: CPU usage alarm: 91.3%
      info|sysmonitor[127]: cpu monitor: execute REPORT_COMMAND[sysmoniotrcpu] sucessfully
      info|sysmonitor[127]: CPU usage resume 70.1%
      

      If a domain average CPU usage alarm is detected or cleared and the log collection command is configured, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: CPU 1,2,3 usage alarm: 91.3%
      info|sysmonitor[127]: cpu monitor: execute REPORT_COMMAND[sysmoniotrcpu] sucessfully
      info|sysmonitor[127]: CPU 1,2,3 usage resume 70.1%
      

      Memory Monitoring

      Introduction

      sysmonitor monitors the system memory usage and records logs when the memory usage exceeds the alarm threshold or falls below the recovery threshold.

      Configuration File Description

      The configuration file is /etc/sysmonitor/memory.

      # memory usage alarm percent
      ALARM="90"
      
      # memory usage alarm resume percent
      RESUME="80"
      
      # monitor period(second)
      PERIOD="60"
      

      Configuration Item Description

      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | ALARM | Number greater than 0, indicating the memory usage alarm threshold | No | 90 |
      | RESUME | Number greater than or equal to 0, indicating the memory usage alarm recovery threshold | No | 80 |
      | PERIOD | Monitoring period, in seconds. The value is greater than 0. | No | 60 |

      • After modifying the configuration file for memory monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • The value of ALARM must be greater than that of RESUME.
      • The average memory usage in three monitoring periods is used to determine whether an alarm is reported or cleared.
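
      The three-period averaging rule can be illustrated with a short Python sketch. It is a simplified model under assumptions, not sysmonitor's implementation: the class name MemoryAlarm is hypothetical, waiting for three samples before the first decision is an assumption, and so is the use of strict comparisons against the thresholds.

```python
from collections import deque


class MemoryAlarm:
    """Track the last three usage samples and report threshold crossings."""

    def __init__(self, alarm=90.0, resume=80.0, periods=3):
        self.alarm = alarm
        self.resume = resume
        self.samples = deque(maxlen=periods)
        self.in_alarm = False

    def update(self, used):
        """Add one sample (percent) and return "alarm", "resume", or None."""
        self.samples.append(used)
        if len(self.samples) < self.samples.maxlen:
            return None  # assumption: wait until three samples exist
        average = sum(self.samples) / len(self.samples)
        if not self.in_alarm and average > self.alarm:
            self.in_alarm = True
            return "alarm"
        if self.in_alarm and average < self.resume:
            self.in_alarm = False
            return "resume"
        return None


monitor = MemoryAlarm()
results = [monitor.update(used) for used in (95, 95, 95, 70, 70, 70)]
print(results)  # [None, None, 'alarm', None, 'resume', None]
```

      Averaging delays both the alarm and the recovery: in the example, a single low sample after the alarm is not enough to clear it.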

      Exception Logs

      If a memory alarm is detected, sysmonitor obtains the /proc/meminfo information and prints the information in the /var/log/sysmonitor.log file. The information is as follows:

      info|sysmonitor[127]: memory usage alarm: 90%
      info|sysmonitor[127]:---------------show /proc/meminfo: ---------------
      info|sysmonitor[127]:MemTotal: 3496388 kB
      info|sysmonitor[127]:MemFree: 2738100 kB
      info|sysmonitor[127]:MemAvailable: 2901888 kB
      info|sysmonitor[127]:Buffers: 165064 kB
      info|sysmonitor[127]:Cached: 282360 kB
      info|sysmonitor[127]:SwapCached: 4492 kB
      ......
      info|sysmonitor[127]:---------------show_memory_info end. ---------------
      

      If the following information is printed, sysmonitor runs echo m > /proc/sysrq-trigger to export memory allocation information. You can view the information in /var/log/messages.

      info|sysmonitor[127]: sysrq show memory ifno in message.
      

      When the alarm is recovered, the following information is displayed:

      info|sysmonitor[127]: memory usage resume: 4.6%
      

      Process and Thread Monitoring

      Introduction

      sysmonitor monitors the number of processes and threads. When the total number of processes or threads exceeds the alarm threshold or falls below the recovery threshold, a log is recorded or an alarm is reported.

      Configuration File Description

      The configuration file is /etc/sysmonitor/pscnt.

      # number of processes(include threads) when alarm occur
      ALARM="1600"
      
      # number of processes(include threads) when alarm resume
      RESUME="1500"
      
      # monitor period(second)
      PERIOD="60"
      
      # process count usage alarm percent
      ALARM_RATIO="90"
      
      # process count usage resume percent
      RESUME_RATIO="80"
      
      # print top process info with largest num of threads when threads alarm
      # (range: 0-1024, default: 10, monitor for thread off:0)
      SHOW_TOP_PROC_NUM="10"
      
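      | Item | Description | Mandatory | Default Value |
      |------|-------------|-----------|---------------|
      | ALARM | Integer greater than 0, indicating the process count alarm threshold | No | 1600 |
      | RESUME | Integer greater than or equal to 0, indicating the process count alarm recovery threshold | No | 1500 |
      | PERIOD | Monitoring period, in seconds. The value is greater than 0. | No | 60 |
      | ALARM_RATIO | Number greater than 0 and less than or equal to 100, indicating the process count usage alarm threshold, in percent | No | 90 |
      | RESUME_RATIO | Number greater than 0 and less than or equal to 100, indicating the process count usage alarm recovery threshold, in percent. The value must be less than ALARM_RATIO. | No | 80 |
      | SHOW_TOP_PROC_NUM | Number of processes with the most threads whose information is printed when a thread count alarm is generated. The value ranges from 0 to 1024. 0 disables thread monitoring. | No | 10 |
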
      • After modifying the configuration file for process count monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • The value of ALARM must be greater than that of RESUME.
      • The process count alarm threshold is the larger of ALARM and ALARM_RATIO percent of the value in /proc/sys/kernel/pid_max. The alarm recovery threshold is the larger of RESUME and RESUME_RATIO percent of the value in /proc/sys/kernel/pid_max.
      • The thread count alarm threshold is the larger of ALARM and ALARM_RATIO percent of the value in /proc/sys/kernel/threads-max. The alarm recovery threshold is the larger of RESUME and RESUME_RATIO percent of the value in /proc/sys/kernel/threads-max.
      • The value of SHOW_TOP_PROC_NUM ranges from 0 to 1024. 0 indicates that thread monitoring is disabled. A large value, for example, 1024, affects performance when thread alarms are generated, especially if the alarm threshold is high. You are advised to use the default value 10 or a smaller value. If the performance impact is significant, set this parameter to 0 to disable thread monitoring.
      • The value of PSCNT_MONITOR in /etc/sysconfig/sysmonitor and the value of SHOW_TOP_PROC_NUM in /etc/sysmonitor/pscnt determine whether thread monitoring is enabled.
        • If PSCNT_MONITOR is on and SHOW_TOP_PROC_NUM is set to a valid value, thread monitoring is enabled.
        • If PSCNT_MONITOR is on and SHOW_TOP_PROC_NUM is 0, thread monitoring is disabled.
        • If PSCNT_MONITOR is off, thread monitoring is disabled.
      • When a process count alarm is generated, the system FD usage information and memory information (/proc/meminfo) are printed.
      • When a thread count alarm is generated, the total number of threads, top process information, number of processes in the current environment, number of system FDs, and memory information (/proc/meminfo) are printed.
      • If system resources are insufficient before a monitoring period ends, for example, the thread count exceeds the maximum number allowed, the monitoring cannot run properly due to resource limitation. As a result, the alarm cannot be generated.
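
      The threshold and enablement rules in the notes above can be sketched as follows. This is a minimal illustration, not the sysmonitor implementation: the function names are hypothetical, ALARM_RATIO and RESUME_RATIO are assumed to be percentages of the kernel limit, and treating only 1 to 1024 as an enabling value of SHOW_TOP_PROC_NUM is an assumption.

```python
def effective_threshold(absolute, ratio_percent, system_max):
    """Return the larger of the absolute threshold and ratio_percent of system_max.

    Assumption: the *_RATIO items are percentages of pid_max or threads-max.
    """
    return max(absolute, system_max * ratio_percent / 100)


def thread_monitoring_enabled(pscnt_monitor, show_top_proc_num):
    """Mirror the three PSCNT_MONITOR / SHOW_TOP_PROC_NUM cases listed above.

    Assumption: values outside 1..1024 do not enable thread monitoring; the
    notes do not say how out-of-range values are handled.
    """
    return pscnt_monitor == "on" and 1 <= show_top_proc_num <= 1024


# Hypothetical values: ALARM=1600, ALARM_RATIO=50, pid_max=32768.
print(effective_threshold(1600, 50, 32768))
print(thread_monitoring_enabled("on", 10))
```

      With these hypothetical values, the ratio-based threshold is the larger one, so it would be the effective alarm threshold.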

      Exception Logs

      If a process count alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]:---------------process count alarm start: ---------------
      info|sysmonitor[127]: process count alarm:1657
      info|sysmonitor[127]: process count alarm, show sys fd count: 2592
      info|sysmonitor[127]: process count alarm, show mem info
      info|sysmonitor[127]:---------------show /proc/meminfo: ---------------
      info|sysmonitor[127]:MemTotal: 3496388 kB
      info|sysmonitor[127]:MemFree: 2738100 kB
      info|sysmonitor[127]:MemAvailable: 2901888 kB
      info|sysmonitor[127]:Buffers: 165064 kB
      info|sysmonitor[127]:Cached: 282360 kB
      info|sysmonitor[127]:SwapCached: 4492 kB
      ......
      info|sysmonitor[127]:---------------show_memory_info end. ---------------
      info|sysmonitor[127]:---------------process count alarm end: ---------------
      

      If a process count recovery alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: process count resume: 1200
      

      If a thread count alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]:---------------threads count alarm start: ---------------
      info|sysmonitor[127]:threads count alarm: 273
      info|sysmonitor[127]:open threads most 10 processes is [top1:pid=1756900,openthreadsnum=13,cmd=/usr/bin/sysmonitor --daemon]
      info|sysmonitor[127]:open threads most 10 processes is [top2:pid=3130,openthreadsnum=13,cmd=/usr/lib/gassproxy -D]
      .....
      info|sysmonitor[127]:---------------threads count alarm end. ---------------
      

      System FD Count Monitoring

      Introduction

      Monitors the number of system FDs. A log is recorded when the total number of system FDs exceeds the alarm threshold, and again when it falls below the alarm recovery threshold.

      Configuration File Description

      The configuration file is /etc/sysmonitor/sys_fd_conf.

      # system fd usage alarm percent
      SYS_FD_ALARM="80"
      # system fd usage alarm resume percent
      SYS_FD_RESUME="70"
      # monitor period (second)
      SYS_FD_PERIOD="600"
      

      Configuration items:

      Item | Description | Mandatory | Default Value
      SYS_FD_ALARM | Integer greater than 0 and less than 100. Alarm threshold, expressed as the percentage of the total number of FDs to the maximum number of FDs allowed. | No | 80
      SYS_FD_RESUME | Integer greater than 0 and less than 100. Alarm recovery threshold, expressed as the percentage of the total number of FDs to the maximum number of FDs allowed. | No | 70
      SYS_FD_PERIOD | Integer between 100 and 86400. Monitoring period, in seconds. | No | 600
      • After modifying the configuration file for FD count monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • The value of SYS_FD_ALARM must be greater than that of SYS_FD_RESUME. If the value is invalid, the default value is used and a log is recorded.
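
      The percentage compared against SYS_FD_ALARM and SYS_FD_RESUME can be illustrated with the sketch below. It assumes the usage is derived from /proc/sys/fs/file-nr (allocated handles, unused handles, maximum); the document does not state which source sysmonitor reads, and the function names are hypothetical.

```python
def fd_usage_percent(file_nr_text):
    """Compute FD usage from the text of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated handles, allocated-but-unused
    handles, and the system-wide maximum.
    """
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split()[:3])
    return 100.0 * allocated / maximum


def next_alarm_state(in_alarm, usage, alarm=80, resume=70):
    """Apply the alarm/recovery thresholds with hysteresis.

    Defaults match the SYS_FD_ALARM and SYS_FD_RESUME defaults in the table.
    """
    if not in_alarm and usage > alarm:
        return True       # usage exceeded the alarm threshold
    if in_alarm and usage < resume:
        return False      # usage fell below the recovery threshold
    return in_alarm       # between the thresholds: state is unchanged


usage = fd_usage_percent("8500\t0\t10000\n")  # made-up sample content
print(usage, next_alarm_state(False, usage))
```

      On a live system, the input text would come from reading /proc/sys/fs/file-nr. Keeping the recovery threshold below the alarm threshold prevents the state from flapping when usage hovers near a single value.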

      Exception Logs

      An FD count alarm is recorded in the monitoring logs when detected. The following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: sys fd count alarm: 259296
      

      When a system FD usage alarm is generated, the top three processes that use the most FDs are printed.

      info|sysmonitor[127]:open fd most three processes is:[top1:pid=23233,openfdnum=5000,cmd=/home/openfile]
      info|sysmonitor[127]:open fd most three processes is:[top2:pid=23267,openfdnum=5000,cmd=/home/openfile]
      info|sysmonitor[127]:open fd most three processes is:[top3:pid=30144,openfdnum=5000,cmd=/home/openfile]
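
      When post-processing the log, the top-process lines can be split into fields with a regular expression. The sketch below is inferred only from the sample lines above; the pattern and the function name are assumptions, and the real log format may vary.

```python
import re

# Pattern inferred from the sample log lines; not an official format definition.
TOP_FD_RE = re.compile(
    r"\[top(?P<rank>\d+):pid=(?P<pid>\d+),openfdnum=(?P<count>\d+),cmd=(?P<cmd>.*)\]$"
)


def parse_top_fd_line(line):
    """Return rank, pid, FD count, and command from a top-process log line, or None."""
    match = TOP_FD_RE.search(line.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "pid": int(match.group("pid")),
        "count": int(match.group("count")),
        "cmd": match.group("cmd"),
    }


sample = ("info|sysmonitor[127]:open fd most three processes is:"
          "[top1:pid=23233,openfdnum=5000,cmd=/home/openfile]")
print(parse_top_fd_line(sample))
```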
      

      Drive Inode Monitoring

      Introduction

      Periodically monitors the inodes of mounted drive partitions. When the drive partition inode usage is greater than or equal to the configured alarm threshold, the system records a drive inode alarm. When the drive inode usage falls below the configured alarm recovery threshold, a drive inode recovery alarm is recorded.

      Configuration File Description

      The configuration file is /etc/sysmonitor/inode.

      DISK="/"
      DISK="/var/log"
      
      Item | Description | Mandatory | Default Value
      DISK | Mount directory | Yes | None
      ALARM | Integer indicating the drive inode alarm threshold | No | 90
      RESUME | Integer indicating the drive inode alarm recovery threshold | No | 80
      • After modifying the configuration file for drive inode monitoring, run systemctl reload sysmonitor. The new configuration takes effect after a monitoring period.
      • If a mount directory is configured repeatedly, the last configuration item takes effect.
      • The value of ALARM must be greater than that of RESUME.
      • Only the mount point or the drive partition of the mount point can be monitored.
      • When the CPU usage and I/O usage are high, the df command execution may time out. As a result, the drive inode usage cannot be obtained.
      • If a drive partition is mounted to multiple mount points, an alarm is reported for each mount point.
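
      The inode usage check can be approximated as shown below. The notes above indicate that sysmonitor obtains usage through the df command; this sketch swaps in Python's os.statvfs instead, so it is an illustration of the threshold rule rather than the actual implementation. The function names are hypothetical.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Return the percentage of inodes in use."""
    if total_inodes == 0:
        return 0.0  # some filesystems report no fixed inode count
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def mount_inode_usage(path):
    """Read inode totals for the filesystem that contains path."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)


def inode_alarm_state(in_alarm, usage, alarm=90, resume=80):
    """Alarm when usage >= ALARM; recover when usage < RESUME."""
    if not in_alarm and usage >= alarm:
        return True
    if in_alarm and usage < resume:
        return False
    return in_alarm


print(inode_alarm_state(False, inode_usage_percent(1000, 100)))
```

      Calling mount_inode_usage("/var/log") would return the current usage of that mount point on a POSIX system.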

      Exception Logs

      If a drive inode alarm or recovery alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[4570]:report disk inode alarm, /var/log used:90% alarm:90%
      info|sysmonitor[4570]:report disk inode recovered, /var/log used:79% alarm:80%
      

      Local Drive I/O Latency Monitoring

      Introduction

      Reads the local drive I/O latency data every 5 seconds and collects statistics on 60 groups of data every 5 minutes. If more than 30 groups of data are greater than the configured maximum I/O latency, the system records a log indicating excessive drive I/O latency.

      Configuration File Description

      The configuration file is /etc/sysmonitor/iodelay.

      DELAY_VALUE="500"
      
      Item | Description | Mandatory | Default Value
      DELAY_VALUE | Maximum drive I/O latency | Yes | 500
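
      The evaluation rule described in the introduction (60 samples per 5-minute window, with an alarm if more than 30 samples exceed DELAY_VALUE) can be sketched as follows. The function name and the sample data are hypothetical; how sysmonitor collects each sample is not covered here.

```python
def latency_too_large(samples, delay_value, max_exceeding=30):
    """Return True if more than max_exceeding samples are above delay_value.

    samples is one window of latency readings (60 readings taken every
    5 seconds cover 5 minutes).
    """
    exceeding = sum(1 for sample in samples if sample > delay_value)
    return exceeding > max_exceeding


# Made-up window: 31 slow readings and 29 fast ones, DELAY_VALUE=500.
window = [600] * 31 + [10] * 29
print(latency_too_large(window, 500))
```

      Counting samples over the threshold, rather than averaging them, means that a few extreme spikes alone do not trigger the log; more than half of the window must be slow.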

      Exception Logs

      If a drive I/O latency alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]:local disk sda IO delay is too large, I/O delay threshold is 70.
      info|sysmonitor[127]:disk is sda, io delay data: 71 72 75 87 99 29 78 ......
      

      If a drive I/O latency recovery alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]:local disk sda IO delay is normal, I/O delay threshold is 70.
      info|sysmonitor[127]:disk is sda, io delay data: 11 22 35 8 9 29 38 ......
      

      Zombie Process Monitoring

      Introduction

      Monitors the number of zombie processes in the system. If the number is greater than the alarm threshold, an alarm log is recorded. When the number drops below the recovery threshold, a recovery alarm is reported.

      Configuration File Description

      The configuration file is /etc/sysmonitor/zombie.

      # Ceiling zombie process counts of alarm
      ALARM="500"
      
      # Floor zombie process counts of resume
      RESUME="400"
      
      # Periodic (second)
      PERIOD="600"
      
      Item | Description | Mandatory | Default Value
      ALARM | Number greater than 0, indicating the zombie process count alarm threshold | No | 500
      RESUME | Number greater than or equal to 0, indicating the zombie process count recovery threshold | No | 400
      PERIOD | Monitoring period, in seconds. The value is greater than 0. | No | 60
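
      One conventional way to count zombie processes on Linux is to read the state field of each /proc/<pid>/stat file, where Z marks a zombie. The sketch below shows that approach under the assumption that it is acceptable for illustration; the document does not state how sysmonitor counts zombies, and the function names are hypothetical.

```python
def state_from_stat(stat_text):
    """Extract the process state from the content of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is the first field after the
    last closing parenthesis.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_zombies(stat_texts):
    """Count entries whose state is Z (zombie)."""
    return sum(1 for text in stat_texts if state_from_stat(text) == "Z")


# Made-up stat contents, truncated to the leading fields.
samples = [
    "1 (systemd) S 0 1 1",
    "4321 (my worker) Z 1 4321 4321",
    "4322 (odd (name)) Z 1 4322 4322",
]
print(count_zombies(samples))
```

      The resulting count would then be compared with ALARM and RESUME once per PERIOD.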

      Exception Logs

      If a zombie process count alarm or recovery alarm is detected, the following information is displayed in the /var/log/sysmonitor.log file:

      info|sysmonitor[127]: zombie process count alarm: 600
      info|sysmonitor[127]: zombie process count resume: 100
      

      Custom Monitoring

      Introduction

      You can customize monitoring items. The monitoring framework reads the content of the configuration file, parses the monitoring attributes, and calls the monitoring actions to be performed. The monitoring module provides only the monitoring framework. It is not aware of what users are monitoring or how to monitor, and does not report alarms.

      Configuration File Description

      The configuration files are stored in /etc/sysmonitor.d/. Each process or module corresponds to a configuration file.

      MONITOR_SWITCH="on"
      TYPE="periodic"
      EXECSTART="/usr/sbin/iomonitor_daemon"
      PERIOD="1800"
      
      Item | Description | Mandatory | Default Value
      MONITOR_SWITCH | Monitoring switch | No | off
      TYPE | Custom monitoring item type. daemon: background execution. periodic: periodic execution. | Yes | None
      EXECSTART | Monitoring command | Yes | None
      ENVIROMENTFILE | Environment variable file | No | None
      PERIOD | Monitoring period. Mandatory if the type is periodic. The value is an integer greater than 0. | Yes when the type is periodic | None
      • The absolute path of the configuration file or environment variable file cannot contain more than 127 characters. The environment variable file path cannot be a soft link path.
      • The length of the EXECSTART command cannot exceed 159 characters. No space is allowed in the key field.
      • The periodic monitoring command must not time out. Otherwise, the custom monitoring framework will be affected.
      • Currently, a maximum of 256 environment variables can be configured.
      • The custom monitoring of the daemon type checks whether the reload command is delivered or whether the daemon process exits abnormally every 10 seconds. If the reload command is delivered, the new configuration is loaded 10 seconds later. If a daemon process exits abnormally, the daemon process is restarted 10 seconds later.
      • If the content of the ENVIROMENTFILE file changes, for example, an environment variable is added or the environment variable value changes, you need to restart the sysmonitor service for the new environment variable to take effect.
      • You are advised to set the permission on the configuration files in the /etc/sysmonitor.d/ directory to 600. If EXECSTART is only an executable file, you are advised to set the permission on the executable file to 550.
      • After a daemon process exits abnormally, sysmonitor reloads the configuration file of the daemon process.
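
      Before placing a file in /etc/sysmonitor.d/, the constraints from the table and notes above can be checked with a small script. This is a hedged sketch: the parser and validator are hypothetical helpers, not part of sysmonitor, and they cover only the rules on TYPE, EXECSTART, and PERIOD.

```python
MAX_EXECSTART_LEN = 159  # limit stated in the notes above


def parse_config(text):
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            config[key] = value.strip('"')
    return config


def validate_custom_monitor(config):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    monitor_type = config.get("TYPE")
    if monitor_type not in ("daemon", "periodic"):
        problems.append("TYPE must be daemon or periodic")
    exec_start = config.get("EXECSTART", "")
    if not exec_start:
        problems.append("EXECSTART is mandatory")
    elif len(exec_start) > MAX_EXECSTART_LEN:
        problems.append("EXECSTART exceeds 159 characters")
    if monitor_type == "periodic":
        period = config.get("PERIOD", "")
        if not (period.isascii() and period.isdigit() and int(period) > 0):
            problems.append("PERIOD must be an integer greater than 0")
    return problems


sample = (
    'MONITOR_SWITCH="on"\n'
    'TYPE="periodic"\n'
    'EXECSTART="/usr/sbin/iomonitor_daemon"\n'
    'PERIOD="1800"\n'
)
print(validate_custom_monitor(parse_config(sample)))
```

      The sample text is the configuration example from this section, which passes all three checks.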

      Exception Logs

      If a monitoring item of the daemon type exits abnormally, the /var/log/sysmonitor.log file records the following information:

      info|sysmonitor[127]: custom daemon monitor: child process[11609] name unetwork_alarm exit code[127],[1] times.
      
