CPDS Overview

Introduction

CPDS (Container Problem Detect System), developed by Beijing Linx Software Corp., is a fault detection system for container clusters. It monitors the cluster and identifies top (most common) container faults and sub-health conditions.

Key Features

1. Cluster information collection

The system runs node agents on the host machines and uses systemd, initv, and eBPF to monitor key container services. The agents collect data on node networks, kernels, disk LVM, and other critical metrics. Within containers, they also monitor application status, resource usage, system function execution, and I/O operations to detect anomalies.

2. Cluster exception detection

The system gathers raw data from cluster nodes, applies predefined rules to detect anomalies, and extracts the essential information about each one. It uploads both the detection results and the raw data online and stores them persistently.

3. Node and service container fault/sub-health diagnosis

Using the exception detection data, the system diagnoses faults and sub-health conditions in nodes and service containers. Analysis results are stored persistently, and a UI layer provides access to both real-time and historical diagnosis data.
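The detection and diagnosis flow described in features 2 and 3 can be sketched as follows. This is a minimal illustration, not CPDS's implementation: the `Rule` type, the metric names, and the 100 ms latency limit are all hypothetical, because this overview does not describe the actual rule format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of one raw sample collected from a node.
Sample = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """Hypothetical detection rule (not the real CPDS rule format)."""
    name: str
    severity: str  # "fault" or "sub-health"
    predicate: Callable[[Sample], bool]  # True when the sample violates the rule


def detect(sample: Sample, rules: List[Rule]) -> List[Rule]:
    """Exception detection: return the rules that the raw sample violates."""
    return [rule for rule in rules if rule.predicate(sample)]


def diagnose(violations: List[Rule]) -> str:
    """Diagnosis: collapse violations into one status; a fault outranks sub-health."""
    severities = {rule.severity for rule in violations}
    if "fault" in severities:
        return "fault"
    if "sub-health" in severities:
        return "sub-health"
    return "healthy"


# Illustrative rules only; metric names and limits are assumptions.
RULES = [
    Rule("container_service_down", "fault",
         lambda s: s.get("container_service_up", 1.0) == 0.0),
    Rule("slow_disk_io", "sub-health",
         lambda s: s.get("io_latency_ms", 0.0) > 100.0),
]

sample = {"container_service_up": 1.0, "io_latency_ms": 250.0}
violations = detect(sample, RULES)
print([rule.name for rule in violations], diagnose(violations))
```

Keeping detection (which rules fired) separate from diagnosis (what the node status is) mirrors the split between the two features: the list of violations can be stored alongside the raw data, and the diagnosis step can be rerun later against historical results.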

System Architecture

CPDS comprises the four components listed below. The system follows a microservices architecture, with components interacting via APIs.

cpds-agent: Collects raw data about containers and systems from cluster nodes.
cpds-detector: Analyzes node data based on exception rules to detect abnormalities.
cpds-analyzer: Diagnoses node health using configured rules to assess current status.
cpds-dashboard: Provides a web interface for node health visualization and diagnostic rule configuration.

Supported Fault Detection

CPDS detects the following fault conditions.

No.	Fault Detection Item
1	Container service functionality
2	Container node agent functionality
3	Container group functionality
4	Node health detection functionality
5	Log collection functionality
6	Disk usage exceeding 85%
7	Network issues
8	Kernel crashes
9	Residual LVM disk issues
10	CPU usage exceeding 85%
11	Node monitoring functionality
12	Container memory allocation failures
13	Container memory allocation timeouts
14	Container network response timeouts
15	Slow container disk read/write operations
16	Zombie child processes in container applications
17	Child process and thread creation failures in container applications
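
Items 6 and 10 are simple threshold checks, which can be sketched as below. This is a rough sketch under stated assumptions, not how cpds-agent or cpds-detector actually measures usage: the function names are hypothetical, disk usage is read with Python's standard `shutil.disk_usage`, and CPU usage is assumed to be computed from two (busy, total) tick readings such as those derivable from `/proc/stat` on Linux.

```python
import shutil
from typing import Tuple

THRESHOLD_PERCENT = 85.0  # limit named in detection items 6 and 10


def usage_percent(used: float, total: float) -> float:
    """Convert a used/total pair into a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def exceeds_threshold(percent: float, threshold: float = THRESHOLD_PERCENT) -> bool:
    """'Exceeding 85%' is read here as strictly greater than the threshold."""
    return percent > threshold


def disk_usage_percent(path: str = "/") -> float:
    """Disk usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total)


def cpu_usage_percent(prev: Tuple[int, int], curr: Tuple[int, int]) -> float:
    """CPU usage between two (busy_ticks, total_ticks) readings.

    How the readings are obtained is left to the caller (assumption:
    two reads of /proc/stat taken a short interval apart).
    """
    busy = curr[0] - prev[0]
    total = curr[1] - prev[1]
    return usage_percent(busy, total)
```

For example, `exceeds_threshold(disk_usage_percent("/"))` reports whether the root filesystem is over the limit, and `exceeds_threshold(cpu_usage_percent(prev, curr))` does the same for the CPU over the sampled interval.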
