
github.com/kubernetes/node-problem-detector @v1.35.2 sqlite


583 symbols 1,763 edges 149 files 250 documented · 43%

README

node-problem-detector

node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver. node-problem-detector can run either as a DaemonSet or standalone. It currently runs as a Kubernetes Addon enabled by default in GKE clusters. It is also enabled by default in AKS as part of the AKS Linux Extension.

Background

There are tons of node problems that could possibly affect the pods running on the node, such as:
* Infrastructure daemon issues: ntp service down;
* Hardware issues: Bad CPU, memory or disk;
* Kernel issues: Kernel deadlock, corrupted file system;
* Container runtime issues: Unresponsive runtime daemon;
* ...

Currently, these problems are invisible to the upstream layers in the cluster management stack, so Kubernetes will continue scheduling pods to the bad nodes.

To solve this problem, we introduced a new daemon, node-problem-detector, to collect node problems from various daemons and make them visible to the upstream layers. Once upstream layers have visibility into those problems, we can discuss the remedy system.

Problem API

node-problem-detector uses Event and NodeCondition to report problems to the apiserver.
* NodeCondition: A permanent problem that makes the node unavailable for pods should be reported as a NodeCondition.
* Event: A temporary problem that has limited impact on pods but is informative should be reported as an Event.
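
The rule above can be sketched in Go as follows. This is only an illustration: the Problem type, its Permanent field and the reportKind function are hypothetical names that are not taken from the node-problem-detector source, and the example problems are illustrative.

```go
package main

import "fmt"

// Problem is a hypothetical, simplified description of a detected node problem.
type Problem struct {
	Reason    string
	Message   string
	Permanent bool // true if the node stays unavailable for pods until it is fixed
}

// reportKind returns which API object the problem would be reported as,
// following the rule stated above.
func reportKind(p Problem) string {
	if p.Permanent {
		return "NodeCondition"
	}
	return "Event"
}

func main() {
	problems := []Problem{
		{Reason: "KernelDeadlock", Message: "kernel task is blocked", Permanent: true},
		{Reason: "TemporaryGlitch", Message: "a short-lived problem", Permanent: false},
	}
	for _, p := range problems {
		fmt.Printf("%s -> %s\n", p.Reason, reportKind(p))
	}
}
```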

Problem Daemon

A problem daemon is a sub-daemon of node-problem-detector. It monitors specific kinds of node problems and reports them to node-problem-detector.

A problem daemon could be:
* A tiny daemon designed for dedicated Kubernetes use-cases.
* An existing node health monitoring daemon integrated with node-problem-detector.

Currently, a problem daemon is running as a goroutine in the node-problem-detector binary. In the future, we'll separate node-problem-detector and problem daemons into different containers, and compose them with pod specification.
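
A minimal sketch of the goroutine model described above, assuming each daemon reports over a shared channel. The Status type and the runDaemon and collect functions are hypothetical and simplified; the real interfaces in pkg/types/types.go may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical, simplified report sent by a problem daemon.
type Status struct {
	Source  string
	Message string
}

// runDaemon simulates a problem daemon: it runs in its own goroutine and
// sends each finding to the channel read by the detector.
func runDaemon(source string, findings []string, out chan<- Status, wg *sync.WaitGroup) {
	go func() {
		defer wg.Done()
		for _, f := range findings {
			out <- Status{Source: source, Message: f}
		}
	}()
}

// collect starts all daemons and gathers everything they report.
func collect(daemons map[string][]string) []Status {
	out := make(chan Status)
	var wg sync.WaitGroup
	for source, findings := range daemons {
		wg.Add(1)
		runDaemon(source, findings, out, &wg)
	}
	// Close the channel once every daemon goroutine has finished.
	go func() {
		wg.Wait()
		close(out)
	}()
	var all []Status
	for s := range out {
		all = append(all, s)
	}
	return all
}

func main() {
	statuses := collect(map[string][]string{
		"kernel-monitor": {"kernel deadlock detected"},
		"ntp-plugin":     {"ntp service is down"},
	})
	fmt.Println(len(statuses), "statuses collected")
}
```

The order in which statuses arrive is not deterministic, because the daemons run concurrently.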

Each category of problem daemon can be disabled at compilation time by setting corresponding build tags. If they are disabled at compilation time, then all their build dependencies, global variables and background goroutines will be trimmed out of the compiled executable.
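
As a rough illustration of how a Go build tag removes code at compile time, the file below is only compiled when a tag is absent. The tag name disable_example_monitor and the registry variable are hypothetical; the real tags are listed in the table that follows.

```go
//go:build !disable_example_monitor

// This file is compiled only when the hypothetical build tag
// "disable_example_monitor" is NOT set. Building with
//   go build -tags disable_example_monitor
// skips the whole file, including its globals and init function.
package main

import "fmt"

// registry lists the problem daemon types compiled into this binary.
var registry = map[string]bool{}

// init registers the example monitor; it vanishes together with the file
// when the tag is set.
func init() {
	registry["example-monitor"] = true
}

func main() {
	fmt.Println("compiled-in monitors:", registry)
}
```

In a real project the main function would live in a separate, unconstrained file so that the binary still builds when the tag is set.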

List of supported problem daemon types:

Problem Daemon Type	NodeCondition	Description	Configs	Disabling Build Tag
SystemLogMonitor	KernelDeadlock, ReadonlyFilesystem, FrequentKubeletRestart, FrequentDockerRestart, FrequentContainerdRestart	A system log monitor monitors the system log and reports problems and metrics according to predefined rules.	filelog, kmsg, kernel, abrt, systemd	disable_system_log_monitor
SystemStatsMonitor	None (could be added in the future)	A system stats monitor for node-problem-detector to collect various health-related system stats as metrics. See the proposal here.	system-stats-monitor	disable_system_stats_monitor
CustomPluginMonitor	On-demand (according to the user's configuration), existing example: NTPProblem	A custom plugin monitor for node-problem-detector to invoke and check various node problems with user-defined check scripts. See the proposal here.	example	disable_custom_plugin_monitor
HealthChecker	KubeletUnhealthy, ContainerRuntimeUnhealthy	A health checker for node-problem-detector to check kubelet and container runtime health.	kubelet, docker, containerd

Exporter

An exporter is a component of node-problem-detector. It reports node problems and/or metrics to certain backends. Some of them can be disabled at compile-time using a build tag. List of supported exporters:

Exporter	Description	Disabling Build Tag
Kubernetes exporter	Kubernetes exporter reports node problems to Kubernetes API server: temporary problems get reported as Events, and permanent problems get reported as Node Conditions.
Prometheus exporter	Prometheus exporter reports node problems and metrics locally as Prometheus metrics.
Stackdriver exporter	Stackdriver exporter reports node problems and metrics to Stackdriver Monitoring API.	disable_stackdriver_exporter
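
The exporter idea can be sketched as a small interface that every backend implements, with each status forwarded to all enabled exporters. The types below are a simplified, hypothetical model for illustration and should not be read as the exact definitions in pkg/types/types.go.

```go
package main

import "fmt"

// Status is a hypothetical, simplified problem report.
type Status struct {
	Source string
	Reason string
}

// Exporter is an illustrative interface: each backend receives every status.
type Exporter interface {
	ExportProblems(s Status)
}

// memoryExporter stands in for a real backend and records what it receives.
type memoryExporter struct {
	name     string
	received []Status
}

// ExportProblems stores the status instead of sending it anywhere.
func (m *memoryExporter) ExportProblems(s Status) {
	m.received = append(m.received, s)
}

// exportAll forwards one status to every enabled exporter.
func exportAll(exporters []Exporter, s Status) {
	for _, e := range exporters {
		e.ExportProblems(s)
	}
}

func main() {
	k8s := &memoryExporter{name: "kubernetes"}
	prom := &memoryExporter{name: "prometheus"}
	exportAll([]Exporter{k8s, prom}, Status{Source: "kernel-monitor", Reason: "KernelDeadlock"})
	fmt.Println(k8s.name, len(k8s.received), prom.name, len(prom.received))
}
```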

Usage

Flags

--version: Print current version of node-problem-detector.
--hostname-override: A customized node name used for node-problem-detector to update conditions and emit events. node-problem-detector gets the node name first from hostname-override, then from the NODE_NAME environment variable, and finally falls back to os.Hostname.
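
The lookup order for the node name can be sketched as below. The resolveNodeName function is a hypothetical helper written for this illustration, not the function used in the repository; the lookups are passed in as parameters so the order can be exercised without changing the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// resolveNodeName picks the node name in the documented order:
// the override value, then the NODE_NAME environment variable, then the hostname.
func resolveNodeName(override string, getenv func(string) string, hostname func() (string, error)) (string, error) {
	if override != "" {
		return override, nil
	}
	if name := getenv("NODE_NAME"); name != "" {
		return name, nil
	}
	return hostname()
}

func main() {
	// An empty override means the flag was not set.
	name, err := resolveNodeName("", os.Getenv, os.Hostname)
	if err != nil {
		fmt.Println("cannot determine node name:", err)
		return
	}
	fmt.Println("node name:", name)
}
```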

For System Log Monitor

--config.system-log-monitor: List of paths to system log monitor configuration files, comma-separated, e.g. config/kernel-monitor.json. Node problem detector will start a separate log monitor for each configuration. You can use different log monitors to monitor different system logs.

For System Stats Monitor

--config.system-stats-monitor: List of paths to system stats monitor config files, comma-separated, e.g. config/system-stats-monitor.json. Node problem detector will start a separate system stats monitor for each configuration. You can use different system stats monitors to monitor different problem-related system stats.

For Custom Plugin Monitor

--config.custom-plugin-monitor: List of paths to custom plugin monitor config files, comma-separated, e.g. config/custom-plugin-monitor.json. Node problem detector will start a separate custom plugin monitor for each configuration. You can use different custom plugin monitors to monitor different node problems.

For Health Checkers

Health checkers are configured as custom plugins, using the config/health-checker-*.json config files.

For Kubernetes exporter

--enable-k8s-exporter: Enables reporting to the Kubernetes API server, defaults to true.
--apiserver-override: A URI parameter used to customize how node-problem-detector connects to the apiserver. This is ignored if --enable-k8s-exporter is false. The format is the same as the source flag of Heapster. For example, to run without auth, use the following config: http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false Refer to the Heapster docs for a complete list of available options.
--address: The address to bind the node problem detector server.
--port: The port to bind the node problem detector server. Use 0 to disable.
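
To show how an --apiserver-override value breaks down, the sketch below parses one with the Go standard library. The parseOverride function is hypothetical, the address is a placeholder, and treating a missing inClusterConfig option as true is an assumption of this sketch rather than something stated above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseOverride splits an override URI into the server address and the
// inClusterConfig option.
func parseOverride(raw string) (string, bool, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", false, err
	}
	inCluster := true // assumed default when the option is absent
	if v := u.Query().Get("inClusterConfig"); v != "" {
		inCluster, err = strconv.ParseBool(v)
		if err != nil {
			return "", false, err
		}
	}
	return u.Host, inCluster, nil
}

func main() {
	host, inCluster, err := parseOverride("http://10.0.0.1:8080?inClusterConfig=false")
	if err != nil {
		fmt.Println("invalid override:", err)
		return
	}
	fmt.Println("apiserver:", host, "inClusterConfig:", inCluster)
}
```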

For Prometheus exporter

--prometheus-address: The address to bind the Prometheus scrape endpoint, defaults to 127.0.0.1.
--prometheus-port: The port to bind the Prometheus scrape endpoint, defaults to 20257. Use 0 to disable.

For Stackdriver exporter

--exporter.stackdriver: Path to a Stackdriver exporter config file, e.g. config/exporter/stackdriver-exporter.json, defaults to empty string. Set to empty string to disable.

Deprecated Flags

--system-log-monitors: List of paths to system log monitor config files, comma-separated. This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.
--custom-plugin-monitors: List of paths to custom plugin monitor config files, comma-separated. This option is deprecated, replaced by --config.custom-plugin-monitor, and will be removed. NPD will panic if both --custom-plugin-monitors and --config.custom-plugin-monitor are set.
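
The interaction between a deprecated flag and its replacement, together with the comma-separated list format, can be sketched as follows. The mergeMonitorConfigs function is hypothetical, the second file path is made up for the example, and returning an error instead of panicking is a choice of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// mergeMonitorConfigs combines the values of a deprecated flag and its
// replacement. Setting both is rejected; otherwise the comma-separated
// value is split into one config path per monitor.
func mergeMonitorConfigs(deprecated, current string) ([]string, error) {
	if deprecated != "" && current != "" {
		return nil, errors.New("deprecated and current flag cannot both be set")
	}
	value := current
	if value == "" {
		value = deprecated
	}
	if value == "" {
		return nil, nil
	}
	return strings.Split(value, ","), nil
}

func main() {
	// The second path is a made-up file name used only for illustration.
	paths, err := mergeMonitorConfigs("", "config/kernel-monitor.json,config/other-monitor.json")
	if err != nil {
		fmt.Println("invalid flags:", err)
		return
	}
	for _, p := range paths {
		fmt.Println("would start one monitor for", p)
	}
}
```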

Build Image

1. Install development dependencies for libsystemd and the ARM GCC toolchain.
   * Debian/Ubuntu: apt install libsystemd-dev gcc-aarch64-linux-gnu
2. git clone git@github.com:kubernetes/node-problem-detector.git
3. Run make in the top directory. It will:
   * Build the binary.
   * Build the docker image. The binary and config/ are copied into the docker image.

If you do not need certain categories of problem daemons, you could choose to disable them at compilation time. This is the best way of keeping your node-problem-detector runtime compact without unnecessary code (e.g. global variables, goroutines, etc). You can do so via setting the BUILD_TAGS environment variable before running make. For example:

BUILD_TAGS="disable_custom_plugin_monitor disable_system_stats_monitor" make

The above command compiles node-problem-detector without the Custom Plugin Monitor and the System Stats Monitor. Check out the Problem Daemon section to see how to disable each problem daemon at compilation time.

Push Image

make push uploads the docker image to a registry. By default, the image will be uploaded to staging-k8s.gcr.io. It's easy to modify the Makefile to push the image to another registry.

Installation

The easiest way to install node-problem-detector into your cluster is to use the Helm chart:

helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector

Alternatively, to install node-problem-detector manually:

1. Edit node-problem-detector.yaml to fit your environment. Set log volume to your system log directory (used by SystemLogMonitor). You can use a ConfigMap to overwrite the config directory inside the pod.
2. Edit node-problem-detector-config.yaml to configure node-problem-detector.
3. Edit rbac.yaml to fit your environment.
4. Create the ServiceAccount and ClusterRoleBinding with kubectl create -f rbac.yaml.
5. Create the ConfigMap with kubectl create -f node-problem-detector-config.yaml.
6. Create the DaemonSet with kubectl create -f node-problem-detector.yaml.

Start Standalone

To run node-problem-detector standalone, you should set inClusterConfig to false and teach node-problem-detector how to access apiserver with apiserver-override.

To run node-problem-detector standalone with an insecure apiserver connection:

node-problem-detector --apiserver-override=http://APISERVER_IP:APISERVER_INSECURE_PORT?inClusterConfig=false

For more scenarios, see here

Windows

Node Problem Detector has preliminary support for Windows. Most of the functionality has not been tested, but the filelog plugin works.

Follow Issue #461 for development status of Windows support.

Development

To develop NPD on Window

Extension points: exported contracts (how you extend this code)

Exporter (Interface)

Exporter exports machine health data to certain control plane. [4 implementers]

pkg/types/types.go

LogWatcher (Interface)

LogWatcher is the interface of a log watcher. [4 implementers]

pkg/systemlogmonitor/logwatchers/types/log_watcher.go

Int64MetricInterface (Interface)

Int64MetricInterface is used to create test double for Int64Metric. [3 implementers]

pkg/util/metrics/fakes.go

ProblemDetector (Interface)

ProblemDetector collects statuses from all problem daemons, updates the node conditions, and sends node events. [2 implementers]

pkg/problemdetector/problem_detector.go

Client (Interface)

Client is the interface of problem client [2 implementers]

pkg/exporters/k8sexporter/problemclient/problem_client.go

LogCounter (Interface)

LogCounter is the interface for a log counter. [1 implementer]

pkg/logcounter/types/types.go

HealthChecker (Interface)

(no doc) [1 implementer]

pkg/healthchecker/types/types.go

Monitor (Interface)

Monitor monitors the system and reports problems and metrics according to the rules. [3 implementers]

pkg/types/types.go

Core symbols: most depended-on inside this repo

Record

called by 70

pkg/util/metrics/fakes.go

String

called by 34

pkg/systemlogmonitor/log_buffer.go

Run

called by 31

pkg/problemdetector/problem_detector.go

NewInt64Metric

called by 21

pkg/util/metrics/metric_int64.go

called by 20

test/e2e/lib/ssh/lib.go

Done

called by 18

pkg/util/tomb/tomb.go

collect

called by 17

pkg/systemstatsmonitor/net_collector.go

mustRegisterMetric

called by 16

pkg/systemstatsmonitor/net_collector.go

Shape

Function 283

Method 188

Struct 87

Interface 14

TypeAlias 9

FuncType 2

Languages

Go 100%

Modules by API surface

test/e2e/lib/ssh/lib.go: 39 symbols

test/e2e/lib/ssh/lib_test.go: 20 symbols

pkg/types/types.go: 18 symbols

pkg/exporters/k8sexporter/problemclient/problem_client.go: 14 symbols

pkg/exporters/k8sexporter/condition/manager.go: 14 symbols

pkg/systemstatsmonitor/types/config.go: 13 symbols

pkg/systemstatsmonitor/net_collector.go: 13 symbols

pkg/systemlogmonitor/log_monitor.go: 12 symbols

pkg/custompluginmonitor/custom_plugin_monitor.go: 12 symbols

pkg/systemlogmonitor/log_buffer.go: 11 symbols

pkg/healthchecker/types/types.go: 11 symbols

pkg/exporters/stackdriver/stackdriver_exporter.go: 9 symbols

Dependencies: from manifests, versioned

cloud.google.com/go/auth v0.17.0 · 1×

cloud.google.com/go/auth/oauth2adapt v0.2.8 · 1×

cloud.google.com/go/compute/metadata v0.9.0 · 1×

cloud.google.com/go/monitoring v1.24.2 · 1×

cloud.google.com/go/trace v1.11.6 · 1×

contrib.go.opencensus.io/exporter/prometheus v0.4.2 · 1×

contrib.go.opencensus.io/exporter/stackdriver v0.13.14 · 1×

github.com/acobaugh/osrelease v0.1.0 · 1×

github.com/avast/retry-go/v4 v4.7.0 · 1×

github.com/aws/aws-sdk-go v1.44.72 · 1×

github.com/beorn7/perks v1.0.1 · 1×

github.com/blang/semver/v4 v4.0.0 · 1×

For agents

$ claude mcp add node-problem-detector \
  -- python -m otcore.mcp_server <graph>
