
Lab 45: System Monitoring Commands

Diagnose degraded server performance using a fast triage workflow across disk, kernel logs, memory, I/O, network sockets, and live processes. Identify signals of resource exhaustion versus hardware failure and capture evidence for escalation.

troubleshooting core storage

Scenario

A user reports the server has become slow and unreliable. You are on-call and need to assess system health quickly. Start with disk utilization, then check kernel messages for hardware warnings, validate memory pressure, review I/O signals, confirm key listeners, and finish with a live process view.

Operator context

This is the initial triage pass that helps you decide whether you are dealing with resource exhaustion, a noisy process, or a hardware issue that needs escalation.

Objective

  • Check disk usage to identify capacity risk on mounted file systems.
  • Inspect recent kernel messages for hardware and filesystem warnings.
  • Verify current memory and swap usage to assess memory pressure.
  • Review I/O wait and disk throughput indicators.
  • Confirm listening sockets to validate service exposure.
  • Use a live monitor to identify top CPU and memory consumers.

Concepts

  • Capacity triage and mountpoint risk using df -h.
  • Kernel log triage for I/O errors and filesystem state using dmesg.
  • Memory pressure signals and swap activity using free -h.
  • CPU versus storage bottlenecks and %iowait using iostat -x.
  • Listener validation and exposure checks using ss -tuln.
  • Live process triage and “who is doing it” confirmation using top.

Walkthrough

Step 1 : Check disk usage across mounted file systems.
Command
df -h

This is the fastest way to spot capacity risk. A nearly full filesystem can degrade performance (blocked writes, failed logging, unpredictable service behavior) and can prevent recovery actions like updates or package installs.

# Look for mountpoints near capacity:
# /dev/sda1  30G  28G  1.0G  97%  /
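
To apply the same check across many mounts, the threshold test can be sketched as a small filter. This is a minimal sketch rather than part of the lab's required commands: the function name flag_full_mounts and the 90% default are assumptions, and it reads df -P (POSIX format) instead of df -h so each filesystem stays on one line with fixed columns. Mountpoints that contain spaces are not handled.

```shell
# Print "mountpoint use%" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 90).
# flag_full_mounts is a hypothetical helper name, not a standard tool.
flag_full_mounts() {
  awk -v t="${1:-90}" 'NR > 1 {
    pct = $5              # Capacity column, for example "97%"
    sub(/%/, "", pct)     # strip the percent sign
    if (pct + 0 >= t) print $6, $5
  }'
}

# Usage: df -P | flag_full_mounts 90
```

For the example line above, this would print the root mountpoint together with its 97% figure.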

Step 2 : Review recent kernel messages for hardware warnings.
Command
dmesg | tail

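Recent dmesg output often separates “system is slow” from “storage is failing.” I/O errors, timeouts, and filesystems remounting read-only are high-severity signals that should trigger escalation and data-protection actions. tail shows only the last 10 lines, so widen the window if the incident started earlier. On hosts where kernel.dmesg_restrict is enabled, dmesg requires root privileges.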

# High-severity patterns include:
# blk_update_request: I/O error, dev sda, ...
# EXT4-fs (sda1): Remounting filesystem read-only
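
Because the last few lines may not contain the relevant message, it can help to filter the whole kernel buffer for the patterns listed above. The sketch below assumes a hypothetical helper name, storage_errors, and an illustrative pattern list that is not exhaustive; adjust it to the drivers and filesystems on your host.

```shell
# Keep only kernel log lines that match common storage-failure wording.
# storage_errors is a hypothetical helper; the pattern list is an assumption.
storage_errors() {
  grep -Ei 'I/O error|remounting filesystem read-only|timed out|timeout'
}

# Usage: dmesg | storage_errors
```

An empty result does not prove the storage is healthy; it only means none of these patterns appear in the current buffer.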

Step 3 : Check current memory and swap usage.
Command
free -h

Memory pressure can present as slowness, timeouts, and thrashing. Use available as the practical figure: it estimates how much memory new workloads can claim without swapping, and it already counts page cache the kernel can reclaim. Persistent swap growth usually means you are running short on RAM.

# Watch for low available memory and active swap usage.
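
To turn “low available memory” into a number you can record, one option is to read the same kernel counters that free summarizes. This is a sketch under stated assumptions: the function name mem_summary is hypothetical, and it expects /proc/meminfo-style input with MemTotal, MemAvailable, SwapTotal, and SwapFree fields in kB (MemAvailable is present on Linux 3.14 and later).

```shell
# Summarize memory pressure from /proc/meminfo-style input on stdin.
# mem_summary is a hypothetical helper name.
mem_summary() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    /^SwapTotal:/    { st = $2 }
    /^SwapFree:/     { sf = $2 }
    END {
      if (total > 0)
        printf "available_pct=%d swap_used_kb=%d\n", avail * 100 / total, st - sf
    }'
}

# Usage: mem_summary < /proc/meminfo
```

Running it a few times over several minutes shows whether swap use is growing or stable.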

Step 4 : Inspect CPU and I/O pressure indicators.
Safety note

iostat is provided by the sysstat package. If it is not installed, you may need to install it during a maintenance window or use alternate tooling.

Command
iostat -x

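iostat -x helps you spot storage pressure and I/O wait. Run without an interval, it prints a single report averaged since boot, which can hide a problem that started recently; add an interval and count (for example iostat -x 1 3) and read the later reports to see current behavior. Elevated %iowait can indicate slow disks, saturated throughput, or failing hardware forcing retries. If CPU looks mostly idle while %iowait is high, storage is often the bottleneck.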

# Focus areas:
# avg-cpu: %iowait
# Per-device %util and await columns such as r_await and w_await (names vary by sysstat version).
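
If you sample iostat several times, pulling out just the %iowait values makes the trend easier to read. The sketch below assumes the usual sysstat layout, where the avg-cpu header is followed by one line of values in the order %user, %nice, %system, %iowait, %steal, %idle; iowait_samples is a hypothetical helper name, and locales that print decimal commas are not handled.

```shell
# Print the %iowait value from each avg-cpu block in iostat output on stdin.
# Assumes %iowait is the fourth value on the line after the "avg-cpu:" header.
iowait_samples() {
  awk '/^avg-cpu:/ { getline; print $4 }'
}

# Usage: iostat -x 1 3 | iowait_samples
# The first value printed is the since-boot average; later values are per interval.
```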

Step 5 : Confirm listening sockets for key services.
Command
ss -tuln

Listener checks confirm which services are exposed and whether expected ports are open. This is a fast way to distinguish “service is up” from “service is running but not reachable.” The -n flag keeps ports numeric instead of mapping them to service names, so 22 is shown as 22 rather than ssh, which keeps output unambiguous during incident response.

# Look for expected listeners:
# tcp LISTEN 0 0 0.0.0.0:22 ...
# tcp LISTEN 0 0 127.0.0.1:3306 ...
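
Checking for one expected port by eye is error-prone on busy hosts, so the lookup can be sketched as a filter over ss -tuln output. The helper name listener_addrs is an assumption, and the sketch relies on the local address being the fifth column, as in the sample lines above.

```shell
# Print "protocol local-address" for each socket bound to the given port.
# Reads `ss -tuln` output on stdin; $1 is the port number to look for.
# listener_addrs is a hypothetical helper name.
listener_addrs() {
  awk -v p="$1" 'NR > 1 {
    n = split($5, parts, ":")   # the port is the text after the last colon
    if (parts[n] == p) print $1, $5
  }'
}

# Usage: ss -tuln | listener_addrs 22
```

No output means nothing in the listing is bound to that port. A result such as 127.0.0.1 means the service is reachable only from the host itself.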

Step 6 : Launch a live process monitor.
Command
top

top provides a live view of load, CPU state (including I/O wait), memory usage, and the processes consuming resources right now. Use this to identify the immediate top consumers and decide whether you need to throttle, restart, or isolate a workload.

# Watch for:
# - Sustained high load with low %idle
# - High %wa (I/O wait)
# - A single process dominating CPU or memory
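
top is interactive, so it is awkward to paste into a ticket. A non-interactive ranking can be sketched with ps instead. The helper name top_by_column is hypothetical, and one caveat applies: the pcpu value from ps is averaged over the life of the process, so it will not match the instantaneous figure top displays.

```shell
# Rank process rows by a numeric column, highest first.
# Reads `ps -eo pid,pcpu,pmem,comm` output on stdin.
# $1 = column number to sort on, $2 = number of rows to keep (default 5).
# top_by_column is a hypothetical helper name.
top_by_column() {
  awk 'NR > 1' | sort -k "$1,$1nr" | head -n "${2:-5}"
}

# Usage: ps -eo pid,pcpu,pmem,comm | top_by_column 2 5   # by CPU
#        ps -eo pid,pcpu,pmem,comm | top_by_column 3 5   # by memory
```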

Common breakpoints

Root filesystem near 100%

If / is close to full, treat it as a priority incident. Free space immediately (logs, caches, crash dumps), then confirm services recover and writes are no longer blocked.
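
Before deleting anything, it helps to see where the space went. The following is a minimal sketch, assuming a hypothetical helper named largest_dirs that ranks du -k output; the /var starting point in the usage line is only an example, and -x keeps du on a single filesystem.

```shell
# Show the largest entries from `du -k` output on stdin, biggest first.
# $1 = number of rows to keep (default 10).
# largest_dirs is a hypothetical helper name.
largest_dirs() {
  sort -n -r | head -n "${1:-10}"
}

# Usage: du -xk /var 2>/dev/null | largest_dirs 10
```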

dmesg shows I/O errors or read-only remount

This is a strong indicator of failing storage or a degraded path. Capture logs, identify impacted mountpoints, and escalate. Do not keep retrying writes to a failing disk.

High swap usage and low available memory

Swapping can make a system feel “randomly slow.” Identify the top memory consumers and validate whether the workload is normal growth, a leak, or an undersized host.

ss output missing expected listener

If the port is not listening, confirm the service is running and bound to the correct interface. If it is listening only on 127.0.0.1 when you expect external access, fix bind configuration before chasing firewall rules.
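
The loopback-only case can be spotted mechanically. This sketch assumes a hypothetical helper named loopback_only, the local address in the fifth column of ss -tuln output, and IPv6 loopback printed in bracketed form ([::1]); it lists ports that are bound to a loopback address and to nothing else.

```shell
# List ports that are bound only to loopback addresses.
# Reads `ss -tuln` output on stdin.
# loopback_only is a hypothetical helper name.
loopback_only() {
  awk 'NR > 1 {
    n = split($5, parts, ":")
    port = parts[n]
    addr = substr($5, 1, length($5) - length(port) - 1)
    if (addr ~ /^(127\.|\[::1\])/) lo[port] = 1
    else ext[port] = 1
  }
  END { for (p in lo) if (!(p in ext)) print p }'
}

# Usage: ss -tuln | loopback_only
```

A port that appears here but should serve remote clients points to bind configuration rather than the firewall.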

Cleanup checklist

This lab is read-only. Your cleanup is to record evidence (outputs and timestamps) and revert any temporary terminal filters or paging choices you used during triage.
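
Recording evidence is easier if every output is saved with the time and the exact command that produced it. The sketch below is one way to do that, not a prescribed procedure: capture is a hypothetical helper, and the /tmp/triage directory in the usage lines is an assumed location you should replace with wherever your team stores incident artifacts.

```shell
# Run a command and save its output with a UTC timestamp and the command line.
# $1 = output directory, $2 = short name for the file, rest = command to run.
# capture is a hypothetical helper name.
capture() {
  out_dir="$1"; name="$2"; shift 2
  mkdir -p "$out_dir" || return 1
  {
    printf '# captured: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf '# command: %s\n' "$*"
    "$@" 2>&1
  } > "$out_dir/$name.txt"
}

# Usage: capture /tmp/triage df df -h
#        capture /tmp/triage dmesg_tail sh -c 'dmesg | tail'
```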

Commands
df -h
dmesg | tail
free -h
iostat -x
ss -tuln
top

Success signal

You can state the bottleneck (capacity, memory pressure, storage latency, or process saturation) and you have enough command output to justify next actions.

Reference

  • df -h : Shows disk usage for mounted filesystems in human-readable units.
    • -h : Prints sizes in human-readable units (GiB/MiB).
  • dmesg | tail : Displays the most recent kernel messages.
    • | : Pipes output from the left command into the right command.
    • tail : Shows the last lines of output.
  • free -h : Shows memory and swap usage in human-readable units.
    • -h : Prints sizes in human-readable units (GiB/MiB).
  • iostat -x : Displays extended CPU and device I/O statistics.
    • -x : Enables extended per-device statistics.
  • ss -tuln : Lists TCP/UDP listening sockets with numeric ports.
    • -t : TCP sockets.
    • -u : UDP sockets.
    • -l : Listening sockets.
    • -n : Numeric output (no service name resolution).
  • top : Interactive real-time view of processes, CPU, memory, and load averages.