Monitor CPU/RAM/Disk Like a Pro with Get‑Counter

Perf issues don’t announce themselves politely. They show up as “the app feels slow,” “RDP is laggy,” “SQL is stuck,” or my favorite: “it was fine yesterday.” By the time you open Task Manager, the spike has already left the scene of the crime.

PowerShell’s Get-Counter is your surveillance camera: always on, timestamped, automatable, and good in court. If you learn to read a handful of counters like an SRE reads graphs, you can separate CPU starvation from memory pressure from disk latency in minutes—not hours of vibes-based debugging.

Why Get‑Counter beats clicking around

PerfMon is fine. Task Manager is fine. But “fine” is not what you need at 02:13 when a file server starts timing out and you have 15 minutes before a VP finds your phone number.

Get-Counter wins because it’s:

  • Scriptable: repeatable sampling intervals, consistent output, easy exports.
  • Remote-friendly: the same command can run against multiple hosts.
  • Time-series oriented: you can capture a spike, not just stare at a moment.
  • Composable: pipe it, filter it, aggregate it, schedule it.

Also, you don’t need to “remember what you clicked” during an incident. The command itself is the record.

One practical rule: stop asking “is the CPU high?” and start asking “which resource is the limiting reagent right now?” CPU, memory, and disk are a three-way knife fight. Your job is to identify who’s holding the knife.

Joke #1: If you don’t baseline your counters, every graph is “anomaly detection” powered by panic and coffee.

Interesting facts and history (that actually helps)

  • Windows performance counters predate PowerShell. They’ve been around since the NT era; PowerShell later became a convenient way to query them without the GUI.
  • PerfMon is just a client. The counter data comes from providers (like PerfProc, PerfOS) and instrumentation inside the OS and drivers.
  • Instances can disappear. Process counters use instances like chrome#3; when processes restart, instance names can shift, breaking naive automation.
  • Some “classic” counters lie by omission. Disk queue length can be misleading on modern storage stacks with caching and parallelism; latency is usually the better truth.
  • Hypervisors changed what “CPU %” means. Ready time, stolen time, and host scheduling can cause slowdowns even when guest CPU looks moderate.
  • Counter types matter. Some values are rates (per second), others are raw counts, others are fractions; CookedValue is PowerShell doing the math for you.
  • Sampling interval changes interpretation. A 1-second sample catches spikes; a 30-second sample hides them. That’s not philosophy—it’s math.
  • Many counters are computed, not measured. For example, “% Processor Time” is derived from idle time deltas, not a magical CPU meter.
  • Remote querying uses RPC/Perf infrastructure. It can be blocked by firewall policy, service hardening, or permissions even when WinRM works.
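The counter-type distinction is easy to see for yourself: each sample carries both the raw tick data and the computed value. A minimal sketch (any rate counter works; % Processor Time is just a convenient one):

```powershell
# RawValue is the underlying tick count from the provider;
# CookedValue is the percentage PowerShell computes from it.
(Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples |
    Select-Object Path, RawValue, CookedValue
```

If the RawValue looks like a huge meaningless integer, that's the point: for rate and fraction counter types, the cooked math is what makes it human-readable.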

A working mental model: CPU vs RAM vs Disk

CPU bottlenecks: fast problems, obvious symptoms, misleading “fixes”

CPU problems often show up as high % Processor Time, long run queues, and sluggish response everywhere. But CPU is also where people get trapped: they see 80–90% CPU and immediately ask for more cores. Sometimes that’s right. Often it’s a band-aid covering bad code, an aggressive antivirus scan, a runaway log formatter, or a busy-wait loop that should have been a sleep.

CPU is also where virtualization hides sins. A guest can look “fine” while the host is oversubscribed and your VM is waiting to be scheduled. If you only measure inside the guest, you can miss the real problem entirely.

Memory bottlenecks: the slow-motion disaster

Memory pressure is the kind of problem that ruins your day slowly, then suddenly. Windows will try very hard to keep things running by trimming working sets and paging. By the time users complain, you’re often already in “paging treadmill” territory: disk starts to thrash, CPU starts to spike, and everything becomes inconsistent. That’s why memory diagnosis must include both available memory and paging activity.

Disk bottlenecks: latency is the king metric

Disk issues are rarely about throughput ceilings; they’re about latency and tail latency. A storage system can do impressive MB/s and still ruin your app if reads take 50–200ms during bursts. For most Windows workloads, if you can answer “what’s the read/write latency right now?” you’re halfway to the fix.

And yes, “disk” could mean a SAN, a cloud volume, Storage Spaces, a RAID controller cache, an antivirus filter driver, or a filesystem metadata storm. Counters won’t name the villain automatically, but they will tell you whether to chase I/O, CPU, or memory first.

One quote that belongs on every on-call runbook, because it keeps you honest:

“Hope is not a strategy.” — attributed to General Gordon R. Sullivan

Use counters. Not hope.

Practical tasks (commands + output + decisions)

Below are real tasks you can run during incidents or for baselining. Each includes: the command, what the output means, and the decision you make from it.

Task 1: List available counters for CPU, memory, disk

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter -ListSet Processor,Memory,PhysicalDisk | Select-Object -ExpandProperty Counter"
\Processor(*)\% Processor Time
\Processor(*)\% Privileged Time
\Processor(*)\% User Time
\Processor(*)\Interrupts/sec
\Processor(*)\% Idle Time
\Memory\Available MBytes
\Memory\Committed Bytes
\Memory\% Committed Bytes In Use
\Memory\Cache Faults/sec
\Memory\Pages/sec
\PhysicalDisk(*)\Avg. Disk sec/Read
\PhysicalDisk(*)\Avg. Disk sec/Write
\PhysicalDisk(*)\Disk Reads/sec
\PhysicalDisk(*)\Disk Writes/sec
\PhysicalDisk(*)\Current Disk Queue Length

What it means: Your system exposes these counters; naming varies by OS version and installed roles. You can’t query what isn’t present.

Decision: Choose counters by what question you’re answering (latency, throughput, saturation), not by what looks familiar.

Task 2: Quick CPU snapshot (overall)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Processor(_Total)\% Processor Time' | Select-Object -ExpandProperty CounterSamples | Select-Object Path,CookedValue"
Path                                         CookedValue
----                                         -----------
\\server\processor(_total)\% processor time       37.248

What it means: CookedValue is a percentage. A single sample is a hint, not a verdict.

Decision: If it’s high, don’t act yet. Take a short time series next.

Task 3: CPU time series (catch spikes)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 10 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp,CookedValue"
TimeStamp                CookedValue
---------                -----------
2/5/2026 2:13:01 AM      41.1
2/5/2026 2:13:02 AM      92.3
2/5/2026 2:13:03 AM      88.7
2/5/2026 2:13:04 AM      54.9
2/5/2026 2:13:05 AM      39.2
2/5/2026 2:13:06 AM      36.8
2/5/2026 2:13:07 AM      35.5
2/5/2026 2:13:08 AM      34.9
2/5/2026 2:13:09 AM      35.2
2/5/2026 2:13:10 AM      36.1

What it means: You had a real spike (90%+) for a couple seconds. That could be normal (GC, compaction, scheduled tasks) or pathological.

Decision: If spikes correlate with user pain, find whether the CPU is doing user work or kernel work next.

Task 4: CPU user vs privileged time (kernel pressure)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Processor(_Total)\% User Time','\Processor(_Total)\% Privileged Time' -SampleInterval 1 -MaxSamples 5 | Select-Object -ExpandProperty CounterSamples | Select-Object Path,CookedValue"
Path                                         CookedValue
----                                         -----------
\\server\processor(_total)\% user time              22.4
\\server\processor(_total)\% privileged time        18.7
\\server\processor(_total)\% user time              23.1
\\server\processor(_total)\% privileged time        41.2
\\server\processor(_total)\% user time              21.9
\\server\processor(_total)\% privileged time        39.5

What it means: Privileged time spiking suggests kernel work: drivers, storage stack, antivirus filter drivers, heavy networking, context switching, or interrupt storms.

Decision: If privileged time is high, look at disk latency and interrupts before you blame “the application.”

Task 5: Processor queue length (are threads waiting?)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\System\Processor Queue Length' -SampleInterval 1 -MaxSamples 10 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp,CookedValue"
TimeStamp                CookedValue
---------                -----------
2/5/2026 2:14:01 AM      0
2/5/2026 2:14:02 AM      2
2/5/2026 2:14:03 AM      8
2/5/2026 2:14:04 AM      9
2/5/2026 2:14:05 AM      7
2/5/2026 2:14:06 AM      1
2/5/2026 2:14:07 AM      0
2/5/2026 2:14:08 AM      0
2/5/2026 2:14:09 AM      1
2/5/2026 2:14:10 AM      0

What it means: This is runnable threads waiting for CPU. On a multi-core system, interpret it relative to core count. Spikes can be normal; sustained queue suggests CPU contention.

Decision: If queue length stays above a few per core for sustained intervals, you either need to reduce CPU demand or add CPU capacity. If it spikes briefly, chase the spike source.
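The "relative to core count" rule is worth automating so nobody eyeballs it at 02:00. A small sketch, assuming the standard Win32_ComputerSystem CIM class; the ~2-per-core threshold in the comment is a heuristic, not a law:

```powershell
# Normalize processor queue length by logical core count.
# As a rough heuristic, a sustained ratio above ~2 runnable
# threads per core suggests genuine CPU contention.
$cores = (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors
$queue = (Get-Counter '\System\Processor Queue Length').CounterSamples[0].CookedValue
[pscustomobject]@{
    Cores        = $cores
    QueueLength  = $queue
    QueuePerCore = [math]::Round($queue / $cores, 2)
}
```

Run it a few times in a row; a single per-core reading means little, but a string of them tells you whether threads are actually waiting.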

Task 6: Memory availability (the simplest “are we tight?” check)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Memory\Available MBytes' -SampleInterval 2 -MaxSamples 5 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp,CookedValue"
TimeStamp                CookedValue
---------                -----------
2/5/2026 2:15:01 AM      612
2/5/2026 2:15:03 AM      590
2/5/2026 2:15:05 AM      571
2/5/2026 2:15:07 AM      548
2/5/2026 2:15:09 AM      530

What it means: Available memory is trending down. The absolute “bad” number depends on role. Domain controllers and file servers tolerate less free memory than memory-hungry database servers.

Decision: If it’s low and falling, check committed bytes and paging activity before you declare “need more RAM.”

Task 7: Commit pressure (are we approaching the commit limit?)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Memory\% Committed Bytes In Use','\Memory\Committed Bytes' | Select-Object -ExpandProperty CounterSamples | Select-Object Path,CookedValue"
Path                                       CookedValue
----                                       -----------
\\server\memory\% committed bytes in use          91.6
\\server\memory\committed bytes           2.941943e+10

What it means: Near 90% commit use is a warning flare. Commit is virtual memory that must be backed by RAM or pagefile(s). If commit hits 100%, allocations fail and services fall over in creative ways.

Decision: At sustained 85–95%+ commit, reduce memory usage, fix leaks, or increase RAM/pagefile. Don’t wait for 100%.
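To see how close you actually are to the wall, compare Committed Bytes against the commit limit directly. A sketch, assuming the standard \Memory\Commit Limit counter is present:

```powershell
# Commit limit = physical RAM + pagefile(s). Hitting it means
# allocations fail, which services handle with varying grace.
$s = (Get-Counter '\Memory\Committed Bytes','\Memory\Commit Limit').CounterSamples
$committed = ($s | Where-Object Path -like '*committed bytes').CookedValue
$limit     = ($s | Where-Object Path -like '*commit limit').CookedValue
'{0:N1} GB committed of {1:N1} GB limit ({2:P1})' -f ($committed / 1GB), ($limit / 1GB), ($committed / $limit)
```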

Task 8: Paging rate (is the system thrashing?)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\Memory\Pages/sec' -SampleInterval 1 -MaxSamples 10 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp,CookedValue"
TimeStamp                CookedValue
---------                -----------
2/5/2026 2:16:01 AM      12
2/5/2026 2:16:02 AM      18
2/5/2026 2:16:03 AM      220
2/5/2026 2:16:04 AM      410
2/5/2026 2:16:05 AM      395
2/5/2026 2:16:06 AM      205
2/5/2026 2:16:07 AM      44
2/5/2026 2:16:08 AM      16
2/5/2026 2:16:09 AM      14
2/5/2026 2:16:10 AM      13

What it means: A burst of paging can be normal. Sustained high paging alongside low available memory and rising disk latency is classic memory pressure.

Decision: If paging is sustained and disk latency climbs, treat memory as the primary bottleneck even if “disk is busy.” Disk is the victim here.

Task 9: Disk latency per volume (the money shot)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\PhysicalDisk(*)\Avg. Disk sec/Read','\PhysicalDisk(*)\Avg. Disk sec/Write' | Select-Object -ExpandProperty CounterSamples | Sort-Object CookedValue -Descending | Select-Object -First 8 Path,CookedValue"
Path                                             CookedValue
----                                             -----------
\\server\physicaldisk(1 d:)\avg. disk sec/write        0.187
\\server\physicaldisk(1 d:)\avg. disk sec/read         0.142
\\server\physicaldisk(0 c:)\avg. disk sec/write        0.021
\\server\physicaldisk(0 c:)\avg. disk sec/read         0.009

What it means: Latency is in seconds. 0.187 sec is 187ms write latency, which is bad for most transactional workloads. Reads at 142ms are also not great.

Decision: If a specific volume shows high latency, focus investigation there: what workload is hitting it, what changed, and whether it’s capacity, path, or backend saturation.
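Because CookedValue here is in seconds, a small projection into milliseconds makes triage output readable at a glance. A sketch along the lines of Task 9:

```powershell
# Same latency counters, but reported in milliseconds so nobody
# has to mentally decode 0.187 under incident pressure.
(Get-Counter '\PhysicalDisk(*)\Avg. Disk sec/Read','\PhysicalDisk(*)\Avg. Disk sec/Write').CounterSamples |
    Sort-Object CookedValue -Descending |
    Select-Object Path,
        @{ Name = 'LatencyMs'; Expression = { [math]::Round($_.CookedValue * 1000, 1) } }
```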

Task 10: Disk queue length (context, not a verdict)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\PhysicalDisk(*)\Current Disk Queue Length' | Select-Object -ExpandProperty CounterSamples | Sort-Object CookedValue -Descending | Select-Object -First 6 Path,CookedValue"
Path                                                   CookedValue
----                                                   -----------
\\server\physicaldisk(1 d:)\current disk queue length           23
\\server\physicaldisk(0 c:)\current disk queue length            1

What it means: Requests waiting on disk. A queue of 23 can be fine on a deep, parallel storage system—or awful on a single SATA disk. Without latency, queue length is half a story.

Decision: Use queue length to support the latency conclusion, not replace it.

Task 11: Disk throughput (are we saturating bandwidth?)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter '\PhysicalDisk(*)\Disk Read Bytes/sec','\PhysicalDisk(*)\Disk Write Bytes/sec' | Select-Object -ExpandProperty CounterSamples | Sort-Object CookedValue -Descending | Select-Object -First 10 Path,CookedValue"
Path                                              CookedValue
----                                              -----------
\\server\physicaldisk(1 d:)\disk write bytes/sec  8.941122e+07
\\server\physicaldisk(1 d:)\disk read bytes/sec   2.110294e+07
\\server\physicaldisk(0 c:)\disk read bytes/sec   3.901220e+06
\\server\physicaldisk(0 c:)\disk write bytes/sec  1.240122e+06

What it means: Bytes/sec. On D: you’re doing ~89 MB/s writes and ~21 MB/s reads. That might be normal for a backup job, terrible for a latency-sensitive DB, or both.

Decision: If throughput is high and latency is also high, you’re saturating something. If throughput is low but latency is high, you have contention, backend issues, or a pathological IO pattern.

Task 12: Identify which process is eating CPU (top offenders)

cr0x@server:~$ powershell -NoProfile -Command "Get-Process | Sort-Object CPU -Descending | Select-Object -First 8 Name,Id,CPU,WorkingSet64"
Name          Id   CPU WorkingSet64
----          --   --- ------------
w3wp        4120 812.3   1245184000
MsMpEng     2788 410.7    512245760
sqlservr    1556 209.1   8421191680
svchost     1020  88.5    210796544

What it means: This is cumulative CPU time since process start, not instant CPU%. Still useful: if something has absurd CPU time over a short uptime, it’s a candidate.

Decision: If CPU is high right now, correlate this with time series counters; if the top offender is security scanning, reschedule or tune exclusions (carefully, with governance).

Task 13: Remote CPU check across multiple servers

cr0x@server:~$ powershell -NoProfile -Command "$servers='app01','app02','db01'; Get-Counter '\Processor(_Total)\% Processor Time' -ComputerName $servers | Select-Object -ExpandProperty CounterSamples | Select-Object Path,TimeStamp,CookedValue"
Path                                         TimeStamp                CookedValue
----                                         ---------                -----------
\\app01\processor(_total)\% processor time   2/5/2026 2:17:01 AM             18.2
\\app02\processor(_total)\% processor time   2/5/2026 2:17:01 AM             22.7
\\db01\processor(_total)\% processor time    2/5/2026 2:17:01 AM             79.4

What it means: You just established blast radius: the DB is hot, apps are not. That’s how you avoid random walk debugging.

Decision: Focus on db01 next: disk latency, buffer cache behavior, paging, and query pressure.

Task 14: Export a short capture to CSV for evidence

cr0x@server:~$ powershell -NoProfile -Command "$c='\Processor(_Total)\% Processor Time','\Memory\Available MBytes','\Memory\Pages/sec','\PhysicalDisk(_Total)\Avg. Disk sec/Read','\PhysicalDisk(_Total)\Avg. Disk sec/Write'; Get-Counter $c -SampleInterval 2 -MaxSamples 30 | Export-Counter -Path C:\temp\triage.blg"

What it means: You captured a 60-second window into a BLG file. That’s PerfMon-native and can be opened later or parsed.

Decision: During incidents, always capture evidence before restarting things. “We rebooted and it went away” is not a root cause.

Task 15: Parse a BLG and convert to CSV (shareable)

cr0x@server:~$ powershell -NoProfile -Command "Import-Counter C:\temp\triage.blg | Export-Counter -FileFormat CSV -Path C:\temp\triage.csv"

What it means: Now you have a CSV you can open in Excel, ingest into a time-series store, or diff against a baseline.

Decision: Use the CSV to show trends (latency rising with paging, CPU spiking with kernel time) and to avoid “it feels like” debates.
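For the ticket, min/avg/max per counter usually beats raw rows. A sketch that summarizes the capture from Task 14 (file path assumed from that task):

```powershell
# Collapse a BLG capture into one summary row per counter path.
Import-Counter C:\temp\triage.blg |
    ForEach-Object { $_.CounterSamples } |
    Group-Object Path |
    ForEach-Object {
        $stats = $_.Group | Measure-Object CookedValue -Minimum -Average -Maximum
        [pscustomobject]@{
            Counter = $_.Name
            Min     = [math]::Round($stats.Minimum, 2)
            Avg     = [math]::Round($stats.Average, 2)
            Max     = [math]::Round($stats.Maximum, 2)
        }
    }
```

The max column is the one that settles arguments: averages hide the spike that users actually felt.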

Task 16: Monitor a specific volume instance reliably (avoid the wrong instance trap)

cr0x@server:~$ powershell -NoProfile -Command "Get-Counter -ListSet PhysicalDisk | Select-Object -ExpandProperty PathsWithInstances | Where-Object { $_ -like '*Avg. Disk sec/Read*' } | Select-Object -First 8"
\PhysicalDisk(0 C:)\Avg. Disk sec/Read
\PhysicalDisk(1 D:)\Avg. Disk sec/Read
\PhysicalDisk(2 E:)\Avg. Disk sec/Read
\PhysicalDisk(_Total)\Avg. Disk sec/Read

What it means: You’re enumerating exact instance paths. That saves you from sampling an instance that doesn’t exist on another server (or changed names).

Decision: Always discover instance paths programmatically when writing scripts that run across fleets.
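Putting the two halves together (discover, then sample) looks roughly like this; the filter pattern is an example, not the only sensible one:

```powershell
# Enumerate the instance paths that actually exist on this host,
# then sample exactly those. No hard-coded disk numbers.
$paths = (Get-Counter -ListSet PhysicalDisk).PathsWithInstances |
    Where-Object { $_ -like '*Avg. Disk sec/*' -and $_ -notlike '*_Total*' }
Get-Counter -Counter $paths -SampleInterval 1 -MaxSamples 5
```

The same pattern works for Network Interface and Process counter sets, where instance names drift even more.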

Fast diagnosis playbook

This is the “I have five minutes” sequence. You’re not proving a thesis. You’re finding the bottleneck fast enough to stop the bleeding.

First: establish the symptom scope

  1. Single host or many? Sample CPU quickly across suspected servers (Task 13). If only one box is hot, don’t boil the ocean.
  2. Single volume or many? Check disk latency per volume (Task 9). If only D: is suffering, don’t touch C:.

Second: identify the limiting resource

  1. CPU saturation? Look at \Processor(_Total)\% Processor Time and \System\Processor Queue Length (Tasks 3 and 5). High CPU without queue can still be “busy but coping.” High queue means threads are waiting.
  2. Memory pressure? Check \Memory\Available MBytes, \Memory\% Committed Bytes In Use, and \Memory\Pages/sec (Tasks 6–8). Low available + high commit + sustained paging is the signature.
  3. Disk latency? Check \PhysicalDisk(*)\Avg. Disk sec/Read and Write (Task 9). If latency is high, everything above it will look bad.

Third: decide whether to mitigate or investigate

  • If CPU is the bottleneck: find top processes, check privileged time (Tasks 4 and 12). Mitigate with throttling, rescheduling batch jobs, or temporarily scaling out. Don’t restart services blindly unless you have a leak or runaway thread.
  • If memory is the bottleneck: stop the growth (leaking process, runaway cache, misconfigured service). Mitigate by reducing load, restarting the culprit if necessary, and sizing pagefile appropriately. If you’re paging hard, disk will look guilty—ignore the red herring.
  • If disk latency is the bottleneck: identify the volume and workload. Mitigate by pausing heavy jobs, moving temp paths, verifying backend health, or failing over if you have the architecture. Throwing CPU at storage latency is like yelling at a loading bar.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

A mid-sized enterprise had a critical Windows file server that “randomly froze” every weekday morning. The first responder’s assumption was simple: “CPU spikes at 9 AM, so it’s CPU.” They pushed for a bigger VM and more vCPUs. The request was approved because it sounded reasonable, and everyone loves a clean procurement story.

After the change, the freeze still happened. CPU looked lower, but user experience didn’t improve. That should have been the clue: lowering CPU utilization without improving latency means CPU wasn’t the limiter.

The second responder did something boring: they captured a one-minute counter trace during the event. \PhysicalDisk(…)\Avg. Disk sec/Write went from single-digit milliseconds to hundreds of milliseconds. At the same time, \Memory\Pages/sec was elevated and \Processor(_Total)\% Privileged Time spiked. The storage subsystem was timing out under a write-heavy workload.

The culprit wasn’t “disk is slow” in the abstract. It was a scheduled job that dumped millions of tiny files into a single directory on a volume hosted on shared storage, at the exact time VSS snapshots also ran. Metadata churn plus snapshot IO plus backend contention: a perfect storm of boring Windows behaviors.

The fix wasn’t bigger CPU. The fix was rescheduling the job, spreading output across directories, adjusting snapshot timing, and adding a separate volume for that workload. The money saved on unnecessary CPU capacity paid for actual storage improvements later.

Mini-story #2: The optimization that backfired

An application team wanted faster deployments. They tuned a Windows service to cache more data in memory, aiming to reduce database calls. It worked in staging. In production, it worked too—right until traffic ramped and the cache grew beyond its “soft target.”

Memory usage climbed slowly. No one noticed because the servers had “plenty of RAM” and Windows didn’t complain loudly. But \Memory\% Committed Bytes In Use crept up. Then paging began. Disk latency rose. CPU privileged time rose. The system entered the classic paging treadmill: everything was technically “up,” but response times were terrible.

The first attempted fix was to move the pagefile to a faster volume. It helped a bit, just enough to mask the problem and prolong the incident. That’s the trap: optimizing paging is like installing nicer handrails on the stairs to a basement you shouldn’t be living in.

The correct fix was to cap the cache, add eviction, and separate “hot” objects from “nice to have” objects. They also added a basic counter-based alert: if commit use stayed high and paging exceeded a threshold for several minutes, the team got paged before users did.

The lesson: some performance “optimizations” are really resource consumption strategies. If you don’t measure commit and paging, you can ship a time bomb with a feature flag.

Mini-story #3: The boring but correct practice that saved the day

A finance environment ran batch jobs overnight: imports, report generation, and ETL on Windows servers connected to a shared storage system. The jobs were predictable, but the environment had one thing most places lack: baselines. Every week, a scheduled task captured a short counter trace during peak batch windows and archived it.

One night, reports started missing deadlines. The operations team didn’t argue about who changed what. They pulled the week-over-week counter captures and compared disk latency and throughput. Latency jumped on a single volume while throughput stayed flat. That pattern screamed “backend contention or storage path issue,” not “workload increased.”

They then checked a few more servers. Same volume, same time window, same latency signature. That established it wasn’t a single noisy neighbor inside one VM. It was systemic.

Storage engineers found a path failover event on the SAN side that didn’t fully recover to optimal paths, leaving traffic on a suboptimal route. Fixing the path restored latency immediately, and the batch jobs finished within their normal window.

This wasn’t heroism. It was routine evidence collection. The boring discipline—baseline traces—turned a potential all-night war room into a 30-minute fix.

Joke #2: The most reliable monitoring system is the one you set up before your boss learns the word “latency.”

Common mistakes (symptom → root cause → fix)

1) “CPU is only 40% but the server is slow”

Symptom: Users see timeouts; CPU looks moderate.

Root cause: Disk latency or memory pressure causing threads to block on I/O; CPU appears “free” because threads are waiting.

Fix: Check \PhysicalDisk(*)\Avg. Disk sec/Read/Write and \Memory\Pages/sec. If latency is high, fix I/O path or workload. If paging is high, fix memory pressure.

2) “Disk queue length is high, therefore storage is the problem”

Symptom: Queue length spikes; someone pings the storage team.

Root cause: Queue length is context-dependent; it rises under normal parallel I/O and caching. It can also rise during memory pressure paging.

Fix: Use latency as the primary indicator. Queue length without high latency is not an incident. Queue length with high latency is actionable.

3) “We fixed it by adding vCPUs”

Symptom: CPU % decreased after adding cores; users still complain.

Root cause: CPU wasn’t the bottleneck; you just changed the denominator. Or you introduced scheduling/NUMA issues.

Fix: Validate with \System\Processor Queue Length and disk/memory counters. If queue wasn’t high before, CPU wasn’t the limiter. Revert if it complicates NUMA placement.

4) “Available MBytes is low, we’re out of memory”

Symptom: Low available memory; alarms fire; someone orders RAM.

Root cause: Windows uses memory aggressively for cache; low available alone is not proof of pressure.

Fix: Confirm with % Committed Bytes In Use and Pages/sec. Low available + high commit + sustained paging is pressure. Low available + low paging may be fine.

5) “Paging is nonzero, so it’s bad”

Symptom: Pages/sec shows activity; panic ensues.

Root cause: Paging bursts happen; the OS trims and manages working sets. Nonzero is normal; sustained high is not.

Fix: Trend it. Take 60–120 seconds of samples and correlate with latency and user pain.

6) “Get-Counter output is weird; the numbers look scientific”

Symptom: Values like 2.94e+10 appear.

Root cause: PowerShell formats large numbers in scientific notation.

Fix: Format output explicitly (e.g., round/convert units) when reporting to humans. Don’t change how you collect; change how you present.
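One way to do that presentation step, shown here for Committed Bytes as an example:

```powershell
# Collect as usual; convert to human units only at display time.
(Get-Counter '\Memory\Committed Bytes').CounterSamples |
    Select-Object Path,
        @{ Name = 'CommittedGB'; Expression = { [math]::Round($_.CookedValue / 1GB, 2) } }
```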

7) “Counters differ between servers, the script is broken”

Symptom: Remote queries fail or return missing instances.

Root cause: Instance naming differs (disks, NICs, processes); roles/features change available counter sets.

Fix: Discover PathsWithInstances per host before sampling. Avoid hard-coding instance names when you can query by pattern.

8) “We sampled every 60 seconds and saw nothing”

Symptom: Users complain about spikes; counters look calm.

Root cause: You chose a sampling interval that averages away the problem.

Fix: Use 1–2 second sampling during triage, then widen after you catch the signature.

Checklists / step-by-step plan

Checklist A: Build a baseline (do this once, thank yourself later)

  1. Pick 10–15 counters for your role (web/app/DB/file server). Minimum set:
    • \Processor(_Total)\% Processor Time
    • \System\Processor Queue Length
    • \Memory\Available MBytes
    • \Memory\% Committed Bytes In Use
    • \Memory\Pages/sec
    • \PhysicalDisk(*)\Avg. Disk sec/Read
    • \PhysicalDisk(*)\Avg. Disk sec/Write
  2. Capture for 5–15 minutes during known “normal busy” periods, not at 3 AM when nothing happens.
  3. Store BLG files in a predictable location with timestamps.
  4. Document what “normal” looks like: typical latency ranges, typical CPU peaks, and what jobs run when.

Checklist B: Incident capture (the minimum evidence kit)

  1. Start a 60–120 second capture at 1–2 second intervals for CPU/memory/disk latency (Task 14, tweak counters as needed).
  2. Record what users experience and when (minute-level is fine).
  3. Check whether the issue is host-local or fleet-wide (Task 13).
  4. If you must restart something, do it after you have at least one trace.

Checklist C: Turn triage into monitoring (so you stop reliving the same incident)

  1. Create a scheduled task that runs a short Get-Counter capture during peak windows.
  2. Alert on trends, not single points:
    • sustained high disk latency
    • sustained high commit use
    • sustained paging with low available memory
    • sustained CPU queue length
  3. Review weekly. Not because it’s fun. Because it’s cheaper than outages.
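A sustained-condition alert can be sketched in a few lines. The 50ms threshold and 60-second window below are illustrative assumptions; tune both against your baseline:

```powershell
# Fire only if disk write latency stays above 50ms for EVERY
# sample in a 60-second window (12 samples x 5 seconds).
$samples = Get-Counter '\PhysicalDisk(_Total)\Avg. Disk sec/Write' -SampleInterval 5 -MaxSamples 12
$values  = $samples.CounterSamples | ForEach-Object { $_.CookedValue }
if (($values | Where-Object { $_ -lt 0.050 }).Count -eq 0) {
    Write-Warning ('Sustained write latency: avg {0:N0} ms over 60s' -f (($values | Measure-Object -Average).Average * 1000))
}
```

Wrap it in a scheduled task and route Write-Warning to whatever pages you; the point is the "every sample" condition, which filters out harmless single-point spikes.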

FAQ

1) Is Get‑Counter accurate enough for real incident response?

Yes. It’s reading the same performance counter infrastructure PerfMon uses. The usual failure mode isn’t accuracy; it’s interpretation (wrong counter, wrong instance, wrong sampling interval).

2) Should I use CookedValue or RawValue?

Use CookedValue for most operational work. RawValue is for when you’re implementing your own math or validating counter types. In production triage, you want clarity.

3) What sampling interval should I use?

During triage: 1–2 seconds for 30–120 samples. For baseline: 5–15 seconds over 10–30 minutes. For long-term trending: 30–60 seconds is fine, but accept that you’ll miss micro-spikes.

4) Why does remote Get‑Counter fail when WinRM works?

Because performance counters and WinRM are different plumbing. Firewall rules, permissions, Remote Registry/service dependencies, or hardened policies can block counter collection even if you can remote PowerShell.

5) Is “Avg. Disk sec/Read” the same as disk latency?

Effectively yes: it’s average service time per read, in seconds, as observed by the OS. Multiply by 1000 to think in milliseconds. Track both read and write; they fail differently.

6) What’s a “good” disk latency value?

It depends on workload, but as a working SRE heuristic: single-digit milliseconds is healthy for many server workloads; tens of milliseconds is concerning; hundreds is an incident. Always compare to your baseline.

7) Why does CPU look low when users are timing out?

Because waiting is not CPU. Threads blocked on disk, network, locks, or paging don’t burn CPU. That’s why you look at queue length, paging, and I/O latency together.

8) Can I use Get‑Counter as a lightweight monitoring agent?

Yes, with discipline. Keep the counter set small, sampling sane, and outputs structured (BLG/CSV). Don’t poll hundreds of counters every second on production boxes and then act surprised when you add overhead.

9) How do I avoid the “wrong instance name” problem for disks and processes?

Enumerate instances first with PathsWithInstances, then sample the exact paths returned. For processes, prefer metrics tied to service names or IDs when possible, because instance names can change.

Conclusion: next steps you can actually do

  1. Pick your core counters (CPU total + queue, memory available + commit + paging, disk read/write latency per volume).
  2. Run a 2-minute capture during the next complaint instead of staring at Task Manager. Save it as BLG. Evidence first, opinions second.
  3. Baseline one “normal busy” window this week. Without a baseline, you’re guessing with confidence.
  4. Turn one incident lesson into an alert: sustained disk latency, sustained commit pressure, or sustained CPU queue. Choose the one that matches your last outage.
  5. Write down your thresholds as heuristics, not laws. Your environment will teach you what “bad” really looks like.

If you do only one thing: start measuring disk latency and commit pressure routinely. Those two catch a shocking number of “mysterious slow server” tickets before they turn into outages.
