QoS usually shows up after an incident. The storage replication got “slow,” the call quality turned into underwater jazz, and the CEO’s dashboard froze right when someone asked for it. Then somebody says the magic words: “Can we just set the priority higher?”
You can. And you can also light your network on fire with a well-meaning DSCP value. This is the guide for people who want QoS that you can prove works: measured bottlenecks, explicit trust boundaries, and shaping in the one place that actually matters.
QoS isn’t “priority,” it’s a contract
Real QoS is a contract between competing flows and a bottleneck. It says: “When contention exists, these classes get at least X, at most Y, and latency-sensitive flows don’t drown behind bulk transfers.” That’s it. Everything else is marketing.
Most QoS failures come from one of three errors:
- You shaped in the wrong place. You can’t fix congestion in the core by “prioritizing” packets at the edge if the real queue lives in an upstream ISP box you don’t control.
- You guessed priorities. “Storage is important” isn’t a class definition. It’s a feeling. Feelings don’t survive microbursts.
- You trusted markings you shouldn’t. If untrusted endpoints can mark EF, they will. Sometimes accidentally. Sometimes “because the vendor told us to.”
Good QoS is boring: it’s measurement plus rate control plus fairness. It is not a rainbow of DSCP values plastered across everything that ever dropped a packet.
One paraphrased idea from a person who built a lot of reliable systems: “Hope is not a strategy.”
— attributed to Gordon R. Dickson, often repeated in engineering circles.
Joke #1: QoS plans based on “what’s important” are like org charts—beautiful until you try to route traffic through them.
Interesting facts and short history (so you stop repeating it)
Some context helps because networking keeps reinventing the same mistakes with new acronyms.
- IntServ/RSVP came first, and it mostly lost. The 1990s vision was per-flow reservations across the network. It didn’t scale operationally, and DiffServ won by being simpler and more “good enough.”
- DiffServ was designed for aggregates. DSCP wasn’t meant for every app to self-declare importance; it was meant for network domains to classify and treat traffic in bulk.
- Bufferbloat became widely recognized in the late 2000s. Deep buffers in consumer gear made throughput look great in benchmarks while latency quietly exploded in real use.
- RED tried early to avoid queue buildup. Random Early Detection predates modern AQM popularity; it was powerful but finicky, which is why you didn’t see it everywhere.
- FQ-CoDel was a practical turning point. It blended fairness (flow queueing) with AQM (CoDel) and didn’t require per-link tuning magic.
- DSCP rewriting is common in the wild. Many ISPs bleach or remap DSCP at peering edges; QoS assumptions often die at the first external hop.
- Ethernet has its own priority bits. 802.1p PCP can work inside a LAN, but it’s not the same as IP DSCP, and mapping mistakes are a classic outage generator.
- “Priority” queues can starve everything else. Strict priority was always a loaded gun; it’s safe only with policing or caps.
- Wi‑Fi QoS is not wired QoS. 802.11e/WMM prioritization helps, but airtime contention, rate adaptation, and retries often dominate the user experience.
A mental model that survives production
1) Find the bottleneck, then control the queue at that bottleneck
QoS is only meaningful where packets queue. That’s usually at the narrowest link (uplink to ISP, WAN circuit, inter-AZ tunnel, VPN headend, Wi‑Fi airtime). If you don’t control the queue there, you’re negotiating with physics.
Shaping below the real link rate is the key trick. If your WAN is 1 Gbps but the provider’s policer kicks in at 940 Mbps, shape to 900–920 Mbps so your queue builds (where you can schedule fairly) instead of theirs (where you can’t).
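That trick sketched with cake, assuming the numbers above (eth0 as the WAN egress and a ~940 Mbps provider policer are placeholders; measure your own enforced rate first):

```shell
# Sketch: own the queue by shaping below the provider's policer.
# eth0 and 920mbit are placeholders -- shape a few percent under
# the rate you actually measured, not the contracted rate.
sudo tc qdisc replace dev eth0 root cake bandwidth 920mbit

# Confirm the shaper took effect and watch its drop/delay stats.
tc -s qdisc show dev eth0
```

If cake isn’t available in your kernel, the HTB + fq_codel combination covered later does the same job with more typing.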
2) Separate “latency-sensitive” from “bulk,” but don’t create ten classes
Most shops need three to five classes, not twelve:
- Realtime interactive: voice, video conf media, maybe some gaming-like telemetry. Low jitter budget.
- Interactive: SSH, RDP, small API requests, DNS. Low latency matters more than bandwidth.
- Default: most web and service traffic.
- Bulk: backups, replication, artifact pulls, large exports.
- Scavenger (optional): things you want to finish eventually but never at the expense of humans.
If you can’t articulate a class with a policy statement (“at least X, at most Y, latency target Z”), don’t create it. You’re not designing a subway map. You’re preventing fistfights at a single doorway.
3) Trust boundaries: endpoints lie, even when they don’t mean to
Marking can be done at endpoints, but enforcement must be done at a boundary you control: ToR switch, host vSwitch, edge router, WAN gateway. The network should treat endpoint markings as hints unless the endpoint is managed and audited.
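One way to implement that boundary on a Linux gateway, as a hedged nftables sketch (the table name and 10.20.0.0/24 stand in for your managed voice subnet; everything else gets bleached to best effort):

```shell
# Sketch: accept DSCP only from the managed voice subnet; reset the rest
# to CS0 as traffic is forwarded. Names and the subnet are hypothetical.
sudo nft add table inet qos
sudo nft add chain inet qos forward \
  '{ type filter hook forward priority mangle; policy accept; }'
sudo nft add rule inet qos forward ip saddr != 10.20.0.0/24 ip dscp set cs0
```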
4) Fairness is not optional
One elephant flow can ruin your day without using much bandwidth on average. Microbursts and queue buildup create latency spikes that look like “random slowness.” Fair queueing (FQ) isolates flows so that one bulk transfer doesn’t sit in front of everyone else like a truck blocking a one-lane bridge.
5) QoS is a feedback loop: measure, apply, verify, repeat
If you’re not looking at drops, ECN marks, queue delay, and class utilization, you’re doing interpretive dance with packet headers.
Fast diagnosis playbook
This is the “pager is screaming” version. The goal is to identify whether you have a capacity problem, a queueing problem, a classification problem, or a path problem in under 15 minutes.
First: confirm the symptom is real and locate the scope
- Is it latency, loss, or throughput? Ask for one concrete metric: RTT p95, packet loss %, goodput Mbps. “Slow” is not a metric.
- Is it one site, one VLAN, one VPN, one ISP, one host? Narrow the blast radius fast.
- Is it tied to load? If it correlates with backup windows or deploys, you already have a suspect class.
Second: find the bottleneck link and check if it’s queueing locally or upstream
- Check interface utilization and errors on the suspected egress.
- Check queue discipline stats (drops/overlimits) where you shape.
- Compare measured throughput vs contracted rate. If you’re hitting a provider policer, you’ll see loss without local drops unless you shape below it.
Third: validate classification and trust boundaries
- Capture a small sample and confirm DSCP/PCP are what you think.
- Verify that devices in the path preserve markings. One “helpful” switch can rewrite everything to best-effort.
- Confirm your scheduler is not strict-priority starving default traffic.
Fourth: decide which lever to pull
- If queueing is upstream: shape lower so your queue builds locally.
- If queueing is local and latency spikes: enable FQ + AQM (fq_codel/cake) for the affected egress.
- If the wrong traffic is in the wrong class: fix classification at the boundary (iptables/nftables, switch policy, or kube CNI policy).
- If you’re simply out of capacity: QoS can triage, not create bandwidth. Buy/upgrade or reduce load.
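If the decision lands on the local FQ + AQM lever, the change itself is small (eth0 as the affected egress is an assumption):

```shell
# Sketch: replace the root qdisc with fq_codel on the congested egress.
# This adds per-flow fairness and AQM; it does not set a rate, so pair
# it with shaping when the real queue is upstream.
sudo tc qdisc replace dev eth0 root fq_codel ecn
tc -s qdisc show dev eth0
```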
Where QoS actually works (and where it’s theater)
The only place scheduling matters: the egress queue
QoS works best on egress because that’s where the device controls transmission timing. Ingress QoS is mostly about policing (dropping) or remarking. You can’t “delay” inbound packets that already arrived; you can only drop them or rely on upstream to slow down.
Best places to enforce
- Internet/WAN edge router: your last chance before the provider. Shape here. Put the “smart queue” here.
- VPN gateway: encryption hides L4 ports; classification must happen before encapsulation (or based on inner headers if supported).
- Hypervisor/host: per-VM and per-pod fairness prevents one tenant from becoming the office vacuum cleaner.
- Wi‑Fi controller/AP: airtime fairness and WMM mapping can matter more than DSCP purity.
Places people try, and usually fail
- Core switches when the bottleneck is a WAN circuit. You’re prioritizing into a black hole. The queue is elsewhere.
- Random middleboxes with unknown buffer behavior. If you can’t observe queue delay, you’re tuning blind.
- Application-side “priority” knobs with no network enforcement. Congrats, you set a label.
Joke #2: Strict priority queues are like free pizza at an incident review—somebody always takes too much and the rest of the room gets resentful.
Classification without guessing: mark less, trust less
Start with “default” and “bulk,” then earn your way to “realtime”
The safest QoS strategy is not to divine which apps are “important.” It’s to separate traffic by behavior and harm potential:
- Bulk is identifiable: long-lived, high-BDP transfers, backups, replication, container image pulls, object storage sync. It can be delayed.
- Interactive is small and bursty: DNS, SSH, API calls. It needs low queueing delay, not massive bandwidth.
- Realtime is sensitive to jitter: voice/video media streams. It needs both low delay and bounded loss.
Notice what’s missing: “CEO traffic.” Don’t do that.
DSCP strategy: choose a tiny vocabulary
In practice, a small DSCP set is robust across devices:
- CS0 for default (best effort)
- AF21/AF31 for interactive (pick one and stick to it)
- EF for realtime media (use sparingly, police it)
- CS1 for scavenger/bulk-low (sometimes called “lower than best effort”)
If your network team already has a DSCP plan, don’t freestyle. Align and document the trust boundary: where marks are accepted, where they are rewritten, and where they are ignored.
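When reading captures, remember that the ToS byte tcpdump prints is the DSCP value shifted left two bits (the low two bits are ECN). A quick sketch that prints the small vocabulary above with its on-the-wire values:

```shell
# Map class name -> DSCP decimal -> ToS byte as seen in tcpdump.
for entry in CS0:0 CS1:8 AF21:18 AF31:26 EF:46; do
  name=${entry%%:*}
  dscp=${entry##*:}
  printf '%-5s dscp %2d  tos 0x%02x\n' "$name" "$dscp" $((dscp << 2))
done
```

EF comes out as tos 0xb8, which is exactly the value in the tcpdump examples later in this guide.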
Trust boundary patterns that don’t age badly
- Enterprise LAN: trust DSCP from managed voice/video endpoints; remark everything else at access/ToR.
- Data center: mark at workload edge (host/vSwitch) based on cgroup/pod identity; don’t trust guest VMs.
- Hybrid cloud: assume DSCP will be bleached at some boundary; enforce fairness with shaping + FQ regardless.
Policing: the price of strict priority
If you have a strict priority queue (for voice, for example), you must police it or cap it. Otherwise, a mis-marked bulk flow can starve everything. This is how “QoS made it worse” incidents are born.
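A hedged sketch of that cap on a Linux egress: classify EF-marked packets into their class, but police them so anything past the cap is dropped (the 1:10 classid and 20 Mbit rate are placeholders from a hypothetical layout):

```shell
# Sketch: police EF (tos 0xb8, mask 0xfc to ignore ECN bits) so a
# mis-marked bulk flow gets dropped past the cap instead of starving
# every other class. Rates and classids are placeholders.
sudo tc filter add dev eth0 parent 1: protocol ip prio 1 \
  u32 match ip dsfield 0xb8 0xfc flowid 1:10 \
  police rate 20mbit burst 200k conform-exceed drop/ok
```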
Shaping and queueing: pick the right knife
Shaping vs policing
- Shaping delays packets to fit a rate. It reduces loss, improves predictability, and makes TCP behave. It needs buffer and adds some latency, but controlled latency beats random latency.
- Policing drops packets exceeding a rate. It’s simple and harsh. Use it to enforce boundaries (especially on priority classes), not to “manage” normal traffic.
Queue disciplines that matter on Linux
On Linux, your default toolbox looks like this:
- fq_codel: strong default for latency control on typical links; great for “make the network feel normal again.”
- cake: excellent all-in-one for shaping + fairness + diffserv handling; especially popular at edges. Not always available in enterprise kernels, but when it is, it’s a gift.
- HTB: classful shaping with explicit rates/ceilings; operationally stable when you keep class count sane.
- TBF: simple token bucket; fine for single-rate shaping but not for multi-class fairness.
What I recommend for most production edges
- If you can use cake: shape to ~90–95% of real link rate and use diffserv4 or diffserv8 depending on your class model.
- If you can’t: use HTB for shaping + class guarantees, and put fq_codel (or fq) under each class for fairness.
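A minimal sketch of that fallback. The device name and rates are placeholders (they mirror the example stats in Task 11), and the DSCP values follow the small vocabulary above:

```shell
# Sketch: HTB shaping with fq_codel under each class.
DEV=eth0
sudo tc qdisc replace dev "$DEV" root handle 1: htb default 20
sudo tc class add dev "$DEV" parent 1:  classid 1:1  htb rate 900mbit ceil 900mbit
sudo tc class add dev "$DEV" parent 1:1 classid 1:10 htb rate 50mbit  ceil 200mbit   # interactive
sudo tc class add dev "$DEV" parent 1:1 classid 1:20 htb rate 200mbit ceil 900mbit   # default
sudo tc class add dev "$DEV" parent 1:1 classid 1:30 htb rate 50mbit  ceil 900mbit   # bulk
sudo tc qdisc add dev "$DEV" parent 1:10 fq_codel
sudo tc qdisc add dev "$DEV" parent 1:20 fq_codel
sudo tc qdisc add dev "$DEV" parent 1:30 fq_codel
# Steer AF21 (tos 0x48) to interactive and CS1 (tos 0x20) to bulk;
# everything unmatched falls into the default class 1:20.
sudo tc filter add dev "$DEV" parent 1: protocol ip prio 1 u32 match ip dsfield 0x48 0xfc flowid 1:10
sudo tc filter add dev "$DEV" parent 1: protocol ip prio 2 u32 match ip dsfield 0x20 0xfc flowid 1:30
```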
Why “priority” is the wrong default
Strict priority is appropriate for a small, policed realtime class. It is not a general “make this faster” button. If you make everything high priority, you’ve invented best-effort again, just with more steps and worse failure modes.
Storage-specific note: replication is bulk, but it’s also a pager risk
Replication traffic is typically bulk and tolerant of latency, but it can create secondary incidents when it saturates a link and causes timeouts elsewhere. Treat it as bulk with a reasonable minimum rate and a maximum cap. You want it reliable, not dominant.
Practical tasks: commands, outputs, decisions (12+)
These are the tasks you actually run during design, rollout, and debugging. Each includes what the output means and the decision you make from it.
Task 1: Confirm link speed/duplex and spot autonegotiation pain
cr0x@server:~$ sudo ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10000baseT/Full
Advertised auto-negotiation: Yes
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Link detected: yes
Meaning: If you see 100Mb/s or Half duplex, your “QoS problem” is actually a physical/link negotiation problem.
Decision: Fix link speed/duplex first. QoS cannot save a half-duplex link from collisions and tears.
Task 2: Check interface errors and drops (driver/NIC/path issues)
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 123456 0 234 0 1234
TX: bytes packets errors dropped carrier collsns
876543210 234567 0 12 0 0
Meaning: RX drops can indicate upstream congestion or host inability to process packets. TX drops often indicate qdisc drops or ring buffer pressure.
Decision: If errors/carrier issues exist, fix physical/driver. If drops correlate with load, move on to queue stats and shaping.
Task 3: Identify the current qdisc and whether you already have FQ/AQM
cr0x@server:~$ tc -s qdisc show dev eth0
qdisc mq 0: root
qdisc fq_codel 8012: parent :12 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
Sent 123456789 bytes 234567 pkt (dropped 123, overlimits 0 requeues 10)
backlog 0b 0p requeues 10
maxpacket 1514 drop_overlimit 123 new_flow_count 456 ecn_mark 789
Meaning: You have fq_codel active. Drops indicate queue pressure; ECN marks indicate AQM is signaling congestion without dropping (if endpoints support it).
Decision: If you see pfifo_fast or no AQM and you have latency spikes, enabling fq_codel/cake on the bottleneck is a high-leverage move.
Task 4: Measure queueing delay symptoms quickly with ping under load
cr0x@server:~$ ping -c 20 -i 0.2 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.42 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.39 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=35.12 ms
64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=42.77 ms
--- 10.0.0.1 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 3805ms
rtt min/avg/max/mdev = 0.36/6.88/42.77/13.10 ms
Meaning: Max RTT jumps to tens of ms while min stays sub-ms: classic queue buildup (bufferbloat) during contention.
Decision: Add shaping + AQM at the egress bottleneck. Don’t “prioritize” randomly; fix the queue.
Task 5: Check route/path changes (because not all latency is congestion)
cr0x@server:~$ ip route get 8.8.8.8
8.8.8.8 via 192.0.2.1 dev eth0 src 192.0.2.10 uid 0
cache
Meaning: Confirms which gateway and interface are in play.
Decision: If the “slow” traffic uses a different interface/tunnel than you assumed, your QoS policy might be on the wrong egress entirely.
Task 6: Verify DSCP markings in real traffic (don’t trust configs)
cr0x@server:~$ sudo tcpdump -i eth0 -vv -c 5 'udp and (port 5060 or portrange 16384-32767)'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:00:01.123456 IP (tos 0xb8, ttl 64, id 1234, offset 0, flags [DF], proto UDP (17), length 214) 192.0.2.50.40000 > 192.0.2.60.20000: UDP, length 172
12:00:01.123789 IP (tos 0x00, ttl 64, id 1235, offset 0, flags [DF], proto UDP (17), length 214) 192.0.2.51.40002 > 192.0.2.60.20002: UDP, length 172
Meaning: One stream is EF (tos 0xb8), another is CS0 (tos 0x00). Your endpoints are inconsistent or policy is rewriting.
Decision: Decide whether to enforce marking at the boundary (recommended) and whether to trust endpoint markings for these sources.
Task 7: Check whether the host is rewriting DSCP (or not)
cr0x@server:~$ sysctl net.ipv4.tcp_ecn net.ipv4.conf.all.rp_filter
net.ipv4.tcp_ecn = 2
net.ipv4.conf.all.rp_filter = 1
Meaning: tcp_ecn = 2 means the host accepts ECN when peers request it but does not initiate it on outgoing connections (1 would request it too). rp_filter is strict (1), which can break asymmetric routing and cause “random” drops.
Decision: If you’re doing policy-based routing, multi-homing, or certain tunnels, consider loosening rp_filter and validating DSCP behavior end-to-end.
Task 8: Inspect iptables mangle rules for marking (legacy but common)
cr0x@server:~$ sudo iptables -t mangle -S
-P PREROUTING ACCEPT
-P OUTPUT ACCEPT
-A OUTPUT -p udp --dport 53 -j DSCP --set-dscp-class AF21
-A OUTPUT -p tcp --dport 22 -j DSCP --set-dscp-class AF21
-A OUTPUT -p tcp --dport 873 -j DSCP --set-dscp-class CS1
Meaning: DNS/SSH marked interactive; rsync marked scavenger. This is behavior-based and sane.
Decision: Keep it small. Avoid giant port lists. If apps move ports (hello, QUIC), you’ll lose classification anyway.
Task 9: Inspect nftables for marking (the modern default)
cr0x@server:~$ sudo nft list ruleset
table inet mangle {
chain output {
type route hook output priority mangle; policy accept;
udp dport 53 ip dscp set af21
tcp dport { 22, 443 } ip dscp set af21
tcp dport 2049 ip dscp set cs1
}
}
Meaning: Output chain marks traffic by port; NFS (2049) set to CS1 here (questionable unless you truly want it scavenger).
Decision: Challenge assumptions. Storage traffic might need a minimum share, not scavenger status, depending on your failure modes.
Task 10: Validate shaping is actually active and at the right rate
cr0x@server:~$ sudo tc qdisc show dev eth0
qdisc htb 1: root refcnt 2 r2q 10 default 30 direct_packets_stat 0 direct_qlen 1000
qdisc fq_codel 10: parent 1:10 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
qdisc fq_codel 20: parent 1:20 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
qdisc fq_codel 30: parent 1:30 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
Meaning: HTB root with three classes, fq_codel in each. This is a solid baseline.
Decision: Now verify class rates/ceilings and that default traffic goes to the intended class.
Task 11: Read HTB class stats to see who is winning and who is starving
cr0x@server:~$ sudo tc -s class show dev eth0
class htb 1:10 root rate 50Mbit ceil 200Mbit burst 15Kb cburst 15Kb
Sent 98765432 bytes 123456 pkt (dropped 0, overlimits 0 requeues 0)
class htb 1:20 root rate 200Mbit ceil 900Mbit burst 15Kb cburst 15Kb
Sent 876543210 bytes 234567 pkt (dropped 0, overlimits 0 requeues 0)
class htb 1:30 root rate 50Mbit ceil 900Mbit burst 15Kb cburst 15Kb
Sent 4567890123 bytes 3456789 pkt (dropped 4321, overlimits 9876 requeues 123)
Meaning: Class 1:30 (likely bulk/default depending on mapping) is hitting overlimits and drops, indicating it’s being constrained and queueing. Others are fine.
Decision: If 1:30 is bulk, good. If it’s actually default user traffic, you misclassified and are punishing the wrong thing.
Task 12: Check conntrack and see if a “small number of flows” dominate
cr0x@server:~$ sudo conntrack -S
cpu=0 found=123456 invalid=12 ignore=0 insert=34567 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=123
Meaning: Not a QoS metric directly, but invalid/insert_failed spikes can indicate host overload causing drops that look like “network QoS problems.”
Decision: If conntrack is stressed, fix that (tuning, offload, capacity) before blaming queue disciplines.
Task 13: Confirm NIC queueing and multiqueue behavior
cr0x@server:~$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 8
Meaning: 8 combined channels suggests multi-queue; qdisc configuration must account for mq. Some shaping setups only affect one queue unless applied correctly.
Decision: If you’re shaping, ensure it’s at the root (or use an IFB/ingress strategy if needed). Misapplied shaping is silent failure.
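The IFB strategy mentioned above, sketched (interface names and the 450 Mbit rate are assumptions): inbound packets can’t be delayed on the real NIC, but they can be redirected through an IFB device and scheduled there.

```shell
# Sketch: redirect ingress traffic through ifb0 and shape it there.
sudo modprobe ifb
sudo ip link add ifb0 type ifb
sudo ip link set ifb0 up
sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: protocol all prio 1 \
  matchall action mirred egress redirect dev ifb0
sudo tc qdisc add dev ifb0 root cake bandwidth 450mbit
```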
Task 14: Find the “real” egress in a host with VLANs/bonds
cr0x@server:~$ ip -d link show bond0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:aa:bb:cc brd ff:ff:ff:ff:ff:ff
bond mode 802.3ad miimon 100 updelay 0 downdelay 0
Meaning: Bonded interface. If you attach qdisc to the wrong layer (bond vs VLAN subif vs physical), results vary.
Decision: Enforce QoS at the interface that actually queues. Often that’s the bond or the physical member depending on driver and offloads.
Task 15: Confirm whether GRO/GSO/TSO offloads are masking queue behavior
cr0x@server:~$ sudo ethtool -k eth0 | egrep 'gro|gso|tso'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
Meaning: Offloads can reduce CPU and change packetization; shaping and classification based on packet sizes can behave differently.
Decision: Usually leave offloads on for servers, but if you’re debugging weird qdisc behavior, test with them toggled (carefully) in a maintenance window.
Task 16: Prove that a device in the path is bleaching DSCP
cr0x@server:~$ sudo tcpdump -i eth0 -vv -c 3 'icmp and host 198.51.100.10'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:05:00.000001 IP (tos 0x88, ttl 64, id 7000, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.10 > 198.51.100.10: ICMP echo request, id 1, seq 1, length 64
12:05:00.010001 IP (tos 0x00, ttl 51, id 8000, offset 0, flags [none], proto ICMP (1), length 84) 198.51.100.10 > 192.0.2.10: ICMP echo reply, id 1, seq 1, length 64
Meaning: You sent with tos 0x88 (DSCP AF41, ECN bits clear); the reply came back CS0. Not definitive by itself, but if you also capture on the far side and see reset DSCP, something in the path rewrites.
Decision: Stop relying on DSCP across that boundary. Use shaping/fairness that doesn’t depend on preserved markings.
Three corporate mini-stories from the QoS trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized fintech had a “voice VLAN” and a long-standing belief: voice packets are sacred. They rolled out a new softphone client and a new switch model in the same quarter. The client marked media as EF, signaling “expedited forwarding.” The switches were configured to trust DSCP on access ports because that’s what the old phones did.
Then an unrelated team deployed a high-throughput telemetry agent that, due to a vendor default, also marked its UDP streams as EF. Nobody noticed because everything worked fine at low load, and the telemetry platform’s dashboards looked impressively “real time.”
On a Monday morning, a backlog in telemetry caused sustained EF-marked traffic. The distribution switches had strict priority queues for EF with no policing—because voice “never uses that much bandwidth.” That assumption was true for desk phones. It was false for software on general-purpose hosts.
The result was surgical: voice calls sounded okay, telemetry streamed like a champ, and everything else degraded. DNS slowed. API calls timed out. The helpdesk blamed the firewall, the firewall team blamed the ISP, and the ISP asked for packet captures that nobody could interpret under pressure.
The fix was boring and immediate: stop trusting DSCP on general access ports, remark EF only for authenticated voice endpoints, and police EF to a maximum percentage per port. Voice stayed clean, telemetry still worked, and default traffic stopped suffocating.
Mini-story 2: The optimization that backfired
A retail company had a private WAN with multiple sites and a nightly backup window. Backups were chewing up bandwidth, so the network team created a bulk class and aggressively capped it. Great intent: protect business traffic during the day. They deployed it everywhere, including the data center uplinks.
Backups slowed down, which everyone expected. Then a few days later, a database replica began lagging. Not enough to alert immediately, but enough to push replication behind. A failover test happened that week (routine, supposedly), and it took longer than anyone was comfortable admitting.
What happened: the replication and backup traffic shared ports and patterns in a way the classifier couldn’t distinguish. The “bulk cap” applied to more than backups. The system didn’t break loudly; it degraded quietly. That’s the worst kind.
They tried to “fix” it by creating more classes: one for backups, one for replication, one for storage metadata, one for everything else. It got worse because the classifier accuracy dropped and the policy became impossible to reason about. A month later, nobody was sure which class a given flow belonged to, which meant nobody could predict behavior under contention.
The real fix: reduce class count, classify replication by endpoint identity (backup servers vs replication nodes), and give replication a minimum share with a reasonable ceiling. Also: shape at the WAN edges, not at random data center uplinks where congestion wasn’t the bottleneck.
Mini-story 3: The boring but correct practice that saved the day
A SaaS company ran multi-tenant Kubernetes clusters with periodic “noisy neighbor” incidents. Early on, they decided on a dull rule: no tenant-controlled DSCP. All pod egress was remarked at the node based on namespace and a small set of service labels. They documented the mapping and enforced it in code review.
They also shaped the internet egress on each cluster’s NAT gateway to slightly below the provider’s known policer rate. Under that shaping, they used fq_codel for fairness and kept exactly four classes: realtime (rare), interactive, default, bulk. Bulk had a ceiling, interactive had a small guarantee, and realtime was strictly policed.
One day, a tenant pushed a “data export” feature that started saturating egress. Support saw a spike in UI latency and feared a repeat incident. But the graphs told the story clearly: bulk class utilization pinned at its ceiling, interactive stayed stable, and default degraded only slightly.
The on-call didn’t scramble to invent new priorities. They told the tenant, honestly, “Your job is in bulk class, it’s capped by design, and it’ll finish in N hours.” Then they helped the tenant schedule exports and offered a higher tier with higher bulk ceilings. No crisis, no midnight policy change, no mystery.
This is what “QoS that works” looks like: a policy you can explain to a tired engineer at 3 a.m., with instrumentation that confirms the policy is doing what you intended.
Common mistakes: symptom → root cause → fix
1) “QoS is enabled but latency still spikes during uploads”
Symptom: p95 latency jumps when someone saturates uplink; packet loss appears at ISP edge; local device shows few drops.
Root cause: The queue is upstream (provider policer / cable modem / unmanaged CPE). You’re not controlling the bottleneck queue.
Fix: Shape egress to below the real enforced rate (often 90–95%). Use cake or HTB+fq_codel so your device owns the queue.
2) “We set everything to high priority and now nothing works”
Symptom: Default traffic starves; random timeouts; voice is fine; monitoring says link isn’t full.
Root cause: Strict priority queue is unpoliced; mis-marked traffic floods it and starves other queues.
Fix: Police/cap the strict priority class. Reduce who can mark EF. Consider WRR/DRR style scheduling instead of strict priority except for small realtime.
3) “After enabling QoS, throughput dropped by 30%”
Symptom: Bulk transfers slower than expected even with empty link; CPU on router/host increased.
Root cause: Shaping rate set too low, or qdisc features (cake/fq_codel) running on underpowered hardware, or offload interactions.
Fix: Validate real link rate; adjust shaper; ensure hardware capacity; consider moving shaping to a box that can handle it; keep rules simple.
4) “Only some applications get the intended treatment”
Symptom: Same service behaves differently across hosts/subnets; packet captures show DSCP inconsistent.
Root cause: Multiple marking points (endpoints + firewall + switch) rewriting DSCP; inconsistent trust boundary.
Fix: Choose one marking authority per boundary. Document it. Enforce: either trust managed endpoints or remark at ingress to your domain.
5) “QoS works on wired but Wi‑Fi users still complain”
Symptom: Wired calls fine; Wi‑Fi calls jittery; packet loss spikes with more clients.
Root cause: Airtime contention and retries dominate; WMM mapping wrong or not enforced; AP buffers bloated.
Fix: Tune Wi‑Fi separately: enable WMM, ensure DSCP-to-AC mapping is correct, reduce bufferbloat on WLAN edge, prefer 5/6 GHz, manage client density.
6) “We marked traffic but switches ignore it”
Symptom: DSCP present on packets but no change in behavior; switch QoS counters don’t move.
Root cause: DSCP trust disabled on ports, or DSCP-to-queue mapping not configured, or hardware limitations.
Fix: Confirm QoS is enabled globally; configure trust boundary; map DSCP/PCP to queues; verify with counters and test congestion.
7) “Interactive traffic is still laggy under bulk load”
Symptom: SSH stalls when backups run; overall utilization not maxed.
Root cause: No fairness at bottleneck queue (pfifo), or interactive is in same queue as bulk, or class has no minimum share.
Fix: Use fq_codel/cake; ensure interactive class has at least a small guaranteed rate; verify classification with packet capture.
Checklists / step-by-step plan
Step-by-step: implement QoS you can defend in a review
- Identify the bottleneck link(s). Internet egress, WAN circuits, VPN tunnels, Wi‑Fi controller uplinks. If you don’t know, measure utilization and latency under load.
- Decide your class model (3–5 classes). Write a policy sentence for each: min share, max cap, and why.
- Define trust boundaries. Where DSCP/PCP is trusted; where it is overwritten; where it is ignored.
- Choose queueing/shaping mechanism per boundary. Cake if available; otherwise HTB+fq_codel. Don’t mix exotic schedulers unless you can observe and rollback.
- Set shaping rate below enforced rate. Start at 90–95% of measured sustainable throughput; adjust after verification.
- Police strict priority classes. EF gets a cap. Always. Otherwise you’re writing an outage into your config.
- Implement classification at one place. Prefer node/ToR/edge over application. Keep rules short and identity-based when possible.
- Instrument. Track class utilization, drops, ECN marks, queue backlog, and end-user latency SLOs.
- Test under controlled contention. Generate bulk load; verify interactive/realtime remain stable; capture packets to confirm markings and mappings.
- Document the “why.” Include examples: “Backups are CS1, capped at X, scheduled after hours.” Future you will forget. Present you will resign.
- Roll out gradually. One site/link at a time. Keep a rollback command ready.
- Run a game day. Intentionally saturate; confirm that the system fails gracefully and predictably.
Checklist: what to capture during a QoS incident
- Interface stats (bytes, drops, errors) on suspected egress.
- qdisc stats (drops, overlimits, ECN marks, backlog).
- Latency samples (min/avg/max) during idle and during load.
- Packet capture showing DSCP/PCP for at least one affected flow.
- Path verification (route, tunnel state, NAT gateway used).
- Change history: what was deployed/changed in the last 24 hours.
FAQ
1) Should I prioritize TCP ACKs?
Sometimes. On very asymmetric links (tiny uplink, big downlink), prioritizing ACKs can improve downstream throughput and reduce stalls. But don’t guess: first confirm uplink saturation is causing downstream collapse. If you use cake, it already handles a lot of this well.
2) Is DSCP enough, or do I need shaping?
DSCP alone is a request. Shaping is enforcement. If you don’t control the bottleneck queue, DSCP is mostly vibes. Use DSCP for classification within your domain, but assume it may be bleached outside.
3) Why not just make everything “AF41” and call it a day?
Because contention still exists. If everything is premium, nothing is. Worse, you lose the ability to quarantine bulk traffic, which is the main win for user experience.
4) Can QoS fix packet loss from a bad cable or optics?
No. That’s not congestion; it’s physics, hardware, or configuration. Fix layer 1/2 issues first. QoS can only manage contention on a working link.
5) How many classes should I run?
Three to five for most organizations. More classes increase operational risk: misclassification, unexpected starvation, and policies nobody can reason about during incidents.
6) Where should I mark traffic in Kubernetes?
At the node boundary (host network) based on namespace/workload identity, not inside pods. Pods are too easy to misconfigure, and multi-tenant clusters demand a hard trust boundary.
7) Does ECN replace QoS?
No. ECN helps endpoints respond to congestion without drops, but it doesn’t define who gets bandwidth under contention. ECN plus FQ/AQM is excellent; ECN alone doesn’t allocate fairly across traffic types.
8) What’s the simplest “make it better today” change?
Enable fq_codel (or cake) on the real bottleneck egress and shape slightly below the enforced rate. This often reduces latency spikes dramatically without elaborate classification.
9) Should storage replication ever be “high priority”?
Rarely. Replication needs reliability and a minimum share, not strict priority. If you prioritize it above interactive traffic, you’re optimizing for the wrong kind of uptime.
10) How do I prove QoS is working?
Reproduce contention and show: (a) class counters change as expected, (b) interactive latency remains stable, (c) bulk takes longer but completes, (d) drops/queue delay are controlled at your shaper, not upstream.
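A crude but honest version of that proof, as a shell sketch (10.0.0.1, eth0, and the flow counts are placeholders; an iperf3 server must be listening on the target):

```shell
# Sketch: sample RTT idle, then again under synthetic bulk load.
# If the shaper owns the queue, loaded max RTT should stay bounded,
# and drops/ECN marks should show up at your qdisc, not upstream.
ping -c 50 -i 0.2 10.0.0.1 | tail -2          # baseline RTT stats
iperf3 -c 10.0.0.1 -t 20 -P 4 > /dev/null &   # 4 parallel bulk flows
ping -c 50 -i 0.2 10.0.0.1 | tail -2          # RTT under contention
wait
tc -s qdisc show dev eth0 | grep -E 'dropped|ecn_mark|backlog'
```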
Conclusion: next steps you can do this week
If your QoS strategy is “set higher priority,” you’re one mis-marked flow away from a strange and time-consuming incident. The alternative is not complicated, just disciplined: control the bottleneck queue, keep classes few, enforce trust boundaries, and measure what your scheduler is actually doing.
Practical next steps:
- Pick one known bottleneck (internet egress or WAN edge) and implement shaping at 90–95% of observed sustained throughput.
- Enable fq_codel or cake on that egress and capture qdisc stats before/after under load.
- Define a minimal DSCP plan (CS0, one AF for interactive, EF for small policed realtime, CS1 for scavenger) and document where you trust it.
- Run a controlled saturation test and prove interactive latency stays sane.
- Write down the rollback plan. Not because you will fail, but because you want to sleep.
Do this, and QoS becomes what it should have been all along: predictable behavior during contention, not a network Ouija board.