Proxmox VLAN Not Working: Trunk Ports, Linux Bridge Tagging, and “No Network” Fixes

You tagged the VM. You tagged the bridge. You tagged the switch. And your guest still can’t ping its gateway—except maybe on Fridays, and only from one node. Welcome to VLAN troubleshooting in Proxmox: half networking, half archaeology, and one part “why does this only break after reboot?”

This is a practical, production-minded guide to finding the actual fault line: trunk ports that aren’t really trunks, Linux bridge VLAN filtering that’s half-enabled, a PVID that quietly rewrites frames, and the classic Proxmox “no network” symptom that’s really an ARP or MTU failure in disguise.

The mental model that stops the guessing

When VLANs “don’t work” in Proxmox, it’s rarely mysterious. It’s usually one of these:

  • Frames aren’t tagged when you think they are. The VM sends untagged, the bridge expects tagged, or vice versa.
  • Frames are tagged, but dropped. The bridge VLAN table, switch allowed list, or upstream ACL drops them.
  • Frames pass, but L2/L3 breaks. ARP isn’t answered, gateway is on a different VLAN, or you’re hitting asymmetric routing.
  • Everything works… until it doesn’t. A reboot flips interface order, VLAN filtering toggles, LACP renegotiates, or the switch port template “helpfully” resets.

Think in layers, not vibes:

  1. L1: link up, speed/duplex, optics, bonding status.
  2. L2: VLAN tags, bridge VLAN table, MAC learning, STP state.
  3. L3: IP addressing, gateway, ARP/ND, routing, firewall.

Proxmox complicates this because it blends a server’s host networking (management, storage) with a hypervisor’s guest switching (VM NICs). Linux bridges are real switches with real forwarding tables. Treat them that way.

One quote worth keeping on your wall: “Hope is not a strategy” (often attributed to General Gordon R. Sullivan). Networking doesn’t reward optimism.

Fast diagnosis playbook (first/second/third)

First: prove whether tags exist on the wire

Your first job is to stop theorizing and observe reality.

  • On the Proxmox host, run tcpdump on the physical NIC or bond and look for vlan in frames.
  • If tags are missing: it’s a Proxmox/bridge/VM config problem.
  • If tags are present: it’s a switch trunk/allowed VLAN/PVID/native VLAN problem, or upstream policy.

Second: confirm bridge VLAN filtering and membership

Linux bridges can either ignore VLAN tables (classic mode) or enforce them (VLAN-aware). Mixing expectations here causes “no network” with perfect-looking configs.

  • Check vlan_filtering on the bridge and the VLAN membership per port.
  • Verify which port is the uplink, which is the VM tap, and what VLAN IDs are allowed.

Third: validate L3 basics and the boring stuff

Once L2 is right, the remaining failures are usually:

  • Wrong gateway or wrong subnet mask in the guest.
  • ARP failure due to firewall, duplicate IP, or upstream device answering ARP on the wrong VLAN.
  • MTU mismatch—especially with VLAN + bond + storage networks.

If you only remember one thing: the fastest path is “is the tag present?” → “is it permitted?” → “is ARP answered?”
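
That triage order can be written down as a tiny decision helper. This is a sketch, not a PVE tool: the function name and yes/no inputs are mine, and the inputs come from the observations this playbook describes (tcpdump on the uplink, bridge vlan show, an ARP capture).

```shell
#!/bin/sh
# Hypothetical triage helper: encode "tag present? -> permitted? -> ARP answered?"
# Inputs are yes/no observations; output is which layer to investigate next.
vlan_triage() {
  tag_on_wire=$1   # did tcpdump on the uplink show "vlan N"?
  permitted=$2     # is the VLAN in `bridge vlan show` AND the switch allowed list?
  arp_answered=$3  # did the guest's ARP request get a reply?
  if [ "$tag_on_wire" != "yes" ]; then
    echo "L2 tagging: check VM NIC tag, bridge-vlan-aware, guest config"
  elif [ "$permitted" != "yes" ]; then
    echo "L2 permission: check bridge VLAN table and switch trunk allowed list"
  elif [ "$arp_answered" != "yes" ]; then
    echo "L2/L3 boundary: check gateway VLAN, ARP inspection, host firewall"
  else
    echo "L3+: check guest IP/mask/gateway, routing, MTU, upstream policy"
  fi
}
```

For example, `vlan_triage yes yes no` points you at the gateway/ARP path instead of the bridge config you already verified.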

Interesting facts and context (yes, it matters)

These aren’t trivia for trivia’s sake. Each one maps to a real failure mode I’ve seen in production.

  1. 802.1Q VLAN tagging adds 4 bytes to an Ethernet frame. That tiny header is why MTU problems show up “only on VLANs.”
  2. Linux bridging is decades old and predates many modern “SDN” wrappers. Under Proxmox, it’s still the Linux kernel doing switching.
  3. VLAN-aware bridge mode is relatively new compared to classic bridging and behaves like a managed switch: if a VLAN isn’t in the table, it can be dropped silently.
  4. The “native VLAN” concept is a switch convention, not a VLAN feature. Misaligned native VLANs cause untagged traffic to land in the wrong place without any errors.
  5. Early VLAN deployments were often driven by broadcast containment, not security. Today people treat VLANs like security zones; that’s only true if you enforce L3/L4 policy.
  6. LACP doesn’t “load balance” per packet by default on many systems; it hashes flows. That’s why “one VM is slow, one is fine” happens on bonded trunks.
  7. STP and VLANs are intertwined: depending on switch mode (PVST/RPVST/MST), a VLAN can be blocked while others forward, creating “only VLAN 30 is dead.”
  8. ARP is chatty and fragile. Many “VLAN broken” incidents are actually ARP replies filtered by host firewall, upstream security, or duplicate IPs.
  9. Proxmox historically leaned on ifupdown-style config; newer systems use ifupdown2 and can apply changes live, but not all changes are equally safe without a maintenance window.
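
Fact 1 is worth doing the arithmetic on once. A minimal sketch of the size math (the helper names are mine; the constants are standard Ethernet, 802.1Q, IPv4, and ICMP header sizes):

```shell
#!/bin/sh
# Size arithmetic behind "MTU problems only show up on VLANs".
tagged_frame_bytes() {  # L3 MTU -> on-wire frame size with one 802.1Q tag
  echo $(( $1 + 14 + 4 + 4 ))   # 14 Ethernet header + 4 FCS + 4 VLAN tag
}
df_ping_payload() {     # L3 MTU -> max payload for `ping -M do -s`
  echo $(( $1 - 20 - 8 ))       # 20 IPv4 header + 8 ICMP header
}
echo "1500 MTU tagged frame: $(tagged_frame_bytes 1500) bytes"  # 1522
echo "max DF ping payload:   $(df_ping_payload 1500) bytes"     # 1472
```

That 1522-byte frame is why switch ports must accept slightly oversized (“baby giant”) frames on trunks, and 1472 is the magic payload number for do-not-fragment ping tests later in this article.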

Hands-on tasks: commands, outputs, decisions (12+)

These are the tasks I run, in roughly this order, when a VM says “no network” on a VLAN. Each one includes: command, what typical output means, and what decision you make next.

Task 1: Confirm the host sees link and which NIC is which

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
enp3s0           UP             3c:ec:ef:11:22:33 <BROADCAST,MULTICAST,UP,LOWER_UP>
enp4s0           DOWN           3c:ec:ef:11:22:34 <BROADCAST,MULTICAST>
bond0            UP             3c:ec:ef:aa:bb:cc <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
vmbr0            UP             3c:ec:ef:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP>

Meaning: bond0 and vmbr0 are up; one physical NIC is down. If you thought you had two active links, you don’t.

Decision: Fix link/bond health before touching VLAN config. VLAN debugging on a flapping uplink is self-harm.

Task 2: Check bonding health (LACP vs active-backup and which slave is active)

cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp3s0
MII Status: up
Actor Churn State: churned
Partner Churn State: churned

Slave Interface: enp4s0
MII Status: down

Meaning: LACP is configured, but one slave is down and churn states suggest negotiation changes.

Decision: Check switch LACP config and cabling. If the switch expects a static trunk but host is LACP, you’ll get intermittent VLAN weirdness.

Task 3: Inspect Proxmox network config (source of truth)

cr0x@server:~$ sed -n '1,200p' /etc/network/interfaces
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        bond-slaves enp3s0 enp4s0
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.11/24
        gateway 10.10.10.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

Meaning: VLAN-aware bridge enabled; uplink is bond0; the management IP sits untagged on vmbr0, so it lands in whatever VLAN the switch maps untagged traffic to (its native VLAN).

Decision: If your management network is supposed to be VLAN-tagged, move it to a subinterface (e.g., vmbr0.10) or set correct PVID expectations. Untagged management on a trunk is how you get surprised later.

Task 4: Confirm bridge VLAN filtering is actually on

cr0x@server:~$ ip -d link show vmbr0
7: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:aa:bb:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    bridge forward_delay 0 hello_time 2 max_age 20 ageing_time 300 stp_state 0 priority 32768 vlan_filtering 1 vlan_protocol 802.1Q

Meaning: vlan_filtering 1 means the bridge will enforce VLAN membership rules.

Decision: You must verify the VLAN table next. In VLAN-aware mode, “it’s connected to vmbr0” is not enough.

Task 5: Dump VLAN membership per bridge port

cr0x@server:~$ bridge vlan show
port              vlan-id
bond0             1 PVID Egress Untagged
bond0             10
bond0             20
tap100i0          10 PVID Egress Untagged
tap101i0          20 PVID Egress Untagged

Meaning: The uplink bond0 has VLANs 10 and 20 tagged, and VLAN 1 untagged (native). VM tap ports are untagged members with a PVID set per VM.

Decision: If the VM is meant to tag itself (guest VLAN tagging), then its tap port should typically carry VLANs tagged, not be PVID-untagged. Decide who tags: host or guest. Don’t do both.
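
For diagnosis, you can flip a tap port from PVID-untagged to tagged membership by hand with iproute2. Treat this as a temporary experiment: Proxmox rebuilds tap VLAN membership from the VM config when the NIC is restarted or the VM migrates, so the persistent fix belongs in the VM's NIC settings (e.g., removing the tag when the guest tags itself).

```
cr0x@server:~$ sudo bridge vlan del dev tap100i0 vid 10    # drop the untagged/PVID membership
cr0x@server:~$ sudo bridge vlan add dev tap100i0 vid 10    # re-add VLAN 10 as tagged
cr0x@server:~$ sudo bridge vlan add dev tap100i0 vid 20    # allow a second tagged VLAN
cr0x@server:~$ bridge vlan show dev tap100i0               # verify the new membership
```

Without `pvid` and `untagged` flags, `bridge vlan add` creates tagged membership, which is what a guest-tagging VM needs.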

Task 6: Identify which tap device maps to which VM

cr0x@server:~$ qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 app-01               running    4096              32.00 2101
       101 db-01                running    8192             128.00 2255

Meaning: VMIDs 100/101 likely correspond to tap100i0 and tap101i0.

Decision: If the wrong VM is in the wrong VLAN, verify Proxmox VM hardware config: VLAN tag on the NIC, bridge selection, firewall flag.

Task 7: Confirm the VM NIC is tagged as expected in Proxmox

cr0x@server:~$ qm config 100 | sed -n '1,120p'
boot: order=scsi0;net0
memory: 4096
name: app-01
net0: virtio=DE:AD:BE:EF:10:00,bridge=vmbr0,tag=10,firewall=0
scsi0: local-lvm:vm-100-disk-0,size=32G

Meaning: Proxmox is doing the tagging for VM 100 (VLAN 10). The guest should be configured as an untagged NIC.

Decision: If the guest also tags VLAN 10 internally, you’ve just double-tagged. Fix one side.

Task 8: Capture traffic on the uplink and verify VLAN tags

cr0x@server:~$ sudo tcpdump -eni bond0 -c 8 '(arp or icmp) and (vlan 10)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:01:10.112233 DE:AD:BE:EF:10:00 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), vlan 10, ethertype ARP (0x0806), Request who-has 10.10.10.1 tell 10.10.10.50, length 46
12:01:10.113005 00:11:22:33:44:55 > DE:AD:BE:EF:10:00, ethertype 802.1Q (0x8100), vlan 10, ethertype ARP (0x0806), Reply 10.10.10.1 is-at 00:11:22:33:44:55, length 46

Meaning: VLAN 10 tags are present on the wire. ARP is answered. L2 is likely fine for this VLAN.

Decision: If the VM still can’t reach anything, look at guest IP config, host firewall, or routing upstream. If you see ARP requests but no replies, check switch VLAN membership or gateway presence.

Task 9: Capture on the VM tap to see what the guest “sees”

cr0x@server:~$ sudo tcpdump -eni tap100i0 -c 6 arp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap100i0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:02:44.000111 DE:AD:BE:EF:10:00 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), Request who-has 10.10.10.1 tell 10.10.10.50, length 46
12:02:44.000890 00:11:22:33:44:55 > DE:AD:BE:EF:10:00, ethertype ARP (0x0806), Reply 10.10.10.1 is-at 00:11:22:33:44:55, length 46

Meaning: The guest-facing tap is untagged (expected, since Proxmox tags on egress to the uplink). ARP works.

Decision: If ARP works here but not on bond0, tagging/bridge VLAN rules are wrong. If ARP works on both, focus on L3.

Task 10: Verify MAC learning and forwarding database (FDB) per VLAN

cr0x@server:~$ bridge fdb show br vmbr0 | head -n 12
00:11:22:33:44:55 dev bond0 vlan 10 master vmbr0
de:ad:be:ef:10:00 dev tap100i0 master vmbr0
de:ad:be:ef:20:00 dev tap101i0 master vmbr0

Meaning: The gateway MAC has been learned on VLAN 10 via bond0. The VM MACs are learned on their tap ports.

Decision: If the gateway MAC is missing, the switch/gateway may not be reachable on that VLAN. If VM MAC doesn’t show up, the VM isn’t transmitting or is connected to the wrong bridge.

Task 11: Check for host firewall or bridge netfilter surprises

cr0x@server:~$ sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.bridge.bridge-nf-call-arptables
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-arptables = 0

Meaning: Bridged traffic can traverse iptables rules. This is sometimes desired, sometimes a silent packet shredder.

Decision: If you don’t intentionally filter bridged traffic, consider disabling bridge-nf calls or auditing rules. If you do filter, confirm rules include VLAN interfaces/taps correctly.
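
If you decide bridged traffic should not traverse iptables, the conventional knob is a sysctl drop-in. A hedged sketch: the file name below is arbitrary, these values only exist while the br_netfilter module is loaded, and enabling the Proxmox firewall can manage them itself, so audit before flipping anything.

```
# /etc/sysctl.d/99-bridge-nf.conf  (file name is arbitrary)
# Only if you intentionally do NOT want iptables/ip6tables to see bridged frames.
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
```

Apply with `sysctl --system` and re-check the three values from the task above afterwards.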

Task 12: Validate MTU end-to-end (host, bridge, VM path)

cr0x@server:~$ ip -o link show | awk '{print $2, $4, $5}' | grep -E '^(bond0|vmbr0):'
bond0: mtu 1500
vmbr0: mtu 1500

Meaning: MTU is 1500. If your upstream is 9000 (jumbo) or your VM expects 1500 but something in between is 1496 due to odd settings, you’ll get weird fragmentation/PMTUD issues.

Decision: Pick one MTU per domain. For VLAN trunks carrying jumbo storage traffic, set MTU consistently on physical NICs, bonds, bridges, and switch ports. Otherwise stick to 1500 and sleep at night.

Task 13: Check routing and neighbor table on the host (for tagged host IPs)

If the Proxmox host itself has a VLAN subinterface (common for management), verify neighbor resolution.

cr0x@server:~$ ip neigh show dev vmbr0
10.10.10.1 lladdr 00:11:22:33:44:55 REACHABLE
10.10.10.50 lladdr de:ad:be:ef:10:00 STALE

Meaning: ARP resolution works on vmbr0. If you see FAILED entries, L2 is broken or filtered.

Decision: If neighbors fail only on a VLAN, revisit bridge VLAN table and switch allowed VLANs. If they fail across the board, check link/bond and switch port state.

Task 14: Verify the guest path isn’t blocked by Proxmox firewall flags

cr0x@server:~$ pvesh get /nodes/server/qemu/100/config | sed -n '1,40p'
boot: order=scsi0;net0
memory: 4096
name: app-01
net0: virtio=DE:AD:BE:EF:10:00,bridge=vmbr0,firewall=0,tag=10

Meaning: Firewall is off for the VM NIC. If it’s on, rules may be dropping DHCP/ARP/ICMP.

Decision: If enabling firewall, explicitly allow DHCP (67/68), ARP (L2), and required protocols, or you’ll debug “VLAN issues” that are really policy.

Proxmox network config patterns that actually work

There are two sane ways to do VLANs for VMs in Proxmox. Pick one. Mixing them is how you create a shrine to packet loss.

Pattern A: Proxmox tags per-VM NIC (recommended for most shops)

This is the “VLAN tag” field in the VM NIC config. The guest sees an untagged NIC. Proxmox puts it on a VLAN by tagging frames on the bridge uplink.

Why it works: You centralize VLAN assignment in the hypervisor. Guests stay simple. Migration between hosts is predictable.

Bridge requirements:

  • bridge-vlan-aware yes on the bridge
  • bridge-vids includes the VLAN IDs you need (or configured via bridge vlan table)
  • Switch port is a trunk carrying those VLANs

Pattern B: Guest tags VLANs (only when you have a reason)

Sometimes a VM is a router, firewall, or appliance that needs multiple VLANs on one NIC. Then the guest tags VLANs itself (e.g., eth0.10, eth0.20).

Host config: VM NIC must not be set to a tag. The tap port should allow tagged VLANs to pass, and you manage allowed VLANs on the bridge port and switch trunk.

Rule of thumb: If you’re not building a router/firewall VM, don’t do guest VLAN tagging. It creates needless complexity and makes “no network” incidents more creative than they need to be.
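
For completeness, this is what Pattern B looks like inside a Linux guest. The interface names and address are illustrative (192.0.2.0/24 is a documentation range, not from this article's examples); on the Proxmox side, recent versions can also restrict which tagged VLANs the tap accepts via the NIC's trunks option (e.g., trunks=10;20), if yours supports it.

```
# Inside the guest (Pattern B): the guest tags, so the VM NIC has no tag= set.
ip link add link eth0 name eth0.10 type vlan id 10
ip link add link eth0 name eth0.20 type vlan id 20
ip link set eth0.10 up
ip link set eth0.20 up
ip addr add 192.0.2.10/24 dev eth0.10    # example address, one per VLAN interface
```

Make these persistent in the guest's own network config; ad-hoc `ip` commands vanish on reboot.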

Host management IP on a VLAN: do it cleanly

If your Proxmox management network is VLAN-tagged (common in corporate environments), don’t rely on “untagged trunk equals the right VLAN.” Make it explicit with a VLAN interface on top of the bridge. Example:

cr0x@server:~$ sed -n '1,120p' /etc/network/interfaces
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        bond-slaves enp3s0 enp4s0
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 10 20 30

auto vmbr0.10
iface vmbr0.10 inet static
        address 10.10.10.11/24
        gateway 10.10.10.1

Operational payoff: When someone changes the switch “native VLAN,” your host doesn’t teleport into another network. It stays on VLAN 10 because it tags.

Short joke #1: A native VLAN is like “temporary firewall rules”—nobody remembers it exists until it ruins your day.

Don’t overcomplicate: one bridge per uplink is usually enough

People create multiple bridges (vmbr0, vmbr1, vmbr2) for each VLAN because it feels tidy. It also multiplies failure points, makes trunks harder to reason about, and makes migration more fragile.

Use a single VLAN-aware bridge for an uplink and tag at the VM NIC. Add more bridges only when you truly have separate physical domains or need separate MTUs or security properties.

Switch-side trunk sanity checks

Let’s be blunt: half of Proxmox VLAN tickets are switch tickets wearing a Linux hat. You can do everything right on the host and still lose because the trunk isn’t actually trunking.

What you need from the switch port

  • Mode trunk (or equivalent): the port must accept tagged frames.
  • Allowed VLAN list: include the VLANs your VMs need. “All VLANs” is easy but sometimes forbidden by policy.
  • Native VLAN alignment: if you use untagged traffic, define what VLAN it maps to. Better: avoid untagged except for explicit reasons.
  • LACP consistency: if you bond, both sides must agree on LACP vs static aggregation.
  • STP behavior: trunk ports can be blocked if STP sees loops. Watch for per-VLAN blocking modes in some switch families.

The two trunk mismatches that waste the most time

Mismatch 1: Allowed VLANs don’t include your VLAN. Your host tags VLAN 20; switch drops VLAN 20. On the host, you’ll see tags leaving but no replies. Classic.

Mismatch 2: Native VLAN mismatch. Your host sends untagged management traffic expecting VLAN 10; switch maps untagged to VLAN 1. Your host still has link, but it’s on the wrong planet.

Short joke #2: VLAN mismatches are the only place where “it works on my machine” translates to “it’s broken on the switch.”

Common mistakes: symptom → root cause → fix

1) Symptom: VM gets no DHCP lease on VLAN, but static IP also doesn’t ping gateway

Root cause: VLAN not allowed on the switch trunk or bridge VLAN table missing VLAN membership.

Fix: Add VLAN to switch allowed list and ensure bridge-vlan-aware yes plus VLAN present in bridge vlan show. Validate with tcpdump -eni bond0 vlan X.

2) Symptom: VM can ping gateway but not other subnets

Root cause: Wrong guest gateway, missing route upstream, or asymmetric routing caused by multi-homed gateways/firewalls.

Fix: Verify guest default route, then trace from gateway side. If you have multiple VLAN interfaces on a firewall, verify policy and routing tables per VLAN.

3) Symptom: Only one Proxmox node has working VLANs; others don’t

Root cause: Switch ports differ (allowed VLANs, native VLAN, LACP mode), or host configs drifted (bridge VLAN filtering enabled on one but not the other).

Fix: Compare /etc/network/interfaces, ip -d link, and bridge vlan show across nodes. Switch ports should use the same profile.

4) Symptom: VLAN works until VM live-migrates, then dies

Root cause: Destination node’s bridge VLAN table lacks the VLAN, or the uplink trunk differs.

Fix: Treat VLAN membership as cluster-wide intent: standardize bridge config and switch config per node. Test migrations with a canary VM on each VLAN.

5) Symptom: Management access to host disappears after enabling VLAN-aware bridge

Root cause: Host management IP was untagged on the bridge and relied on a native VLAN; enabling filtering changed handling, or the PVID/untagged VLAN isn’t what you thought.

Fix: Move management to a tagged subinterface (vmbr0.10) and ensure switch trunk allows VLAN 10. Schedule a maintenance window and use out-of-band access.

6) Symptom: Some traffic works (small pings), but large transfers stall or hang

Root cause: MTU mismatch with VLAN overhead, or PMTUD blocked by firewall.

Fix: Standardize MTU; allow ICMP fragmentation-needed messages if routing is involved. Test from the guest with ping -M do -s 1472 <gateway> (1472 = 1500 - 20 IPv4 - 8 ICMP headers); if that fails while smaller sizes pass, the path MTU is below 1500.

7) Symptom: VLAN 1 works, VLAN 10/20 don’t

Root cause: Switch trunk is actually an access port, or allowed VLAN list is wrong.

Fix: Reconfigure switch to trunk and allow VLANs. Verify tags on wire with tcpdump; if you never see VLAN tags leaving, your host/bridge tagging is wrong.

8) Symptom: ARP requests go out, but ARP replies never come back

Root cause: VLAN blocked on trunk, gateway not on that VLAN, or upstream security feature (dynamic ARP inspection, port security) dropping replies.

Fix: Confirm gateway interface VLAN; check switch security features; validate MAC learning. On host, use tcpdump to see if replies arrive on bond0.

Checklists / step-by-step plan

Step-by-step: get one VLAN working end-to-end (repeatable method)

  1. Pick a test VLAN (e.g., VLAN 20) and a test VM with a known-good IP/gateway.
  2. Verify the switch trunk: trunk mode, VLAN 20 allowed, correct native VLAN (or no untagged reliance).
  3. Verify host uplink: link up, bond correct, no flapping.
  4. Verify bridge mode: VLAN-aware enabled if you rely on Proxmox tags.
  5. Verify VM NIC config: tag=20 if host tags; otherwise no tag if guest tags.
  6. Verify the bridge VLAN table includes VLAN 20 on uplink and correct VM tap behavior.
  7. Observe ARP at three points: tap, bridge/uplink, and (if possible) switch/gateway side.
  8. Prove L3 with gateway ping; then prove beyond gateway.
  9. Test MTU if performance or large packets are involved.
  10. Only after it works, scale to more VLANs and automation.
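
Step 6 can be checked mechanically before you scale out. A sketch under assumptions: `vlan_on_port` is a hypothetical helper, and it parses the repeated-port-name layout shown in Task 5 (some iproute2 versions leave the port column blank on continuation lines, which this does not handle).

```shell
#!/bin/sh
# Hypothetical helper: read `bridge vlan show` output on stdin and report
# whether <vid> is present on <port>. Assumes the port name appears on
# every membership line, as in the Task 5 output earlier in this article.
vlan_on_port() {  # usage: bridge vlan show | vlan_on_port <port> <vid>
  awk -v port="$1" -v vid="$2" '
    $1 == port && $2 == vid { found = 1 }
    END { print (found ? "OK" : "MISSING") }'
}
```

Run `bridge vlan show | vlan_on_port bond0 20` on every node before a migration wave; a single MISSING is cheaper to find this way than via a dead VM.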

Pre-change checklist (before touching production VLANs)

  • Out-of-band console access works (IPMI/iDRAC/iLO) and credentials are current.
  • Switch port config is backed up or captured (even a pasted config snippet is fine).
  • Maintenance window or rollback plan exists: reverting /etc/network/interfaces and restarting networking may cut you off.
  • Know whether you’re using Proxmox tagging or guest tagging. Write it down.
  • Confirm whether Proxmox firewall is enabled at datacenter/node/VM level and whether bridge netfilter is on.

Post-change checklist (prove it’s actually fixed)

  • VM gets DHCP (if applicable) and can renew lease.
  • VM can ping gateway and at least one IP beyond gateway.
  • Migration test: live-migrate VM to another node and re-test connectivity.
  • Reboot test (if permitted): confirm VLAN survives node reboot.
  • Capture proof: a tcpdump snippet showing VLAN tag present and ARP replies received.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They were migrating a handful of “low risk” VMs to a new Proxmox cluster. The network team had provided two switch ports per host, bundled as LACP, and said the magic words: “It’s a trunk.” The virtualization team nodded, enabled VLAN-aware bridges, assigned tags per VM, and moved on.

Two hours later, the first migration wave started. Half the VMs came up fine. The rest showed “no network” inside the guest. The team did what teams do under pressure: they changed three things at once—rebooted VMs, reloaded network services, toggled firewall flags—and made the evidence worse.

The wrong assumption was subtle: “trunk” meant “all VLANs” to the virtualization team, but the switch side allowed only a small subset. The working VMs happened to live on VLANs that were allowed; the broken ones didn’t. Nothing on the host looked “down.” Link was up, bonds were up, bridges were up. It was the cleanest failure a network can offer: silent dropping of unauthorized VLAN tags.

The fix was boring and immediate: align allowed VLAN lists on every host port-channel, then standardize a checklist: whenever a new VLAN is introduced, update both the Proxmox bridge VLAN allowance and the switch trunk allowed VLANs as a single change.

They also added a canary VM per VLAN that continuously tested ARP and gateway reachability. It’s not glamorous, but it makes the next wrong assumption loud instead of expensive.

Mini-story 2: The optimization that backfired

A performance-minded engineer decided to “simplify and speed up” by creating separate Linux bridges per VLAN, each bound to the same bond uplink, and then attaching VMs to the specific bridge. The logic sounded tidy: fewer VLAN rules, less complexity, cleaner diagrams. It also made the system harder to operate.

The first problem appeared during maintenance. A switch change temporarily altered the native VLAN on the trunk. One of those bridges carried management untagged—because “management doesn’t need tags, it’s on the native VLAN.” The host management IP ended up in a different VLAN, and the node vanished from monitoring. Not down. Just… relocated.

The second problem arrived with migrations. Some nodes had slightly different bridge definitions, and a VM migrating from node A to node B ended up attached to a bridge that existed but wasn’t plumbed the same way. Connectivity became a per-node personality trait.

In the post-incident cleanup, they merged back to a single VLAN-aware bridge per uplink and used Proxmox’s per-VM tags. They also moved host management to an explicit tagged subinterface. Performance didn’t degrade. Operational clarity improved dramatically.

Mini-story 3: The boring but correct practice that saved the day

Another org had a habit that seemed almost old-fashioned: every Proxmox node had an identical network stanza, stored in configuration management, and switch ports were provisioned via a standard template. They still had incidents—everyone does—but the blast radius was smaller.

One afternoon, a new VLAN for a business app was introduced. The VLAN was added to the firewall and core switching, but not to the access layer ports feeding two of the four hypervisors. VMs on those two nodes couldn’t reach their gateway; VMs on the other nodes could.

Here’s where the boring practice paid off: the on-call ran the fast playbook, compared trunk allowed VLANs between ports, and found the mismatch in minutes. No prolonged debate about whether Proxmox “supports that VLAN” or whether the guest driver was “acting weird.” The evidence was clean because the baseline was consistent.

The fix was a single switch template update and reapply. Then they updated their change checklist to include: “Add VLAN to hypervisor trunks” as an explicit item. It’s not clever, but it’s the kind of boring that keeps your sleep schedule intact.

FAQ

1) Should I use VLAN-aware bridges in Proxmox?

Yes, if you’re using Proxmox to tag VM traffic (the common case). VLAN-aware bridges make the Linux bridge behave like a real switch with VLAN membership, which prevents accidental leakage and makes tagging deterministic.

2) My VM is tagged in Proxmox. Should the guest NIC also create VLAN subinterfaces?

No. If Proxmox applies tag=10 on the VM NIC, the guest should treat that NIC as untagged. Guest VLAN subinterfaces are for cases where the VM needs to carry multiple VLANs or act as a router/firewall.

3) Why did enabling VLAN filtering break my host management network?

Because your management IP was probably relying on untagged/native VLAN behavior. When filtering is enabled, the bridge may enforce VLAN membership differently, or your PVID/untagged VLAN handling changes. Fix it by putting management on an explicit tagged interface like vmbr0.10.

4) How do I tell whether the switch is dropping my VLAN?

Run tcpdump -eni bond0 vlan X. If you see tagged ARP requests leaving but no replies returning, the VLAN is likely not permitted on the trunk or the gateway isn’t present on that VLAN.

5) Do I need to configure bridge VLAN tables manually?

Often Proxmox manages much of it when you set VLAN tags on VM NICs and enable VLAN-aware mode. But you still need to validate with bridge vlan show, especially when mixing guest-tagged trunks, bonds, or unusual uplink designs.

6) What’s the difference between “bridge-vids” and “allowed VLANs” on the switch?

bridge-vids tells ifupdown2 which VLAN IDs to install in the Linux bridge’s VLAN table on its ports. The switch “allowed VLANs” list defines what the switch will forward on that trunk. Both must include the VLAN, or traffic dies.

7) My VLAN works on one node but not after migration. What should I standardize?

Standardize: switch trunk config per node, bonding mode, bridge VLAN-aware setting, and VLAN allowance. Then test migrations as part of acceptance. Migration failures are almost always configuration drift.

8) Can Proxmox SDN features cause VLAN confusion?

They can, mainly by abstracting away the underlying Linux bridge behavior. If something breaks, drop to fundamentals: inspect bridge vlan show, ip -d link, and confirm tags with tcpdump. The kernel is still doing the forwarding.

9) Why does ARP work but TCP fails?

Because ARP proves only L2 neighbor discovery. TCP can fail due to MTU/PMTUD, firewall rules, asymmetric routing, or upstream ACLs. After ARP succeeds, test MTU and trace routing/policy.

10) Is it okay to trunk “all VLANs” to Proxmox?

Technically yes. Operationally, it depends on your security model. In many environments, limiting allowed VLANs reduces blast radius. If you do limit them, treat it as a contract: new VLAN requires switch + Proxmox updates together.

Conclusion: next steps that prevent repeats

If your Proxmox VLAN isn’t working, don’t keep clicking around the UI hoping the universe forgives you. Do the measurable steps:

  1. Prove tagging with tcpdump on the uplink.
  2. Prove permission with bridge vlan show and switch allowed VLANs.
  3. Prove L3 with ARP/neighbors and gateway reachability.
  4. Standardize configs across nodes so migration doesn’t become roulette.
  5. Make management explicit (tagged subinterface), and stop relying on native VLANs unless you truly mean it.

Then write down your chosen model (host tags vs guest tags), template the switch ports, and keep one canary VM per VLAN. It’s not fancy. It works. And in production, “works” is the feature.
