You built the VM network. You tagged the VLAN. You clicked “VLAN aware”. Still: no DHCP, no ping, and your switch insists everything is fine. Meanwhile, the ticket says “urgent” because someone’s “simple app VM” can’t reach its database, and you’re the one holding the pager.
This is the Proxmox VLAN failure pattern: the hypervisor is doing something perfectly reasonable, the switch is doing something perfectly reasonable, and your design is the thin layer of misunderstanding in between. Let’s fix it like we mean it—repeatably, with proof, and without “try rebooting the node” as the plan.
Fast diagnosis playbook
If VLANs “don’t work” you can lose hours arguing with the switch team, the firewall team, and the person who “only changed a description.” Don’t. Do this instead. You’re hunting for the first place the tag is wrong, missing, or dropped.
First: prove the tag on the wire (hypervisor side)
- Identify where the VM plugs in (tap the bridge port, not your gut).
- Capture traffic and inspect VLAN tags. If the frames leaving the node are untagged when they should be tagged, stop blaming the switch.
- Confirm the bridge has VLAN filtering enabled. A “VLAN aware” checkbox you didn’t actually apply is a classic (a two-command sketch follows this list).
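A minimal sketch of that first check, assuming the physical uplink is eno1 (adjust to your NIC or bond):
cr0x@server:~$ bridge vlan show dev eno1
cr0x@server:~$ tcpdump -eni eno1 -c 20 vlan
The first command shows whether the bridge is filtering and which VLANs the uplink carries; the second shows whether frames on the wire actually carry 802.1Q tags. Both checks are expanded into full tasks later in this article.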
Second: prove the switch port is a trunk (and the native VLAN isn’t sabotaging you)
- Check allowed VLANs. “Trunk” isn’t enough; trunks can still be restricted.
- Check native VLAN. If your environment assumes “no untagged frames” but the port assigns untagged traffic to VLAN 1 (or some other VLAN), you just built a quiet black hole.
Third: validate the guest’s tagging expectations
- Either Proxmox tags for the guest (set VLAN tag on the VM NIC), or the guest tags itself (VLAN sub-interface inside the VM). Doing both produces double-tagging and sadness.
- Confirm DHCP is on the correct VLAN. Plenty of “VLAN issues” are really “DHCP relay isn’t configured there.”
One quote to keep you honest: “Hope is not a strategy.” — General Gordon R. Sullivan (commonly cited in operations contexts). It’s blunt, but it beats “I think it should work.”
The mental model: where VLAN tagging actually happens
VLAN troubleshooting becomes easier once you stop thinking in terms of “VLANs on a bridge” as magic, and start thinking in terms of who adds the 802.1Q tag and who is expected to remove it.
There are only three actors that matter
- The guest (VM or container). It may send tagged frames (VLAN subinterfaces) or plain frames (normal NIC).
- The Proxmox host. It may tag/untag frames per VM NIC setting and bridge VLAN filtering rules, or it may pass tags through.
- The switch port. It may treat the port as access (untagged, one VLAN) or trunk (tagged, many VLANs, plus a native/untagged VLAN).
Everything else is decoration. When “VLAN not working” happens, one of these actors is adding a tag when it shouldn’t, failing to add a tag when it should, or dropping traffic because it sees an unexpected tag.
Proxmox-specific reality check
Proxmox gives you two mainstream ways to bridge: Linux bridge (the default) and Open vSwitch. Both can work. Linux bridge with VLAN filtering is absolutely fine for most deployments and has the fewest moving parts. Prefer it unless you have a strong reason not to.
VLAN-aware bridge in Proxmox means: the Linux bridge has VLAN filtering enabled so that the bridge is capable of per-port VLAN membership, and Proxmox can program those VLAN rules based on VM NIC tags.
Key detail: a “VLAN tag” set on a VM NIC in Proxmox is not just a UI field. It turns that VM NIC into an access port for that VLAN, carried tagged over the trunk upstream: the host tags traffic leaving the physical uplink and un-tags traffic delivered to the guest NIC.
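As a concrete illustration, setting that access-style tag from the CLI looks like this (VMID 100 and the MAC are placeholders; qm set rewrites the whole net0 option, so keep the model, MAC, and bridge you already use):
cr0x@server:~$ qm set 100 --net0 virtio=DE:AD:BE:EF:10:00,bridge=vmbr0,tag=100,firewall=1
From that point on, the guest configures a plain untagged interface and the host does the 802.1Q work.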
Misalignment is the disease. Alignment is the cure.
Joke #1: VLANs are like office politics—everyone swears they’re simple until you ask who owns the trunk and what the native VLAN is.
Interesting facts and historical context
- 802.1Q tagging adds 4 bytes to an Ethernet frame; this is why VLANs and MTU can collide in surprising ways when you’re close to 1500.
- VLAN 1 is historically the default on many switches; “native VLAN” defaults commonly point there, which is how untagged traffic silently ends up somewhere you didn’t intend.
- Linux bridge VLAN filtering became mainstream in the mid-2010s; before that, many Linux setups relied on separate VLAN subinterfaces (like eth0.100) instead of per-port VLAN rules.
- Proxmox moved many users from “manual Debian networking” to an opinionated model where the UI writes /etc/network/interfaces. This improved consistency—and created new ways to be confidently wrong.
- Hardware offloads (especially VLAN offload) can make packet captures lie if you capture on the wrong interface; sometimes the kernel sees tags that tcpdump doesn’t, or vice versa.
- Q-in-Q (802.1ad) exists for “double tagging,” but most SMB/enterprise virtualization setups are not doing provider bridging. Accidental double tagging looks like Q-in-Q, but it isn’t intentional and usually gets dropped.
- Bridges and bonds have had a long, occasionally messy relationship in Linux networking; some “VLAN problems” are really bond mode / LACP mismatch issues that manifest as selective packet loss.
- Spanning Tree was designed to prevent loops in bridged networks; misconfigured STP can look like a VLAN outage because traffic is blocked, not lost.
The correct configuration patterns (and when to use them)
Pattern A: “Trunk to the host, Proxmox tags per VM NIC” (recommended)
This is the clean production pattern for most Proxmox clusters. The switch port to the node is a trunk. The Linux bridge is VLAN-aware. Each VM NIC gets a VLAN tag in Proxmox. Guests think they’re on a plain Ethernet network. Life is good.
Use when: you want simple guest configs, centralized VLAN control, and fewer “why is this VM using VLAN 200?” mysteries.
Avoid when: you need the guest to see multiple VLANs on one vNIC (e.g., router VM, firewall VM). For that, you want tag passthrough.
Example /etc/network/interfaces
cr0x@server:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.11/24
    gateway 10.10.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
The bridge-vlan-aware yes bit is the ticket. bridge-vids 2-4094 tells the bridge which VLAN IDs are even allowed to exist on this bridge. If you omit it, Proxmox still works in many cases because it programs VLAN membership dynamically, but in production I prefer being explicit. It’s one less “why is VLAN 3000 not passing?” surprise.
Pattern B: “Trunk to the host, guest tags itself” (tag passthrough)
Here you want the VM to receive tagged frames and create VLAN subinterfaces inside the guest. This is how you build a router VM, firewall VM, Kubernetes nodes doing CNI tricks, or appliances that expect a trunk.
Proxmox configuration: leave the VM NIC VLAN tag empty (no tag). The bridge must still be VLAN-aware if you want filtering and to avoid accidental leakage, and it must allow the VLANs that guest is supposed to see on its port.
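Inside a Linux guest, the tagging half of this pattern is just a VLAN subinterface. A minimal sketch, assuming the guest sees its NIC as ens18 and needs VLAN 200 (names and addresses are placeholders):
cr0x@guest:~$ ip link add link ens18 name ens18.200 type vlan id 200
cr0x@guest:~$ ip link set ens18.200 up
cr0x@guest:~$ ip addr add 10.10.200.50/24 dev ens18.200
The base interface ens18 stays unconfigured, or carries only the untagged/native network if you allow one.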
Failure mode: if you set a VLAN tag in Proxmox and tag inside the guest, you created double-tagging. Most switch ports won’t pass it. Your packet capture will look like modern art.
Pattern C: “Access port to the host, single VLAN only” (not recommended, but exists)
This is what people do when they don’t control the switch or they’re “just testing.” The switch port is an access port in one VLAN. The Proxmox bridge is not VLAN-aware (or it can be, but it’s pointless here). Every VM connected to that bridge lands in the same VLAN unless you start doing VLAN subinterfaces on the host.
Use when: labs, single-purpose nodes, or a provider environment where the access VLAN is the product.
Avoid when: you care about scaling, multi-tenant segmentation, or your future self’s sanity.
Pattern D: “Bond (LACP) + trunk + VLAN-aware bridge” (common in clusters)
Most clusters bond two NICs for redundancy and throughput, then put a VLAN-aware bridge on top. This is fine. But it introduces additional failure modes: LACP not negotiated, hashing causing asymmetric issues, or a switch port configured as standalone while the host expects LACP.
The best practice: keep it boring. Bond in 802.3ad if your switch team supports it and will configure it properly; otherwise use active-backup and sleep at night.
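A minimal /etc/network/interfaces sketch for this pattern, assuming eno1/eno2 as bond members and the same addressing as before (adapt names, bond mode, and hash policy to what your switch team actually configures):
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.11/24
    gateway 10.10.10.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
If the switch side is not a matching LACP port-channel, change bond-mode to active-backup rather than hoping.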
Joke #2: LACP is great until it isn’t; then it’s a distributed system with two nodes and three opinions.
Practical tasks: commands, output meaning, and decisions
You don’t debug VLANs with vibes. You debug them with observations. Below are field-tested checks that answer “what is true right now?” and what to do next.
Task 1: Confirm Proxmox thinks the bridge is VLAN-aware
cr0x@server:~$ grep -A10 -n "iface vmbr0" /etc/network/interfaces
12:iface vmbr0 inet static
13- address 10.10.10.11/24
14- gateway 10.10.10.1
15- bridge-ports eno1
16- bridge-stp off
17- bridge-fd 0
18- bridge-vlan-aware yes
19- bridge-vids 2-4094
What it means: If bridge-vlan-aware yes is absent, Proxmox cannot apply per-port VLAN filtering the way you expect.
Decision: If missing, add it (and ideally bridge-vids), then reload networking in a maintenance window.
Task 2: Verify VLAN filtering is actually enabled in the kernel bridge
cr0x@server:~$ bridge link show
2: eno1 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0
3: tap100i0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0
4: fwpr100p0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0
cr0x@server:~$ bridge vlan show
port vlan-id
eno1 1 PVID Egress Untagged
eno1 100
tap100i0 100 PVID Egress Untagged
fwpr100p0 100 PVID Egress Untagged
What it means: You’re looking for VLAN membership per port. If bridge vlan show is empty or shows only VLAN 1 everywhere, the bridge isn’t filtering the way you think.
Decision: If the VM should be in VLAN 100 and its tap port is not PVID 100, fix the VM NIC tag or the bridge VLAN config.
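For diagnostics only, you can adjust bridge VLAN membership by hand; Proxmox normally programs this itself and may overwrite manual changes on the next config apply. A sketch assuming the VM’s tap is tap100i0:
cr0x@server:~$ bridge vlan add dev tap100i0 vid 100 pvid untagged
cr0x@server:~$ bridge vlan show dev tap100i0
If the manual entry makes traffic flow, the durable fix is the VM NIC tag or the bridge config, not the runtime hack.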
Task 3: Check the Proxmox VM NIC configuration (tagging mode)
cr0x@server:~$ qm config 100 | egrep "net0|bridge|tag|trunks"
net0: virtio=DE:AD:BE:EF:10:00,bridge=vmbr0,tag=100,firewall=1
What it means: This VM NIC is an access port for VLAN 100. The guest should not create VLAN subinterfaces for this NIC.
Decision: If the guest is tagging too, remove guest VLAN config or remove Proxmox tag and configure trunking properly.
Task 4: Validate the physical uplink is up, correct speed, and not flapping
cr0x@server:~$ ip -br link show eno1
eno1 UP 5c:ba:ef:00:11:22 <BROADCAST,MULTICAST,UP,LOWER_UP>
cr0x@server:~$ ethtool eno1 | egrep "Speed|Duplex|Link detected"
Speed: 10000Mb/s
Duplex: Full
Link detected: yes
What it means: Link is up. If you see speed drops or link flaps, you can chase VLAN ghosts forever while the physical layer burns.
Decision: If link is unstable, stop here and fix cabling/SFPs/switch port errors.
Task 5: Check for switch-facing errors and drops on the host
cr0x@server:~$ ip -s link show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 5c:ba:ef:00:11:22 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 124 0 45678
TX: bytes packets errors dropped carrier collsns
876543210 1122334 0 0 0 0
What it means: Dropped RX packets might be congestion, ring buffer limits, or upstream policing. It’s not proof of VLAN issues, but it’s a red flag.
Decision: If drops climb during tests, investigate switch port congestion, NIC driver, and MTU mismatches.
Task 6: Confirm the bridge MTU and consider VLAN overhead
cr0x@server:~$ ip -d link show vmbr0 | egrep "mtu|vlan|bridge"
2: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
bridge vlan_filtering 1 vlan_default_pvid 1
What it means: VLAN filtering is on (vlan_filtering 1). MTU is 1500, which is fine for most VLAN networks but can bite you if your environment expects jumbo frames end-to-end.
Decision: If you run MTU 9000 on storage/vMotion-like networks, enforce consistent MTU on NIC, bridge, and switch trunk.
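A quick end-to-end jumbo check, assuming MTU 9000 and a gateway at 10.10.100.1 (both placeholders): send a don’t-fragment ping sized to fill the frame (8972 bytes of payload plus 28 bytes of IP/ICMP headers equals 9000):
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.100.1
If that fails while a default-size ping works, something in the path is still at 1500. Persist the fix with mtu lines on the NIC, bond, and bridge stanzas in /etc/network/interfaces, and on the switch trunk.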
Task 7: tcpdump on the bridge to see VLAN tags (or lack thereof)
cr0x@server:~$ tcpdump -eni eno1 -c 10 vlan
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:01:11.123456 5c:ba:ef:00:11:22 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 346: vlan 100, p 0, ethertype IPv4, 0.0.0.0.68 > 255.255.255.255.67: DHCP, Request
10:01:11.223456 00:11:22:33:44:55 > 5c:ba:ef:00:11:22, ethertype 802.1Q (0x8100), length 370: vlan 100, p 0, ethertype IPv4, 10.10.100.1.67 > 10.10.100.50.68: DHCP, Reply
What it means: You can literally see VLAN 100 on the wire. This immediately answers “is the host tagging and receiving correctly?”
Decision: If you don’t see VLAN tags leaving the host, focus on Proxmox bridge/VM NIC config. If you see tags leaving but nothing returning, focus on the switch trunk and upstream L2/L3.
Task 8: tcpdump on the VM tap interface to confirm untagged delivery to guest
cr0x@server:~$ tcpdump -eni tap100i0 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap100i0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:02:01.100001 de:ad:be:ef:10:00 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 328: 0.0.0.0.68 > 255.255.255.255.67: DHCP, Request
10:02:01.200002 00:11:22:33:44:55 > de:ad:be:ef:10:00, ethertype IPv4 (0x0800), length 352: 10.10.100.1.67 > 10.10.100.50.68: DHCP, Reply
What it means: No 802.1Q here. That’s expected when Proxmox is doing the tagging and the guest is “access VLAN”.
Decision: If you see 802.1Q tags on the tap when you don’t expect them, you’ve set up tag passthrough or the guest is tagging unexpectedly.
Task 9: Check that Proxmox firewall isn’t dropping what you think is “VLAN traffic”
cr0x@server:~$ pve-firewall status
Status: enabled/running
cr0x@server:~$ pve-firewall compile
root@pam: OK
What it means: Firewall is active and compiles. VLAN itself isn’t “blocked by firewall,” but DHCP, ARP, or ICMP might be.
Decision: Temporarily disable firewall on the VM NIC (not globally) to isolate. If it fixes it, you’ve got rules to review—not a VLAN problem.
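One way to do that from the CLI, assuming VMID 100 and the net0 line shown in Task 3 (qm set replaces the whole option, so keep the existing MAC, bridge, and tag):
cr0x@server:~$ qm set 100 --net0 virtio=DE:AD:BE:EF:10:00,bridge=vmbr0,tag=100,firewall=0
Re-enable with firewall=1 once you’ve found and fixed the offending rule.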
Task 10: Validate ARP resolution in the host namespace (a fast L2 sanity check)
cr0x@server:~$ ip neigh show dev vmbr0 | head
10.10.10.1 lladdr 00:aa:bb:cc:dd:ee REACHABLE
10.10.10.50 lladdr de:ad:be:ef:10:00 STALE
What it means: The host sees neighbors on vmbr0 (likely management VLAN). This doesn’t prove VLAN 100 works, but it proves the bridge is alive and doing L2.
Decision: If neighbor table is empty and you can’t ARP even in management, fix base connectivity before VLAN work.
Task 11: For trunk-to-guest mode, confirm the guest sees tagged VLANs (host side)
cr0x@server:~$ qm config 200 | egrep "net0|tag"
net0: virtio=DE:AD:BE:EF:20:00,bridge=vmbr0,firewall=0
cr0x@server:~$ tcpdump -eni tap200i0 -c 5 vlan
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap200i0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:03:11.010101 de:ad:be:ef:20:00 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 346: vlan 200, p 0, ethertype IPv4, 0.0.0.0.68 > 255.255.255.255.67: DHCP, Request
What it means: The tap sees VLAN tags, which is expected when the guest is trunking/tagging.
Decision: If you expect trunk mode but see no VLAN tags, the guest isn’t tagging, or Proxmox is stripping tags via an unintended config.
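When you do hand a trunk to a guest, it’s worth limiting which VLANs that guest may actually use; the Proxmox NIC option trunks exists for this. A sketch assuming VM 200 should only see VLANs 200 and 210 (quote the value so the shell doesn’t treat the semicolon as a command separator):
cr0x@server:~$ qm set 200 --net0 'virtio=DE:AD:BE:EF:20:00,bridge=vmbr0,trunks=200;210'
Tags outside that list are then filtered at the bridge port instead of going wherever the guest decides to send them.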
Task 12: Check bond status if you’re using one (LACP/active-backup sanity)
cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: fast
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 17
Partner Key: 42
Partner Mac Address: 00:25:90:12:34:56
Slave Interface: eno1
MII Status: up
Actor Churn State: none
Partner Churn State: none
Slave Interface: eno2
MII Status: up
Actor Churn State: none
Partner Churn State: none
What it means: LACP negotiated and both slaves are up. If Partner Key/MAC is missing or only one slave is active unexpectedly, your upstream port-channel config may be wrong.
Decision: Fix LACP/port-channel alignment before chasing VLAN configuration. A broken bond can drop some VLANs “randomly” depending on hashing and upstream behavior.
Task 13: Confirm Proxmox applied config (no stale runtime state)
cr0x@server:~$ ifreload -a
warning: vmbr0: interface is already configured
warning: eno1: interface is already configured
What it means: Reload happened; warnings can be normal. What you want is: no hard errors, and runtime state matches the file.
Decision: If reload fails or you’re afraid of it, plan a maintenance window and restart networking carefully. Don’t wing it remotely unless you enjoy surprise console sessions.
Task 14: Check for VLAN-related sysctl or nftables surprises (rare, but real)
cr0x@server:~$ sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.bridge.bridge-nf-call-arptables
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-arptables = 1
What it means: Bridged traffic is being fed into iptables/nftables hooks. This can be desired (filtering) or disastrous (unexpected drops/latency).
Decision: If you’re not intentionally filtering bridged traffic, consider disabling these or audit firewall rules carefully. “VLAN broken” sometimes means “bridge netfilter ate DHCP.”
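To test the “bridge netfilter ate it” theory, the sysctls can be flipped at runtime (persist in /etc/sysctl.d/ only after you’re sure; if you use the Proxmox VM firewall, it relies on bridge netfilter, so treat this as a short diagnostic toggle):
cr0x@server:~$ sysctl -w net.bridge.bridge-nf-call-iptables=0
cr0x@server:~$ sysctl -w net.bridge.bridge-nf-call-ip6tables=0
If DHCP or ARP suddenly works, the problem is a rule, not a VLAN.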
Three corporate mini-stories from the VLAN trenches
Mini-story 1: The outage caused by a wrong assumption
A mid-size company was consolidating racks. The virtualization team moved a Proxmox node from an old switch stack to a new leaf. The switch team said, “Port is trunked, you’re good.” The Proxmox node came up, management worked, but every tenant VLAN on that node was dead.
The virtualization team assumed “trunked” meant “all VLANs allowed.” On the new switch template, trunks were restricted to a small default allow-list. Management VLAN was included (because everything needs management), but tenant VLANs weren’t. The host dutifully tagged frames; the switch dutifully dropped them. Both sides behaved correctly.
It got worse because the symptom was selective: some VMs on the node could talk to “shared services” that were accidentally in the management VLAN, so the app owners reported “intermittent” issues. Someone even floated “Proxmox is buggy with VLANs.” It wasn’t.
The fix was boring: align the allowed VLAN list with the hypervisor’s intended VLAN range, and document it. The postmortem action item that mattered wasn’t “improve monitoring.” It was “stop using the word trunk without specifying allowed VLANs and native VLAN behavior.”
Mini-story 2: The optimization that backfired
Another shop wanted to reduce “network complexity.” They decided every Proxmox node would have a single bond, and every bridge would run with a trimmed VLAN list: only VLANs currently in use. The idea was to reduce the blast radius if a VM was mis-tagged.
It worked—until it didn’t. A new project spun up on VLAN 240, and the VM team tagged the NIC in Proxmox correctly. But bridge-vids on vmbr0 only allowed 2-239. The Linux bridge filtered the VLAN before it ever hit the switch. From the switch perspective, nothing was happening. From Proxmox’s perspective, nothing was “wrong.” From the VM perspective, the universe was silent.
The ticket bounced between teams because everyone checked their own layer: the VM config had tag=240, the switch trunk allowed VLAN 240, the firewall had rules, DHCP was ready. The missing piece was that the host bridge had been “optimized” to exclude future VLANs.
They kept VLAN filtering (good), but changed policy: allow a broad VLAN range on the node uplinks, and control VM VLAN access via Proxmox permissions and review. Security-by-outdated-allowlist is not security. It’s a time bomb.
Mini-story 3: The boring practice that saved the day
A financial services team ran Proxmox with strict change control. Every node had a standard network config: bond0 (active-backup), vmbr0 for management, vmbr1 VLAN-aware trunk for guests, and a small “network validation VM” present on each node. That VM could run DHCP requests, ping gateways, and emit tagged frames on demand.
One morning, a whole rack lost connectivity for VLAN 310 only. Not management, not other VLANs. The switch logs were noisy but inconclusive. People started suspecting an ACL change on the core.
The on-call SRE used the validation VM on two nodes: one in the affected rack, one in a healthy rack. Same VLAN tag, same DHCP test, same ping target. Captures on the physical uplinks showed the affected rack never received replies—no DHCP Offer, no ARP response—while the healthy rack did.
That evidence narrowed the blast radius to “somewhere between ToR and upstream” in minutes. The switch team found a port-channel member with a stale config that didn’t allow VLAN 310. Because the bond was active-backup, failing over the active link moved traffic to the correctly-configured member and restored service quickly.
The boring practice was: standardized configs, a tiny validation VM, and a habit of capturing packets at the edge. It wasn’t glamorous. It was effective, which is better.
Common mistakes: symptom → root cause → fix
1) No DHCP on a VLAN-tagged VM, but link is up
Symptom: VM NIC shows “connected,” but DHCP times out. Pings fail.
Root cause: Switch port is access VLAN (or trunk missing VLAN) while Proxmox is tagging VLANs, so tagged frames are dropped.
Fix: Configure switch port as trunk and allow the VLAN. Confirm with tcpdump -eni eno1 vlan that replies arrive.
2) VM can talk on the wrong network (surprise connectivity)
Symptom: A VM intended for VLAN 200 gets an IP from VLAN 1 or management VLAN.
Root cause: Native VLAN / PVID mismatch: untagged traffic is being mapped to an unintended VLAN somewhere (host or switch).
Fix: Decide what untagged means. Prefer “no untagged” on trunks in production. Make sure vmbr uplink isn’t using VLAN 1 as PVID unless that’s intentional.
3) Only one VLAN works; others are dead
Symptom: VLAN 100 works, VLAN 200 doesn’t, on the same host/uplink.
Root cause: Allowed VLAN list on trunk is restricted, or bridge-vids excludes the VLAN.
Fix: Check bridge vlan show and the switch trunk allow-list. Expand the allowed set and retest.
4) “Intermittent VLAN issues” across bonded uplinks
Symptom: Random packet loss, some flows work, others don’t; worse under load.
Root cause: LACP mismatch or misconfigured port-channel members with inconsistent allowed VLANs/native VLAN/MTU.
Fix: Validate bond state in /proc/net/bonding/* and ensure switch port-channel members are identical. Don’t accept “mostly identical.”
5) Packet captures show no VLAN tags even though you set tag=100
Symptom: tcpdump on eno1 doesn’t show 802.1Q.
Root cause: VLAN offload/capture point confusion, or traffic never leaves because bridge filtering blocks it.
Fix: Capture on both the tap and physical uplink; verify bridge vlan show. If needed, temporarily disable VLAN offload (during diagnostics) and retest.
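A diagnostic-only sketch, assuming the uplink is eno1 (offload feature names vary slightly by driver and ethtool version; ethtool -k eno1 lists what yours calls them):
cr0x@server:~$ ethtool -k eno1 | grep vlan-offload
cr0x@server:~$ ethtool -K eno1 rx-vlan-offload off tx-vlan-offload off
Turn the offloads back on after the capture; they exist for a reason.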
6) Guests on the same VLAN can’t talk to each other
Symptom: Two VMs on VLAN 300 can reach gateway sometimes, but not each other reliably.
Root cause: Proxmox firewall rules, ebtables/nft filtering, or STP/loop protection blocking intra-host switching.
Fix: Temporarily disable VM firewall, confirm L2 works, then reintroduce rules carefully. Check bridge netfilter sysctls.
7) Containers (LXC) behave differently than VMs
Symptom: VLAN-tagged VM works; VLAN-tagged container doesn’t (or vice versa).
Root cause: Containers attach to the bridge via veth pairs rather than tap devices, and you may be tagging at the wrong layer (the container’s Proxmox network config vs. inside the container).
Fix: Standardize: either set the VLAN tag in the container’s Proxmox network config (host side) or tag inside the container, not both, and confirm with bridge vlan show.
8) Migration to another node breaks the VM network
Symptom: VM works on node A, fails on node B.
Root cause: Inconsistent network config across nodes: bridge not VLAN-aware, different bond/switch trunk, missing VLANs on allow-list.
Fix: Treat Proxmox networking as cluster-level configuration. Same vmbr names, same VLAN-aware settings, same physical trunk behavior everywhere.
Checklists / step-by-step plan
Step-by-step: Build a correct VLAN-aware bridge on a single uplink
- Decide the model: host tags per VM NIC (recommended) or guest tags itself (trunk-to-guest). Write it down. This avoids double-tagging.
- Configure the switch port: trunk, allow VLANs you will use, set native VLAN intentionally (or effectively disable untagged). Keep MTU consistent.
- Configure Proxmox bridge: Linux bridge with bridge-vlan-aware yes. Optionally set bridge-vids to a sane range.
- Attach VM NICs: for access style, set tag=VLANID on each VM NIC. For trunk-to-guest, leave tag unset.
- Validate with captures: confirm tags on the physical uplink, and untagged frames at the tap for access-style VMs.
- Lock in consistency: replicate across nodes. Inconsistency is how migrations become outages.
Checklist: Before you blame VLANs
- Is the physical link stable? (ethtool, ip -s link)
- Is the bridge actually VLAN-filtering? (bridge vlan show)
- Is the VM NIC tagged in exactly one place? (Proxmox or guest)
- Does the switch trunk allow the VLAN? (not “is it trunked”)
- Do you see DHCP Offer/ARP replies on the uplink capture?
- Is MTU consistent end-to-end for the VLAN?
- Is firewall filtering bridged traffic unexpectedly?
Checklist: Safe change procedure for production nodes
- Schedule a window if you might touch bridge-ports, bonds, or reload networking.
- Have out-of-band console access ready. Remote-only changes to networking are a lifestyle choice.
- Change one variable at a time: switch trunk allow-list, then host bridge, then VM tags.
- Capture before/after on the physical uplink. It’s your truth serum.
- After changes, test: DHCP, ping gateway, ping between VMs, and a small TCP session.
FAQ
1) Do I need Open vSwitch to make VLANs work in Proxmox?
No. Linux bridge with VLAN-aware enabled is enough for most deployments. OVS is useful if you need its features or operational model, not as a VLAN band-aid.
2) What does “VLAN aware” actually change?
It enables VLAN filtering on the Linux bridge, allowing per-port VLAN membership and letting Proxmox program access/trunk behavior per VM NIC.
3) Should I set bridge-vids 2-4094?
In production, yes—unless you have a reason to restrict it. It prevents weird “VLAN not passing because it’s outside the allowed range” problems when someone adds a new VLAN later.
4) Why does my VM lose connectivity after live migration?
The destination node likely has different VLAN bridge settings, a different trunk allow-list upstream, or a different vmbr mapping. Make node networking identical for anything that hosts migrated workloads.
5) Where should VLAN tagging happen: in Proxmox or inside the guest?
Pick one. For normal VMs, tag in Proxmox (simpler, centrally controlled). For router/firewall/appliance VMs that need multiple VLANs, tag inside the guest and pass the trunk through.
6) Why does tcpdump sometimes not show VLAN tags when I expect them?
VLAN offload can cause tags to be handled in hardware, making captures confusing depending on interface and driver. Capture on multiple points (tap and physical), and correlate with bridge vlan show.
7) Can Proxmox firewall break DHCP on a VLAN?
Yes. DHCP is broadcast-heavy and sensitive to filtering. If bridged traffic is passing through netfilter hooks, a rule can drop DHCP or ARP and make it look like “VLAN is broken.”
8) Do VLANs require a higher MTU?
Not necessarily. Standard MTU 1500 works with VLAN tagging in most environments. Problems show up when you’re trying to run jumbo frames and one link in the path didn’t get the memo.
9) My switch says the port is trunking, but VLAN still fails. What should I ask for?
Ask for: allowed VLAN list, native VLAN, and confirmation that all port-channel members have identical VLAN/MTU settings. “It’s a trunk” is not an answer.
10) Is it safe to run management traffic on the same VLAN-aware bridge as guests?
It can be, but it’s not my favorite. Separate management (vmbr0) from guest trunks (vmbr1) when you can. It reduces blast radius and makes debugging less cinematic.
Conclusion: next steps that prevent repeat incidents
Fixing a Proxmox VLAN issue is rarely about “the right magic setting.” It’s about making the tagging contract explicit: who tags, where it’s allowed, and what untagged means. Once you can point at a packet capture and say “this is VLAN 100 leaving the host,” arguments stop and engineering resumes.
Do these next:
- Standardize node networking across the cluster (same vmbr names, VLAN-aware settings, and bond behavior).
- Document switch trunk expectations in operational language: allowed VLANs, native VLAN, MTU, LACP policy.
- Add a repeatable validation routine: a known VM or script that can DHCP/ping on a specified VLAN, plus a capture recipe (a minimal sketch follows this list).
- Make double-tagging hard by policy: either Proxmox tags access VLANs, or the guest does trunking—never both.
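A minimal validation sketch in that spirit. Everything here is a placeholder to adapt: the vmbr0 bridge, VLAN 310, the 10.10.31.1 gateway, and the script name itself are assumptions, and it runs on the node against the VLAN-aware bridge:
cr0x@server:~$ cat vlan-check.sh
#!/bin/bash
# vlan-check.sh (hypothetical): bring up a tagged test interface on the
# VLAN-aware bridge, try DHCP, ping the expected gateway, then clean up.
set -eu
BRIDGE=vmbr0        # VLAN-aware bridge (placeholder)
VLAN=310            # VLAN to test (placeholder)
GATEWAY=10.10.31.1  # expected gateway on that VLAN (placeholder)

# allow the VLAN on the bridge's own port, then add a tagged subinterface
bridge vlan add dev "$BRIDGE" vid "$VLAN" self
ip link add link "$BRIDGE" name "$BRIDGE.$VLAN" type vlan id "$VLAN"
ip link set "$BRIDGE.$VLAN" up

# DHCP test (bounded), then a gateway ping bound to the test interface
timeout 15 dhclient -1 -v "$BRIDGE.$VLAN" || echo "no DHCP lease on VLAN $VLAN"
ping -c 3 -I "$BRIDGE.$VLAN" "$GATEWAY" || echo "no reply from $GATEWAY on VLAN $VLAN"

# clean up the test interface and the bridge VLAN entry
ip link del "$BRIDGE.$VLAN"
bridge vlan del dev "$BRIDGE" vid "$VLAN" self
Pair it with a tcpdump -eni eno1 vlan capture on the physical uplink (adjust the interface name) and you have the “tag on the wire” proof in one step.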
If you do nothing else, remember this: VLAN debugging is just checking whether the tag you believe in is the tag that’s actually on the wire.