Networking on Proxmox vs ESXi: VLAN trunking, bridges/vSwitches, bonding/LACP explained

If you’ve ever migrated a cluster and watched “simple networking” turn into a four-hour outage, you already know the truth:
hypervisor networking is where assumptions go to die. VLANs behave. Trunks behave. Until the day they don’t—because the box is
doing exactly what you told it to, not what you meant.

This is the practical map between Proxmox and ESXi networking: how VLAN trunking actually lands on a Linux bridge versus a vSwitch,
how bonding/LACP works (and how it fails), and what to check when packets vanish into the void.

1) The mental model: bridges vs vSwitches

ESXi networking is a purpose-built virtual switch stack. You create vSwitches (standard or distributed), define port groups,
attach vmnics (uplinks), and you get policy enforcement in a UI-driven system with consistent abstractions.

Proxmox networking is Linux networking with a GUI. Under the hood you’re using Linux bridges (often named vmbr0),
VLAN subinterfaces, and Linux bonding. The “switch” is the kernel. The upside is you can debug it like any other Linux host,
with standard tools. The downside is you can also shoot yourself in the foot like any other Linux host, with standard tools.

What maps to what

  • ESXi vSwitch / vDS ≈ Linux bridge (Proxmox vmbrX)
  • ESXi uplinks (vmnicX) ≈ Linux NICs (eno1, ens3f0, etc.) or bond (bond0)
  • ESXi Port Group ≈ VLAN on a bridge port (VLAN-aware bridge + VM tag) or a dedicated bridge/VLAN subinterface
  • ESXi VLAN ID 4095 (guest trunk) ≈ Proxmox NIC with no VLAN tag + guest does tagging

The big conceptual difference: ESXi tends to push VLAN identity into port groups, while Proxmox tends to push it into the VM’s NIC
configuration when you use VLAN-aware bridges. Both can do either style. But humans are creatures of habit, and migrations fail
in the gaps between habits.

The one reliability quote you should tattoo on your runbooks

“Hope is not a strategy.” — a maxim often repeated in operations and reliability circles.
Treat VLAN and LACP configs like code: review, test, and roll back.

2) VLAN trunking: Proxmox vs ESXi in plain terms

A VLAN trunk is just “carry multiple VLANs over one physical link.” The fights start when you ask: “Where does the tagging happen?”

ESXi VLAN trunking

In ESXi, trunking is mostly expressed through port groups:

  • VLAN ID = N: ESXi tags frames leaving that port group with VLAN N and accepts VLAN N inbound.
  • VLAN ID = 0 (or blank): ESXi does not tag; frames leave untagged and land wherever the physical switchport’s native/untagged VLAN puts them (External Switch Tagging).
  • VLAN ID = 4095: special case meaning “VLAN trunk to the guest.” ESXi passes VLAN tags through, and the VM handles tagging.

When you attach a vmnic uplink to a vSwitch or vDS, you’re effectively saying: “This uplink is connected to a switchport that is
configured correctly for whatever VLAN behavior these port groups expect.” ESXi won’t configure your physical switch, and your
physical switch won’t read your mind. That’s why tickets exist.

Proxmox VLAN trunking

In Proxmox, the most common “modern” pattern is:

  • Make vmbr0 a VLAN-aware bridge.
  • Attach your trunk uplink (a physical NIC or bond) as a bridge port.
  • For each VM NIC, set the VLAN tag in the VM’s hardware config.

That is conceptually similar to ESXi port groups with VLAN IDs, except Proxmox puts the tag on the VM NIC instead of on a port group object.
Operationally, it’s fine—until you have to audit 200 VMs for “who’s on VLAN 132?” and you miss one because it was configured differently in
that one special case from 2019.
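
A minimal sketch of that per-VM tagging, assuming VM ID 100 and a VLAN-aware vmbr0 (the VLAN ID and the generated MAC are placeholders):

cr0x@server:~$ qm set 100 -net0 virtio,bridge=vmbr0,tag=132
update VM 100: -net0 virtio=BC:24:11:00:00:64,bridge=vmbr0,tag=132

And the “who’s on VLAN 132?” audit is just a grep across VM configs:

cr0x@server:~$ grep -l 'tag=132' /etc/pve/qemu-server/*.conf
/etc/pve/qemu-server/100.conf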

Alternative Proxmox pattern: create VLAN subinterfaces on the host (e.g., bond0.100) and build a separate bridge per VLAN (e.g., vmbr100).
It’s more verbose, but it’s also brutally clear. Clarity is a performance feature when the outage clock starts.
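
A minimal /etc/network/interfaces sketch of that bridge-per-VLAN pattern, assuming VLAN 100 rides on bond0 (names are placeholders):

auto bond0.100
iface bond0.100 inet manual

auto vmbr100
iface vmbr100 inet manual
        bridge-ports bond0.100
        bridge-stp off
        bridge-fd 0

VMs that belong on VLAN 100 attach to vmbr100 with no tag; the host does the tagging, and the bridge name tells you exactly what you’re plugging into.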

Trunk definition: don’t confuse “allowed VLANs” with “native VLAN”

Physical switch trunks typically have:

  • Allowed VLAN list: what VLAN tags are permitted across the trunk.
  • Native VLAN (or untagged VLAN): what happens to untagged frames (they’re mapped into a VLAN).

If your hypervisor uplink sends untagged frames (management, cluster, storage, whatever) and you assume “untagged means VLAN 1”
while the switch uses native VLAN 20, you’ll get connectivity—just in the wrong place. That’s how outages become incidents with
security implications.
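
For reference, the switch side of a sane trunk is short and explicit (a Cisco IOS-style sketch; syntax varies by vendor, and the VLAN numbers are placeholders):

interface TenGigabitEthernet1/0/1
 switchport mode trunk
 switchport trunk allowed vlan 20,30,100
 switchport trunk native vlan 20

If untagged frames from the hypervisor are supposed to land somewhere specific, say so with the native VLAN line. Don’t let a default decide it for you.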

Joke #1: VLANs are like office seating charts—everyone agrees they’re necessary, and nobody agrees who sits where.

3) ESXi port groups vs Proxmox VLAN-aware bridges

ESXi’s port group is a policy object: VLAN ID, security settings, traffic shaping, NIC teaming rules, and in vDS land, even things
like NetFlow and port mirroring policies. It’s a clean abstraction for environments with lots of consumers and change control.

Proxmox’s VLAN-aware bridge model is closer to “switching in the kernel”: the bridge sees tags and forwards based on MAC+VLAN.
The VM’s tap interface gets a VLAN tag setting, and Linux handles the tagging/untagging.

What to prefer

If you have a small-to-medium environment, VLAN-aware bridges are the simplest and least fragile pattern on Proxmox.
One bridge, one trunk uplink, tags on VM NICs. It scales operationally until you start needing per-network policies and tight
guardrails.

If you need strong separation of duties (netops defines networks, virt team attaches VMs), ESXi’s port group approach tends to fit
corporate reality better. On Proxmox, you can still do separation—just accept you’ll be building it with conventions, automation,
and review, not magical objects.

Guest VLAN trunking (VM does VLAN tagging)

ESXi makes this explicit with VLAN 4095. Proxmox can do it by not setting a VLAN tag on the VM NIC and ensuring the bridge/uplink
passes tagged frames. Then the guest runs 802.1Q subinterfaces inside the VM.

This is useful for virtual routers, firewalls, or appliances that expect trunk links. It is also the fastest way to create a
troubleshooting rabbit hole where nobody knows whether VLAN tagging is in the guest, the hypervisor, or the physical switch.
Pick one place and document it.
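
Inside a Linux guest that receives such a trunk, the 802.1Q part is plain VLAN subinterfaces (a sketch; eth0, VLAN 20, and the address are placeholders):

cr0x@server:~$ sudo ip link add link eth0 name eth0.20 type vlan id 20
cr0x@server:~$ sudo ip addr add 10.20.0.50/24 dev eth0.20
cr0x@server:~$ sudo ip link set eth0.20 up

If that subinterface can’t reach its gateway while a tagged VM NIC on the same VLAN can, the trunk isn’t actually reaching the guest.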

4) Bonding and LACP: what works, what bites

You bond NICs to get redundancy and/or bandwidth. There are two broad categories:

  • Static/team modes (no LACP negotiation): active-backup (no special switch config), balance-xor (expects a static port-channel on the switch side), etc.
  • LACP (802.3ad): negotiated aggregation with the switch, with hashing policies that decide which flow uses which link.

Proxmox bonding

Proxmox typically uses Linux bonding (bond0) with modes like:

  • active-backup: one active link, one standby. Boring. Reliable. My default for management networks (sketch below).
  • 802.3ad (LACP): all links active, better utilization, but requires correct switch config and consistent hashing.
  • balance-xor: static aggregation style; can work but is easier to misconfigure across switches.
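
A minimal ifupdown sketch of that boring active-backup management setup, assuming eno1 and eno2 are the management NICs (names and addressing are placeholders):

auto bond1
iface bond1 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

auto vmbr1
iface vmbr1 inet static
        address 192.0.2.20/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0

No switch-side aggregation config, no hashing debates, and a failure mode you can explain in one sentence.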

ESXi NIC teaming and LACP

ESXi standard vSwitch NIC teaming is not LACP. It’s “teaming” with load-balancing policies (originating port ID, source MAC hash, IP hash, explicit failover). For actual LACP you need a vDS (distributed switch), which is often a licensing question, and then you define an LACP LAG and attach uplinks to it.

Translation: if you’re used to “just set LACP on the host,” ESXi may force you into vDS land. If you’re used to “just team NICs,”
Proxmox might tempt you into static modes that work until a switch replacement changes behavior.
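
To see what a standard vSwitch is actually doing for teaming, this check works from the ESXi shell (values are examples):

cr0x@server:~$ esxcli network vswitch standard policy failover get -v vSwitch0
   Load Balancing: iphash
   Network Failure Detection: link
   Notify Switches: true
   Failback: true
   Active Adapters: vmnic0, vmnic1
   Standby Adapters:
   Unused Adapters:

iphash on a standard vSwitch requires a static port-channel on the physical switch (not LACP); portid, mac, and explicit don’t need any switch-side aggregation at all.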

Hashing is where dreams go to die

LACP doesn’t magically give a single TCP flow 2x bandwidth. It spreads flows across links based on a hash (MAC/IP/port combos).
If you want “one VM gets 20 Gbps,” you probably need something else (multiple flows, SMB multichannel, NVMe/TCP multiple sessions,
or just a bigger pipe).
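
A quick way to see whether flows are actually spreading is to compare per-slave transmit counters under real load (interface names are placeholders):

cr0x@server:~$ ip -s link show ens3f0 | grep -A1 'TX:'
    TX:  bytes   packets  errors  dropped  carrier  collsns
    9876543210   8123456       0        0        0        0
cr0x@server:~$ ip -s link show ens3f1 | grep -A1 'TX:'
    TX:  bytes   packets  errors  dropped  carrier  collsns
     123456789    234567       0        0        0        0

If one link carries almost everything, your hash policy and your traffic pattern disagree. That’s a design conversation, not a bug report.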

Also: storage traffic (iSCSI, NFS) and vMotion/migration traffic can become very sensitive to out-of-order delivery and microbursts
if you “optimize” without measuring. LACP is not a free lunch. It’s a negotiated agreement to argue with physics.

Joke #2: LACP is a relationship—if one side thinks you’re “active-backup” and the other thinks you’re “all-in,” somebody’s going to ghost packets.

5) MTU, offloads, and the “jumbo frames” tax

MTU mismatches are boring and catastrophic. ESXi tends to push MTU settings through vSwitches/vDS and VMkernel interfaces.
Proxmox pushes it through Linux interfaces, bonds, VLAN subinterfaces, and bridges. You can get it right—or you can get it “mostly right,”
which is how you end up with a storage network that works until you turn on replication.

If you run jumbo frames (9000 MTU), run them end-to-end: physical switchports, uplinks, bond, bridge, VLAN interfaces, and guest NICs
where relevant. If you can’t guarantee end-to-end, don’t half-do it. Run 1500 and spend your time on CPU pinning or storage layout instead.
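
In ifupdown terms, jumbo frames mean an explicit mtu on every layer you own (a sketch, assuming the bond0/vmbr0 layout from Task 13 later in this article; addressing omitted, and the physical switchports still have to agree):

auto bond0
iface bond0 inet manual
        bond-slaves ens3f0 ens3f1
        bond-mode 802.3ad
        bond-miimon 100
        mtu 9000

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000

Then verify it with the DF-bit ping from Task 7. Config files describe intent; only the wire confirms it.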

6) vSwitch security knobs vs Linux defaults

ESXi port groups have classic security toggles: Promiscuous Mode, MAC Address Changes, Forged Transmits. They’re there because virtual
switching is a shared medium, and some workloads (IDS, appliances) need special behavior. Many of the worst “mystery outages” in ESXi land
are just these set too strictly for a specific VM.

Proxmox/Linux bridging doesn’t present those exact knobs in the same way, but similar constraints exist: ebtables/nftables rules,
bridge filtering, and guest behavior. If you’re running virtual firewalls, you need to validate that the hypervisor will pass the frames
you expect, including VLAN tags and MAC behavior.
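
On ESXi you can read those toggles per port group from the shell (output trimmed; the port group name is a placeholder):

cr0x@server:~$ esxcli network vswitch standard portgroup policy security get -p "Trunk-to-FW"
   Allow Promiscuous: false
   Allow MAC Address Change: false
   Allow Forged Transmits: false

A bridging firewall or IDS appliance usually needs Forged Transmits (and sometimes MAC Address Change or Promiscuous) allowed on its port group; ordinary server VMs should not.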

7) Interesting facts & historical context (8 quick hits)

  1. 802.1Q VLAN tagging was standardized in 1998, and we’re still arguing about native VLANs like it’s new.
  2. Linux bridging has been production-grade for decades; it predates many enterprise virtualization stacks’ modern UIs.
  3. LACP (802.3ad / 802.1AX) was designed to standardize link aggregation across vendors; it standardized the arguments too.
  4. The ESXi standard vSwitch doesn’t support LACP at all; that’s vDS territory, so people built “good enough” teaming policies instead.
  5. VLAN 4095 in ESXi isn’t “a VLAN,” it’s a trunk passthrough signal for the port group to carry all VLANs to the guest.
  6. VXLAN/overlay networks became mainstream partly because VLAN scale (4094 usable VLAN IDs) was a real ceiling in large environments.
  7. NIC offloads (TSO/GSO/GRO/LRO) can change packet behavior enough to confuse captures and some security appliances.
  8. STP and bridges still matter: accidental loops can melt networks fast, whether the switch is physical or virtual.

8) Practical tasks: commands, outputs, decisions (12+)

These are the checks I actually run when something smells off. Each task includes: command, sample output, what it means, and the
decision you make.

Task 1: Identify physical NICs and link state (Proxmox)

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00
eno1             UP             3c:fd:fe:aa:bb:01
eno2             DOWN           3c:fd:fe:aa:bb:02
ens3f0           UP             0c:42:a1:11:22:33
ens3f1           UP             0c:42:a1:11:22:34
bond0            UP             0c:42:a1:11:22:33
vmbr0            UP             0c:42:a1:11:22:33

Meaning: You immediately see what’s up/down and whether bonds/bridges exist.
Decision: If a member NIC is down unexpectedly, stop blaming VLANs and go look at cabling/switchport errors first.

Task 2: Verify bond mode and LACP state (Proxmox)

cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 2
        Number of ports: 2
        Actor Key: 17
        Partner Key: 42
        Partner Mac Address: 00:11:22:33:44:55

Slave Interface: ens3f0
MII Status: up
Actor Churn State: none
Partner Churn State: none

Slave Interface: ens3f1
MII Status: up
Actor Churn State: none
Partner Churn State: none

Meaning: Confirms 802.3ad, hash policy, and that both slaves are actually aggregated with a partner MAC.
Decision: If you see “Number of ports: 1” or churn states, suspect switch LACP config mismatch or flapping links.

Task 3: Confirm bridge VLAN awareness and port membership (Proxmox)

cr0x@server:~$ bridge link
2: bond0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0
5: tap100i0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0
6: tap101i0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master vmbr0

Meaning: Shows what’s attached to the bridge: uplink and VM tap devices.
Decision: If the uplink isn’t in the bridge, your “trunk” is a philosophical concept, not a network.

Task 4: Inspect VLAN filtering on a Linux bridge (Proxmox)

cr0x@server:~$ bridge vlan show
port              vlan-id  
bond0             1 PVID Egress Untagged
bond0             20
bond0             30
tap100i0          20
tap101i0          30

Meaning: The bridge is filtering VLANs; bond0 allows VLANs 20 and 30 and has PVID 1 untagged.
Decision: If the VLAN you need isn’t listed, either allow it (if filtering is enabled) or fix the VM NIC tag configuration.

Task 5: Check host IP addressing and which interface owns it (Proxmox)

cr0x@server:~$ ip -br addr
eno1             UP             192.0.2.10/24
bond0            UP
vmbr0            UP             198.51.100.10/24
vmbr0.20         UP             10.20.0.10/24

Meaning: Shows if management IP is on a bridge, VLAN subinterface, or physical NIC.
Decision: If the host IP is on a physical NIC that’s also in a bridge, expect weirdness. Put IPs on the bridge (or on VLAN interfaces on top of the bridge), not on enslaved ports.

Task 6: Validate MTU consistency across uplink, bond, bridge (Proxmox)

cr0x@server:~$ ip -d link show bond0
3: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:11:22:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond mode 802.3ad miimon 100 updelay 0 downdelay 0 lacp_rate 1 xmit_hash_policy 1

Meaning: MTU is 9000 on the bond; you still need to verify the slave NICs, bridge, and switchports.
Decision: If MTU differs on any hop, either fix it end-to-end or drop back to 1500 and stop paying the jumbo-frames tax.

Task 7: Quick VLAN reachability test with ping + DF bit (Proxmox)

cr0x@server:~$ ping -c 2 -M do -s 8972 10.20.0.1
PING 10.20.0.1 (10.20.0.1) 8972(9000) bytes of data.
8980 bytes from 10.20.0.1: icmp_seq=1 ttl=64 time=0.421 ms
8980 bytes from 10.20.0.1: icmp_seq=2 ttl=64 time=0.398 ms

--- 10.20.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms

Meaning: Confirms jumbo MTU works to the gateway on that VLAN (at least for ICMP).
Decision: If this fails but normal ping works, you have MTU fragmentation/blackhole. Fix MTU or path MTU discovery behavior.

Task 8: See MAC learning on the bridge (Proxmox)

cr0x@server:~$ bridge fdb show br vmbr0 | head
0c:42:a1:11:22:33 dev bond0 master vmbr0 permanent
52:54:00:aa:bb:01 dev tap100i0 master vmbr0 vlan 20
52:54:00:aa:bb:02 dev tap101i0 master vmbr0 vlan 30

Meaning: Confirms the bridge is learning VM MACs per VLAN and where they live.
Decision: If a VM MAC isn’t present while the VM is up and generating traffic, suspect the VM NIC attachment, firewall rules, or that the guest is on the wrong VLAN.

Task 9: Capture tagged traffic on the uplink (Proxmox)

cr0x@server:~$ sudo tcpdump -eni bond0 -c 5 vlan
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:01:11.123456 0c:42:a1:11:22:33 > 01:00:5e:00:00:fb, ethertype 802.1Q (0x8100), length 86: vlan 20, p 0, ethertype IPv4, 10.20.0.10.5353 > 224.0.0.251.5353: UDP, length 44
12:01:11.223456 52:54:00:aa:bb:01 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 20, p 0, ethertype ARP, Request who-has 10.20.0.1 tell 10.20.0.50, length 28

Meaning: You can see VLAN tags on the wire and confirm the VLAN ID.
Decision: If tags are missing on a trunk that should carry them, your tagging is happening somewhere else—or not happening at all.

Task 10: ESXi—confirm vmnic link and driver info (on an ESXi shell)

cr0x@server:~$ esxcli network nic list
Name    PCI Device    Driver  Link  Speed  Duplex  MAC Address         MTU  Description
vmnic0  0000:3b:00.0  ixgbe   Up    10000  Full    3c:fd:fe:aa:bb:01  9000 Intel Corporation 82599ES
vmnic1  0000:3b:00.1  ixgbe   Up    10000  Full    3c:fd:fe:aa:bb:02  9000 Intel Corporation 82599ES

Meaning: Physical uplinks are up, at expected speed/MTU. This removes a whole class of “it’s the network” excuses.
Decision: If speed is 1000 when you expected 10000, stop. Fix optics/cabling/switchport configuration first.

Task 11: ESXi—inspect vSwitches and port groups

cr0x@server:~$ esxcli network vswitch standard list
vSwitch0
   Name: vSwitch0
   Class: etherswitch
   Num Ports: 128
   Used Ports: 12
   Configured Ports: 128
   MTU: 9000
   Uplinks: vmnic0, vmnic1
   Portgroups: Management Network, VM Network, Storage-20, Trunk-to-FW

Meaning: Shows MTU, uplinks, and what port groups exist on the standard vSwitch.
Decision: If the relevant port group isn’t on the vSwitch you think it is, you’ve found your miswire—logical, not physical.

Task 12: ESXi—verify VLAN IDs configured on port groups

cr0x@server:~$ esxcli network vswitch standard portgroup list
Name               Virtual Switch  VLAN ID
Management Network vSwitch0        10
VM Network         vSwitch0        0
Storage-20         vSwitch0        20
Trunk-to-FW        vSwitch0        4095

Meaning: Tells you whether frames are tagged by ESXi (VLAN 10/20) or passed through to the guest (4095).
Decision: If a VM expecting untagged traffic sits on VLAN 20, it will be “up” but isolated. Fix the port group VLAN ID, not the VM.

Task 13: Proxmox—validate config file is what you think it is

cr0x@server:~$ sed -n '1,200p' /etc/network/interfaces
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        bond-slaves ens3f0 ens3f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate 1

auto vmbr0
iface vmbr0 inet static
        address 198.51.100.10/24
        gateway 198.51.100.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

Meaning: Confirms bond mode, bridge VLAN-aware setting, and allowed VLAN range.
Decision: If bridge-vlan-aware is off but you’re relying on VM NIC tags, tags won’t behave as expected. Turn it on or change the design.

Task 14: Proxmox—check firewall rules interfering with L2/L3

cr0x@server:~$ sudo pve-firewall status
Status: enabled/running

Meaning: Firewall is on; now you must account for host and VM firewall rules in your mental model.
Decision: If you’re debugging “ARP works but TCP doesn’t,” temporarily isolating firewall influence (in a controlled window) is often faster than guessing.

9) Fast diagnosis playbook

When networking breaks on hypervisors, the fastest path is to find the layer where reality diverges from your diagram. Check in this order.

First: physical truth (link, speed, errors)

  • Is the NIC link up? Is it the expected speed?
  • On Proxmox: ip -br link, ethtool eno1 (if available), switchport counters.
  • On ESXi: esxcli network nic list.

Bottleneck indicator: speed mismatch (1G instead of 10G), increasing errors/drops, flapping links.

Second: L2 topology (bonding/LACP, bridge/vSwitch attachment)

  • Does the uplink actually attach to the bridge/vSwitch you think?
  • Is LACP negotiated and stable?
  • Proxmox: cat /proc/net/bonding/bond0, bridge link.
  • ESXi: vSwitch uplinks, and if vDS, confirm LAG membership.

Bottleneck indicator: only one slave active in a bond, churn, or uplink not present where it should be.

Third: VLAN correctness (tagging location, allowed VLANs, native VLAN)

  • Is the VLAN tag applied on the right object (VM NIC vs port group vs guest OS)?
  • Does the physical trunk allow the VLAN?
  • Proxmox: bridge vlan show, tcpdump -eni bond0 vlan.
  • ESXi: esxcli network vswitch standard portgroup list.

Bottleneck indicator: ARP requests visible but no replies on expected VLAN, or replies arriving on a different VLAN.

Fourth: MTU and fragmentation blackholes

  • Test with DF ping at expected MTU.
  • Verify MTU on every hop: physical switch, uplink NIC, bond, bridge/vSwitch, VMkernel (ESXi), guest NIC.

Bottleneck indicator: small pings work, large pings fail; storage stalls under load; migrations freeze intermittently.

Fifth: policy and filtering (firewalls, security settings, MAC behavior)

  • ESXi port group security settings block forged transmits or MAC changes for appliances.
  • Proxmox firewall/nftables blocks traffic you assumed was “just L2.”

Bottleneck indicator: one VM type fails (e.g., firewall VM) while normal servers work.

10) Three corporate-world mini-stories

Incident caused by a wrong assumption: “Untagged means VLAN 1”

A mid-sized company moved a set of workloads from ESXi to Proxmox during a hardware refresh. The virtualization team kept the same
physical top-of-rack switches and reused the existing trunk ports. In ESXi, most VM port groups were tagged VLANs, but management
was untagged on the uplinks because “that’s how it’s always been.”

On Proxmox, they built a VLAN-aware vmbr0 and put the management IP directly on vmbr0 without a VLAN tag.
The host came up, but the management network behaved inconsistently: some hosts were reachable, others weren’t, and ARP tables looked
like they’d been stirred with a stick.

The wrong assumption: the team believed the switch’s native VLAN on those trunk ports was VLAN 1. It wasn’t. A prior network standardization
had moved native VLAN to a dedicated “infra” VLAN, and VLAN 1 was explicitly pruned in places. Untagged frames were landing in the infra VLAN,
not the management VLAN. So the hosts were talking—just to the wrong neighborhood.

The fix wasn’t heroic. They tagged management properly (VLAN subinterface or VM NIC tag style) and made the trunk explicit:
allowed VLAN list, native VLAN defined, and documented. The incident report was short and slightly embarrassing, which is exactly how
you want it. The longer the report, the more you’re admitting the system was unknowable.

Optimization that backfired: “Let’s do LACP everywhere and crank MTU to 9000”

Another org ran ESXi with separate networks: management, storage, vMotion. They migrated to Proxmox and decided to “simplify” by
converging everything onto one LACP bond and one VLAN trunk. The pitch was clean: fewer cables, fewer switchports, better bandwidth utilization.

They also enabled jumbo frames end-to-end—or so they thought. Host MTU was 9000, switchports were set, and the bond looked healthy.
In the lab it worked. In production, during backup windows, storage latency spiked and VM migrations intermittently failed. The failures
weren’t total outages. They were worse: partial, unpredictable, and easy to dismiss as “the network being the network.”

The culprit was twofold. First, one switch in the path had an MTU setting applied to the wrong port profile, leaving a single hop at 1500.
Some traffic fragmented, some blackholed, depending on protocol and DF behavior. Second, the LACP hashing policy on the Linux bond was set
to layer3+4, while the switch side was effectively distributing flows differently for certain traffic classes. Under microburst load, one link
saturated while the other looked bored.

The rollback was pragmatic: keep LACP for VM/data VLANs, but put storage on a separate pair of NICs using active-backup and a dedicated VLAN,
with MTU validated hop-by-hop. Performance stabilized. The “simplification” had been real, but it had also removed isolation and made
troubleshooting harder. Convergence is a tool, not a personality.

Boring but correct practice that saved the day: “One bridge, one purpose, tested configs”

A financial services shop ran a mixed fleet: some ESXi clusters, some Proxmox clusters for specific workloads. Their network team insisted
on a rule that sounded like bureaucracy: every hypervisor host must have a small “mgmt-only” design that can be tested independently of VM networks.

On Proxmox, that meant a dedicated management bridge mapped to a dedicated VLAN with active-backup bonding, no cleverness. VM/data VLANs lived
on a separate trunk bridge, often on a different bond. Yes, it consumed more ports. Yes, it made cabling diagrams longer. It also meant you could
reboot, patch, and recover hosts without depending on a converged LACP trunk that might be misbehaving.

During a switch firmware upgrade, one rack’s LACP behavior degraded in a way that didn’t fully drop links but did cause churn and rebalancing.
VM networks saw packet loss. Management stayed clean. The ops team could still reach every host, evacuate VMs, and keep control of the blast radius.

The postmortem wasn’t glamorous. It was mostly a reminder that segmentation is a reliability strategy. When things break, you want a narrow failure
domain. Boring designs produce exciting uptime.

11) Common mistakes: symptom → root cause → fix

1) VMs can’t reach gateway on one VLAN, but other VLANs work

  • Symptom: Only VLAN 30 is dead; VLAN 20 works fine.
  • Root cause: VLAN not allowed on physical trunk, or bridge VLAN filtering doesn’t include it.
  • Fix: Ensure VLAN is in switch trunk allowed list; on Proxmox, confirm bridge-vlan-aware yes and bridge vlan show includes VLAN; on ESXi confirm port group VLAN ID.

2) Management connectivity dies when you add a bridge port

  • Symptom: Add NIC to bridge; host disappears from network.
  • Root cause: Host IP configured on enslaved physical NIC instead of bridge; ARP/MAC moves cause confusion.
  • Fix: Put IP on vmbrX (or on vmbrX.Y VLAN interface), not on the underlying NIC/bond.

3) LACP bond shows up, but only one link carries traffic

  • Symptom: Both links “up,” but utilization pins one link.
  • Root cause: Hashing policy unsuitable for traffic pattern (single-flow dominated), or switch hashing mismatch.
  • Fix: Align hashing policies; accept that a single TCP flow won’t stripe; consider multiple sessions or separate networks for heavy hitters.

4) Storage works until load, then timeouts

  • Symptom: NFS/iSCSI OK at idle, fails under backup/replication.
  • Root cause: MTU mismatch or microbursts on converged links; QoS absent; offloads interacting poorly.
  • Fix: Validate MTU with DF pings end-to-end; separate storage traffic or implement appropriate buffering/QoS; test offload changes carefully.

5) Virtual firewall/router VM sees traffic one way only

  • Symptom: Inbound packets visible, outbound missing (or reverse).
  • Root cause: ESXi port group security settings block forged transmits/MAC changes; or guest trunking misconfigured.
  • Fix: On ESXi, adjust port group security for that appliance; on Proxmox, ensure VLAN tags are passed/handled in the correct layer and firewall rules aren’t dropping.

6) Random duplicate IP warnings or ARP flapping

  • Symptom: Logs show duplicate IP; connectivity oscillates.
  • Root cause: Native VLAN mismatch causing untagged traffic to land in wrong VLAN; or accidental L2 loop.
  • Fix: Make native VLAN explicit and consistent; enable STP where appropriate; validate bridge ports and switchport configs.

12) Checklists / step-by-step plan

Checklist A: Designing VLAN trunking on Proxmox (do this, not vibes)

  1. Decide where VLAN tagging lives: VM NIC tags (VLAN-aware bridge) or host VLAN subinterfaces (bridge per VLAN). Pick one default.
  2. Make the uplink a bond if you need redundancy. Choose active-backup for management by default; use LACP only when you can validate switch config and hashing.
  3. Set bridge-vlan-aware yes on trunk bridges and define allowed VLANs (bridge-vids) intentionally.
  4. Define native/untagged behavior: ideally, don’t rely on untagged for anything important. Tag management too unless you have a hard reason not to.
  5. Validate MTU end-to-end with DF ping tests per VLAN.
  6. Capture traffic on the uplink with tcpdump -eni to confirm VLAN tags are present.
  7. Document: which bridge carries which VLANs, and whether any VM is a “guest trunk.”

Checklist B: Migrating from ESXi port groups to Proxmox VLAN tags

  1. Export or list ESXi port groups and VLAN IDs (include any 4095 trunks).
  2. Create Proxmox trunk bridge(s) and confirm VLAN awareness.
  3. For each VM, map: ESXi port group VLAN ID → Proxmox VM NIC VLAN tag.
  4. For VLAN 4095 VMs: ensure Proxmox VM NIC has no tag and the guest is configured for 802.1Q; validate with packet capture.
  5. Before cutover, test one VM per VLAN: ARP, ping, jumbo ping (if used), and application-level check.
  6. Keep management path isolated enough that you can revert without walking to the datacenter.

Checklist C: LACP rollout without self-inflicted pain

  1. Confirm your switches support LACP as configured (multi-chassis LAG/MLAG if spanning two switches).
  2. Set LACP rate (fast/slow) intentionally and match expectations on both sides.
  3. Align hashing policy. If you don’t know what your switch uses, find out. Don’t guess.
  4. Test failure modes: unplug one link; reboot one switch; confirm traffic survives and rebalances without flapping (see the sketch after this checklist).
  5. Monitor drops and errors during load. If you can’t measure, you can’t declare success.
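
For step 4, a minimal software rehearsal looks like this, assuming bond0 with slaves ens3f0 and ens3f1 (pulling the actual cable is still the more honest test):

cr0x@server:~$ sudo ip link set ens3f0 down
cr0x@server:~$ grep -A2 'Slave Interface: ens3f0' /proc/net/bonding/bond0
Slave Interface: ens3f0
MII Status: down
Speed: Unknown
cr0x@server:~$ sudo ip link set ens3f0 up

Keep a continuous ping to the gateway running during the whole exercise and watch for loss; then confirm the slave rejoins the aggregator (“Number of ports: 2”) before you call it done.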

13) FAQ

Q1: Is a Proxmox Linux bridge basically the same thing as an ESXi vSwitch?

Functionally, yes: both forward frames between VM interfaces and uplinks. Operationally, no: ESXi is an appliance abstraction;
Proxmox is Linux networking with all the power and sharp edges that implies.

Q2: Should I use one big trunk bridge for everything on Proxmox?

For many environments, one trunk bridge for VM/data VLANs is fine. But keep management boring and recoverable.
If you can afford it, separate management from the big converged trunk.

Q3: What is ESXi VLAN 4095 and what’s the Proxmox equivalent?

ESXi VLAN 4095 means “pass VLAN tags to the guest” (guest trunk). On Proxmox, you typically do this by not setting a VLAN tag on the VM NIC
and letting the guest create VLAN subinterfaces—while ensuring the bridge/uplink passes tagged frames.

Q4: Does LACP double bandwidth for a single VM?

Not for a single TCP flow. LACP spreads flows across links based on hashing. One big flow usually sticks to one link. Multiple flows can utilize multiple links.

Q5: Is active-backup bonding “wasting” bandwidth?

It “wastes” potential aggregate throughput, yes. It also buys you simpler failure modes. For management, that trade is usually correct.
For heavy east-west VM traffic, consider LACP if you can operationally support it.

Q6: Why do VLAN problems sometimes look like DNS or application bugs?

Because partial connectivity is the worst kind. ARP might work, small packets might work, but MTU or filtering can break specific protocols.
Your app then fails in creative ways that make everyone blame everyone else.

Q7: Should I run jumbo frames for storage networks?

Only if you can prove end-to-end MTU consistency and you’ve measured a benefit. If you can’t guarantee that, run 1500 and optimize elsewhere.

Q8: How do I quickly prove whether tagging is happening in the right place?

Capture on the uplink: on Proxmox use tcpdump -eni bond0 vlan; on ESXi you can use appropriate capture tools or observe port group VLAN config.
If you see 802.1Q tags with the VLAN you expect, tagging is real. If not, your config is wishful thinking.

Q9: What’s the most migration-friendly Proxmox VLAN design?

VLAN-aware bridge with VM NIC tags, plus a clearly tagged management network. It maps cleanly to ESXi port groups and avoids a sprawl of bridges.

Q10: When would you prefer bridge-per-VLAN on Proxmox?

When you want extreme clarity, stricter separation, or you’re integrating with legacy processes where “this network has its own interface”
is how people think and audit.

14) Practical next steps

If you’re running ESXi, treat port groups as your source of truth: VLAN IDs, security settings, and teaming rules. Audit them, export them,
and stop letting “temporary” 4095 trunks become permanent architecture.

If you’re running Proxmox, embrace the Linux reality: use VLAN-aware bridges intentionally, keep IPs on the right interfaces, and make bonding
a deliberate choice (active-backup for boring networks, LACP where you can validate switch behavior and hashing).

Next steps that pay back immediately:

  • Pick and document one default VLAN strategy per platform (tag on port group vs tag on VM NIC) and one exception strategy (guest trunk) with strict rules.
  • Run the fast diagnosis playbook on a healthy host and save the outputs. Baselines make outages shorter.
  • For any LACP or jumbo-frame plan: test failure modes and MTU end-to-end before you declare victory.