You spin up a VM. It boots. It gets an IP. DNS looks fine. And yet: no internet. Ping to the gateway times out like it’s on strike. You start muttering about “Ubuntu networking changes” and “maybe the image is broken,” and before you know it you’ve restarted libvirtd three times and learned nothing.
This is nearly always a layer problem: wrong bridge design, wrong VLAN handling, wrong default route, or your host firewall quietly doing its job a little too well. The fix is not “try random config until it works.” The fix is to build the bridge/VLAN properly, then verify it like an SRE: one hop, one table, one decision at a time.
A sane mental model: what must be true for a VM to reach the internet
When a VM “has no internet,” it’s tempting to treat it like a single problem. It isn’t. It’s a chain of conditions. Break any link and your VM becomes a very expensive localhost.
The non-negotiable chain
- Guest link is up: virtio/e1000 NIC present, carrier on, correct MAC, correct driver.
- Guest has correct IP configuration: IP/mask, default route, DNS.
- Guest traffic egresses the guest: ARP/ND works, packets leave the guest NIC.
- Host forwards L2 correctly: the VM tap/vnet interface is enslaved to the right bridge; bridge is up and forwarding.
- VLAN tagging matches reality: untagged vs tagged frames are exactly what the upstream switch expects.
- Host has an uplink to the real network: bridge is attached to the physical NIC (or to the right VLAN subinterface), with carrier.
- Upstream switch port is correct: access vs trunk, allowed VLANs, native VLAN behavior is known (not assumed).
- Firewalling isn’t silently blocking: host nftables/iptables, libvirt filters, cloud-init guest firewall, or rp_filter.
- Path beyond the gateway works: NAT, routing, or upstream ACLs permit the traffic.
Most outages land in the middle: the host bridge is fine but VLAN tagging is wrong, or the VM is plugged into the wrong bridge, or libvirt NAT is used when you think you’re bridged.
One operational principle: stop guessing where the packet died. Put eyes on each hop until you find the first point where reality diverges from your mental model. That’s the fix.
Interesting facts and context (because the past is still on your network)
- Linux bridging has been around since the early 2000s, and it grew up alongside virtualization; KVM didn’t invent the need, it just industrialized it.
- VLAN tagging is older than most cloud “networking products”. IEEE 802.1Q showed up in the late 1990s and still wins on simplicity: one wire, many networks.
- Libvirt’s default network (virbr0) is NAT. It’s great for laptops and terrible for “my VM must be on the same LAN as everything else.”
- Netplan is not a network daemon. It’s a configuration translator that typically targets systemd-networkd on servers (or NetworkManager on desktops).
- Linux bridges can filter and tag VLANs (bridge VLAN filtering). You can do “switch-like” behavior in the kernel, with per-port VLAN membership.
- Spanning Tree Protocol (STP) isn’t just for physical switches. A Linux bridge can participate, and enabling STP in the wrong place can add seconds of “why is nothing passing?” after link-up.
- Reverse path filtering (rp_filter) has caused more “but ping works one way” incidents than anyone likes to admit, especially on multi-homed hosts.
- Systemd’s network stack matured a lot over the last decade. What used to require handcrafted /etc/network/interfaces glue is now declarative and predictable—if you actually declare the right thing.
- VLAN “native” behavior differs by vendor and by configuration. The word “native” is where assumptions go to die.
Also: VLANs are like org charts. Everyone thinks they’re simple until they have to change one.
Paraphrased idea: Werner Vogels has long pushed "you build it, you run it": the operational feedback loop is part of engineering, not an afterthought.
Fast diagnosis playbook
This is the “I need signal in five minutes” flow. It assumes the VM is supposed to be bridged to a real network, optionally via VLAN.
First: prove the VM is actually on the bridge you think it is
- Check the VM’s vnet/tap interface exists on the host.
- Confirm it’s enslaved to the right bridge (not virbr0, not some orphan bridge).
- Confirm the bridge has the physical uplink attached (or correct VLAN subinterface attached).
Second: find the first failed hop from inside the VM
- Ping the default gateway.
- If that fails, look at ARP/neighbor table and capture traffic on the VM interface and the uplink.
- If the gateway works but internet doesn’t, check DNS vs routing vs upstream ACLs.
Third: validate VLAN expectations on host and switch
- If the VM is untagged, the bridge/uplink must be access/native on the right VLAN.
- If the VM is tagged, either the guest tags frames (802.1Q inside the guest) or the host bridge applies tags (bridge VLAN filtering). Pick one approach and stick to it.
- Confirm allowed VLAN list on the switchport; “trunk” without “allowed” is how VLANs disappear.
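When you need to see the tags rather than argue about them, capture on the uplink with link-level headers shown. A minimal sketch, assuming the uplink is enp3s0 and the VLAN in question is 10; swap in your own names:
# Show Ethernet headers (-e) so 802.1Q tags are printed; match only frames tagged VLAN 10
sudo tcpdump -eni enp3s0 vlan 10
# Compare with the untagged traffic on the same wire
sudo tcpdump -eni enp3s0 not vlan
If frames leave untagged where the switch expects tag 10 (or arrive tagged where you expected untagged), you have found the mismatch.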
Fourth: rule out host firewall and kernel knobs
- Check nftables ruleset for drops on bridge traffic.
- Confirm bridge netfilter settings; you may be filtering L2 transit without realizing it.
- Check rp_filter on host if multi-homed or policy routing exists.
If you follow that sequence, you usually find the culprit before your coffee cools.
Hands-on tasks: commands, expected output, and what you decide next
These are real operational tasks. Run them in order until the output tells you where the packet stops. Each task includes: command(s), what output means, and the decision you make.
Task 1: Identify the host uplink and link state
cr0x@server:~$ ip -br link
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
enp3s0 UP 3c:ec:ef:12:34:56 <BROADCAST,MULTICAST,UP,LOWER_UP>
br0 UP 3c:ec:ef:12:34:56 <BROADCAST,MULTICAST,UP,LOWER_UP>
vnet3 UP fe:54:00:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP>
Meaning: enp3s0 has carrier (LOWER_UP). br0 exists and is UP. If your physical NIC is DOWN/NO-CARRIER, stop here; you don’t have an uplink.
Decision: If the uplink is down, fix cabling/switch/driver first. If it’s up, proceed to bridge membership.
Task 2: Confirm bridge membership (who is plugged into br0)
cr0x@server:~$ bridge link
3: enp3s0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master br0
5: vnet3 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> master br0
Meaning: Both the physical NIC and VM interface vnet3 are on br0. If vnet3 is on virbr0 or no master, the VM is not on your intended network.
Decision: If the VM is not enslaved to the correct bridge, fix the libvirt network attachment before touching VLANs or firewalling.
Task 3: Verify host IP configuration lives on the bridge (not the physical NIC)
cr0x@server:~$ ip -br addr show br0 enp3s0
br0 UP 192.0.2.10/24 fe80::3eec:efff:fe12:3456/64
enp3s0 UP fe80::3eec:efff:fe12:3456/64
Meaning: The host’s IPv4 address is on br0. That’s what you want for a simple bridged design. If IPv4 sits on enp3s0 while br0 is also in play, you’re in “it might work until it doesn’t” territory.
Decision: If the host IP is on the wrong interface, fix netplan so the bridge owns the IP.
Task 4: Check that the VM sees an IP, route, and DNS
cr0x@server:~$ virsh domifaddr vm1
Name MAC address Protocol Address
-------------------------------------------------------------------------------
vnet3 fe:54:00:aa:bb:cc ipv4 192.0.2.50/24
Meaning: The VM has an IPv4 address. That doesn’t prove it can reach the gateway, but it proves DHCP (or static config) happened.
Decision: If there’s no address, debug guest DHCP/VLAN before blaming the host firewall.
Task 5: From inside the VM, test first hop (gateway) and second hop (public IP)
cr0x@vm1:~$ ping -c 3 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.547 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.510 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.522 ms
--- 192.0.2.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2040ms
cr0x@vm1:~$ ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=55 time=12.1 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=55 time=12.0 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=55 time=12.2 ms
Meaning: If gateway ping fails, you’re dealing with L2/VLAN/bridge attachment or a local firewall. If gateway works but public IP fails, it’s routing/NAT/upstream ACL.
Decision: Choose the branch. Don’t troubleshoot DNS when you can’t hit the gateway.
Task 6: Inspect ARP/neighbor table for the gateway MAC
cr0x@vm1:~$ ip neigh show
192.0.2.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
Meaning: REACHABLE with a MAC means ARP completed. If it’s INCOMPLETE/FAILED, the VM is shouting into the void (or VLAN mismatch is eating frames).
Decision: If ARP is incomplete, capture traffic and verify VLAN tagging.
Task 7: Capture traffic on the host vnet interface (is the VM emitting ARP?)
cr0x@server:~$ sudo tcpdump -ni vnet3 -c 10 arp or icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vnet3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:00:01.000000 ARP, Request who-has 192.0.2.1 tell 192.0.2.50, length 28
12:00:02.000000 ARP, Request who-has 192.0.2.1 tell 192.0.2.50, length 28
Meaning: The VM is sending ARP. If you never see ARP here, the problem is inside the guest (wrong interface, down link, wrong route) or the VM isn’t actually connected.
Decision: If ARP exists on vnet but no ARP reply returns, the drop is between bridge and uplink (VLAN, switchport, or host filtering).
Task 8: Capture on the host uplink (does ARP leave the box?)
cr0x@server:~$ sudo tcpdump -ni enp3s0 -c 10 arp
listening on enp3s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:00:01.000100 ARP, Request who-has 192.0.2.1 tell 192.0.2.50, length 28
Meaning: Seeing ARP on enp3s0 means the bridge forwarded it out. If you see it on vnet3 but not on enp3s0, the bridge is blocking or VLAN filtering is misconfigured.
Decision: If it doesn’t leave the box, inspect bridge VLAN filtering and STP state.
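Two quick reads narrow that down: per-port STP state and per-port VLAN membership. A minimal sketch, assuming the interface names used above:
# Port state: 'forwarding' is what you want; blocking/listening/learning explains silence right after link-up
bridge -d link show dev enp3s0
# VLAN membership and PVID for the uplink and the VM port
bridge -d vlan show dev enp3s0
bridge -d vlan show dev vnet3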
Task 9: Inspect bridge forwarding database (FDB) and MAC learning
cr0x@server:~$ bridge fdb show br br0 | head
fe:54:00:aa:bb:cc dev vnet3 master br0
00:11:22:33:44:55 dev enp3s0 master br0
Meaning: The bridge learned the VM MAC on vnet3 and the gateway MAC on enp3s0. If the gateway MAC never appears, upstream replies aren’t making it back (switchport/VLAN) or are being filtered.
Decision: No learned MACs = look for VLAN mismatch or STP blocking.
Task 10: Check VLAN configuration on the host (VLAN devices + bridge VLAN filtering)
cr0x@server:~$ ip -d link show br0 | sed -n '1,12p'
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 1 vlan_protocol 802.1Q
cr0x@server:~$ bridge vlan show
port vlan-id
enp3s0 10 PVID Egress Untagged
vnet3 10 PVID Egress Untagged
br0 10 PVID Egress Untagged
Meaning: vlan_filtering=1 means the bridge is acting like a VLAN-aware switch. PVID 10 with Egress Untagged on the uplink, the VM port, and the bridge itself means untagged traffic is classified into VLAN 10 and leaves untagged.
Decision: If your switchport expects tagged VLAN 10 but you’re sending untagged, fix either the switchport or the bridge VLAN settings. Pick one truth.
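If you decide the host is the side to change, the per-port VLAN table is editable at runtime. A minimal sketch, assuming VLAN 10 and the port names above; these commands do not survive a reboot, so encode the result in your declarative config afterwards:
# Send VLAN 10 tagged on the uplink (trunk-style) instead of untagged
sudo bridge vlan del dev enp3s0 vid 10
sudo bridge vlan add dev enp3s0 vid 10
# Keep the VM port as an untagged access port in VLAN 10
sudo bridge vlan add dev vnet3 vid 10 pvid untagged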
Task 11: Check netfilter/bridge settings that can silently drop bridged traffic
cr0x@server:~$ sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.bridge.bridge-nf-call-arptables
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-arptables = 0
Meaning: With bridge-nf-call-iptables=1, your host firewall can filter forwarded bridged frames. That can be fine, but it’s a common “why only VMs break?” cause.
Decision: If you rely on nftables for host security, keep this enabled but ensure your rules allow the VM traffic. If you don’t, disable it deliberately and document why.
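Either way, make the choice explicit and persistent instead of inheriting whatever the last module load left behind. A minimal sketch, assuming you keep filtering enabled; the file names are examples, and you would flip the values if you deliberately disable it:
# These sysctls only exist while the br_netfilter module is loaded
lsmod | grep br_netfilter
# Load the module at boot and persist the decision
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.bridge.bridge-nf-call-ip6tables = 1\n' | sudo tee /etc/sysctl.d/90-bridge-nf.conf
sudo sysctl --system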
Task 12: Inspect nftables rules for drops affecting forwarding
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority filter; policy drop;
iif "lo" accept
ct state established,related accept
tcp dport 22 accept
}
chain forward {
type filter hook forward priority filter; policy drop;
ct state established,related accept
iifname "br0" oifname "br0" accept
}
chain output {
type filter hook output priority filter; policy accept;
}
}
Meaning: A default-drop forward chain is fine only if explicit accepts cover your VM traffic. With bridge-nf-call-iptables=1, bridged frames traverse this chain with the logical bridge device (br0) as both input and output, so the intra-br0 accept above covers the purely bridged path. But if the host routes or NATs VM subnets out the uplink, that traffic shows up as br0 → enp3s0, nothing here accepts it, and VM egress dies.
Decision: Add an accept for the path you actually use: keep br0 → br0 for bridged designs, add br0 → uplink (or a more precise rule per VLAN/subnet) for routed ones. Don’t “policy accept” your way out of thinking.
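A minimal sketch of the kind of accept you might add, assuming the table and chain names shown above and br0/enp3s0 as bridge and uplink; scope it per VLAN or subnet where you can, and trace a packet if you are not sure which rule it hits:
# Routed/NAT designs: allow traffic entering on br0 and leaving via the uplink
sudo nft add rule inet filter forward iifname "br0" oifname "enp3s0" accept
# To see exactly which rule a packet matches, mark it for tracing and watch the trace
sudo nft insert rule inet filter forward meta nftrace set 1
sudo nft monitor trace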
Task 13: Confirm libvirt network type (bridged vs NAT)
cr0x@server:~$ virsh domiflist vm1
Interface Type Source Model MAC
-------------------------------------------------------
vnet3 bridge br0 virtio fe:54:00:aa:bb:cc
Meaning: Type=bridge Source=br0 means the VM is actually bridged. If you see Type=network Source=default, you’re on libvirt NAT (virbr0).
Decision: If it’s NAT but you expect bridged, change the VM NIC network source.
Task 14: Verify routes on the host (when the host is also your router/NAT)
cr0x@server:~$ ip route
default via 192.0.2.1 dev br0 proto dhcp src 192.0.2.10 metric 100
192.0.2.0/24 dev br0 proto kernel scope link src 192.0.2.10
Meaning: Simple case: host default route is via the LAN gateway. If your design uses the host as a router for VM VLANs, you’ll need additional routing and IP forwarding.
Decision: If the host is meant to route between VLANs, confirm IP forwarding is enabled and that the upstream network has return routes pointing the VM subnets back at this host.
Task 15: Check IP forwarding and rp_filter on the host (routing cases)
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv4.conf.all.rp_filter net.ipv4.conf.br0.rp_filter
net.ipv4.ip_forward = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.br0.rp_filter = 1
Meaning: ip_forward=0 means the host won’t route packets. rp_filter=1 can drop asymmetric traffic (common with policy routing or multiple uplinks).
Decision: If the host is a router, enable ip_forward and set rp_filter appropriately (often 2 for loose mode in multi-homed scenarios).
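A minimal sketch of flipping those knobs deliberately and persistently, assuming the host really is meant to route; the sysctl.d file name is an example:
# Runtime change first, so you can test before persisting
sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
# Persist; the kernel uses the higher of the 'all' and per-interface rp_filter values
printf 'net.ipv4.ip_forward = 1\nnet.ipv4.conf.all.rp_filter = 2\n' | sudo tee /etc/sysctl.d/91-routing.conf
sudo sysctl --system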
Task 16: Validate DHCP actually comes from the expected VLAN
cr0x@server:~$ sudo tcpdump -ni enp3s0 -c 20 udp port 67 or udp port 68
listening on enp3s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:05:01.000000 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request
12:05:01.050000 IP 192.0.2.2.67 > 192.0.2.50.68: BOOTP/DHCP, Reply
Meaning: You can see the DHCP server answering. If DHCP works but gateway ARP fails, you might be getting DHCP from a relay or a different VLAN than you think.
Decision: If the DHCP server IP surprises you, stop and reconcile VLAN and switchport configuration.
Joke #1: DHCP is like office coffee: when it’s missing, everyone suddenly becomes a network engineer.
The right design patterns (bridge + VLAN) for Ubuntu 24.04 hosts
There are three patterns that work reliably. The bad pattern is “a little bit of each” because you copied a snippet from three different blog posts at 2 a.m.
Pattern A: Untagged bridge (single VLAN / access port)
Use when: your virtualization host and VMs live on one network, and the switchport is an access port (or trunk with a well-defined native VLAN that you actually control).
Design: physical NIC → bridge br0; br0 holds host IP; VMs attach to br0. No VLAN filtering. No VLAN subinterfaces.
Why it’s good: fewer moving parts. Less to mis-tag. Fewer “allowed VLANs” arguments with the network team.
Failure mode: someone changes the switchport to a trunk or moves you to a different VLAN and doesn’t tell you; you stay untagged and end up in the wrong place.
Pattern B: VLAN subinterface on uplink + per-VLAN bridge
Use when: you want VMs in multiple VLANs, but you’d rather keep VLAN logic simple and explicit.
Design: enp3s0 is a trunk on the switch. Create VLAN interfaces on host (enp3s0.10, enp3s0.20). Create bridges br10 and br20, each attached to the corresponding VLAN interface. Attach VMs to br10 or br20.
Why it’s good: very easy to reason about. You can see what’s tagged where. It’s also easy to firewall per bridge if you must.
Failure mode: people forget to allow the VLAN on the switch trunk. Or they attach the VM to br0 (wrong bridge) and then swear VLANs are “broken.”
Pattern C: VLAN-aware bridge with bridge VLAN filtering
Use when: you want one bridge and you want it to behave like a small switch: trunk uplink, access ports for VMs, possibly trunks to special VMs, and VLAN membership controlled on the host.
Design: enp3s0 is attached directly to br0; br0 has vlan_filtering=1. You assign VLAN membership per port (enp3s0 and vnetX). VMs can be untagged (access) or tagged (trunk) depending on how you configure them.
Why it’s good: powerful, scalable, clean topology. One bridge, many VLANs, fewer interfaces.
Failure mode: powerful means easy to shoot yourself. A missing PVID or wrong egress tagging rule makes traffic vanish with no drama and no apology.
My operational bias: if you’re small-to-medium and want fewer surprises, Pattern B is the sweet spot. If you’re building a platform and you can afford disciplined configuration management and testing, Pattern C is excellent.
Joke #2: VLANs don’t “randomly break.” They break in ways that are perfectly deterministic—just not necessarily documented.
Netplan examples that work (and why)
Ubuntu 24.04 typically uses netplan to generate systemd-networkd configuration on servers. The most common mistake is mixing NetworkManager assumptions with networkd reality, or leaving half-configured interfaces around.
Example 1: Simple untagged bridge br0
Switchport: access VLAN X (untagged).
Host: IP lives on br0 via DHCP or static.
cr0x@server:~$ sudo cat /etc/netplan/01-br0.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      dhcp4: no
      dhcp6: no
  bridges:
    br0:
      interfaces: [enp3s0]
      dhcp4: yes
      parameters:
        stp: false
        forward-delay: 0
Why: enp3s0 has no IP; br0 does. STP off avoids 30+ seconds of “why can’t I reach anything after reboot?” on simple topologies.
Example 2: Per-VLAN bridges (Pattern B)
Switchport: trunk; VLANs 10 and 20 allowed/tagged.
Host: management on VLAN 10; VMs also use VLAN 20.
cr0x@server:~$ sudo cat /etc/netplan/01-vlan-bridges.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      dhcp4: no
      dhcp6: no
  vlans:
    enp3s0.10:
      id: 10
      link: enp3s0
    enp3s0.20:
      id: 20
      link: enp3s0
  bridges:
    br10:
      interfaces: [enp3s0.10]
      dhcp4: yes
      parameters:
        stp: false
        forward-delay: 0
    br20:
      interfaces: [enp3s0.20]
      dhcp4: no
      dhcp6: no
      parameters:
        stp: false
        forward-delay: 0
Why: br10 and br20 are explicit. Your VM goes on br20, and you can’t accidentally land it on the management network unless you choose to.
Example 3: VLAN-aware bridge (Pattern C) with host management VLAN only
This is the advanced move. The host’s own IP sits on a VLAN interface (management VLAN), while the bridge carries multiple VLANs for VMs.
cr0x@server:~$ sudo cat /etc/netplan/01-vlan-aware-bridge.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      dhcp4: no
      dhcp6: no
  bridges:
    br0:
      interfaces: [enp3s0]
      dhcp4: no
      parameters:
        stp: false
        forward-delay: 0
  vlans:
    br0.10:
      id: 10
      link: br0
      dhcp4: yes
Why: br0 is purely L2; br0.10 is the host’s management L3 presence. VMs can be attached to br0 and placed into VLANs via bridge VLAN filtering (configured outside netplan, typically via networkd or explicit bridge commands at boot).
Operational warning: netplan doesn’t express everything you may want for bridge VLAN filtering per port. If you go Pattern C, treat it like a small switching platform: configure it consistently and test after every change.
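For orientation, this is roughly what the runtime per-port VLAN programming looks like in Pattern C, assuming VLAN 10 for host management and VLAN 20 for VM workloads, with the interface names used earlier. It is a sketch: these commands are not persistent, which is exactly why you want them generated by networkd or configuration management rather than typed by hand:
# Make br0 a VLAN-aware switch
sudo ip link set br0 type bridge vlan_filtering 1
# Uplink carries both VLANs tagged (trunk)
sudo bridge vlan add dev enp3s0 vid 10
sudo bridge vlan add dev enp3s0 vid 20
# The bridge itself needs VLAN 10 so the host's br0.10 management interface works
sudo bridge vlan add dev br0 vid 10 self
# A VM port as an untagged access port in VLAN 20
sudo bridge vlan add dev vnet3 vid 20 pvid untagged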
Applying netplan safely
cr0x@server:~$ sudo netplan try
Do you want to keep these settings?
Press ENTER before the timeout to accept the new configuration
Changes will revert in 120 seconds
Meaning: netplan try gives you a rollback window. Use it on remote systems unless you enjoy out-of-band consoles.
Decision: If connectivity drops, wait for rollback and fix your YAML calmly.
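After the try window closes, confirm what was actually applied rather than what you believe the YAML says. A minimal sketch:
# The merged configuration netplan will render
sudo netplan get
# What systemd-networkd actually did with each interface
networkctl list
networkctl status br0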
Libvirt/KVM attachment: avoiding the “it’s on virbr0” trap
Libvirt defaults are optimized for “developer laptop runs a VM with internet via NAT.” In production, you usually want bridged networking so your VM is a first-class citizen on the LAN/VLAN.
Check what networks exist
cr0x@server:~$ virsh net-list --all
Name State Autostart Persistent
--------------------------------------------
default active yes yes
Meaning: The default libvirt NAT network exists. That’s not evil; it’s just often not what you want.
Decision: Decide explicitly: NAT (default) vs bridge (br0/br10/br20). Don’t let libvirt decide by accident.
Attach a VM NIC to a bridge (example)
cr0x@server:~$ virsh attach-interface --domain vm1 --type bridge --source br20 --model virtio --config
Interface attached successfully
Meaning: Persistent config changed. You may still need to detach the old NIC or reboot depending on how the guest handles hotplug.
Decision: After changes, verify with domiflist and then validate from inside the VM.
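If the VM previously had a NIC on the NAT network, remove it rather than leaving two default-route candidates in the guest. A minimal sketch; the MAC is a placeholder for the old NIC’s address as shown by virsh domiflist:
# Remove the NAT-attached NIC from the persistent definition (placeholder MAC)
sudo virsh detach-interface --domain vm1 --type network --mac 52:54:00:00:00:01 --config
# And from the running guest, or reboot it instead
sudo virsh detach-interface --domain vm1 --type network --mac 52:54:00:00:00:01 --live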
Confirm VM NIC landed where you intended
cr0x@server:~$ virsh domiflist vm1
Interface Type Source Model MAC
-------------------------------------------------------
vnet3 bridge br20 virtio fe:54:00:aa:bb:cc
Meaning: You’re bridged to br20 now. If internet still fails, it’s not because you accidentally stayed on virbr0.
Decision: Continue with VLAN and firewall checks.
Common mistakes: symptom → root cause → fix
1) VM gets an IP but cannot ping the gateway
Symptom: DHCP works, but first-hop ping fails.
Root cause: DHCP is coming from somewhere else (relay, wrong VLAN), or gateway is in a different VLAN than the VM’s effective VLAN.
Fix: Capture ARP on vnet and uplink. Confirm VLAN tagging and switchport mode. Align access/native VLAN with the untagged bridge, or tag VLAN correctly.
2) VM can ping gateway but not public IPs
Symptom: L2 is fine; L3 beyond gateway fails.
Root cause: Upstream routing/NAT/ACL issue, or wrong default route in guest.
Fix: Check guest routing table. Check upstream ACLs. If host is doing routing/NAT, enable ip_forward and add NAT rules intentionally.
3) VM can reach public IPs but DNS fails
Symptom: ping 1.1.1.1 works, ping a hostname fails.
Root cause: DNS servers unreachable, wrong resolv.conf/systemd-resolved config, or blocked UDP/TCP 53.
Fix: Query DNS directly; check firewall for UDP/TCP 53. Validate systemd-resolved status in the guest.
4) Only some VLANs work; others are dead
Symptom: VLAN 10 fine, VLAN 20 dead across all VMs.
Root cause: Switch trunk allowed list missing VLAN 20, or host bridge VLAN table missing membership for VLAN 20.
Fix: Add VLAN to switch allowed list and host VLAN configuration. Verify with bridge vlan show and tcpdump with vlan filter.
5) Internet works until reboot; then VMs are isolated
Symptom: Manual bridge/vlan tweaks worked, but didn’t persist.
Root cause: Runtime bridge commands weren’t encoded in netplan/systemd-networkd units; reboot wipes state.
Fix: Make configuration declarative (netplan + networkd drop-ins) and version-controlled. Test with reboot as part of the change.
6) Host can reach the network; VMs cannot
Symptom: Host ping works; VM ping fails; bridge looks okay.
Root cause: nftables forward policy dropping bridged forwarding, or bridge netfilter interacting with firewall rules.
Fix: Inspect nft forward chain and bridge-nf sysctls. Add explicit accept rules for VM subnets/bridges.
7) VM traffic is intermittent; ARP flaps
Symptom: Sometimes gateway reachable; sometimes not; MAC addresses seem to move.
Root cause: Duplicate IPs, MAC spoofing filters on switch, or multiple bridges uplinked creating a loop (and STP not protecting you).
Fix: Check for duplicate ARP replies, audit switch security features, and ensure you have exactly one L2 path to the VLAN unless you’re intentionally doing redundancy.
8) Everything works, but performance is terrible
Symptom: High latency, low throughput, drops under load.
Root cause: MTU mismatch (especially with VLAN tagging), offload quirks, or the host CPU burning cycles due to firewalling/conntrack on bridged traffic.
Fix: Verify MTU end-to-end. Evaluate offload settings. Don’t run a surprise stateful firewall in the forwarding path without sizing it.
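A minimal MTU sanity check, assuming a standard 1500-byte path and the gateway address used in the examples above; run the pings from inside the guest and compare interface MTUs on the host:
# 1472 bytes of payload + 28 bytes of ICMP/IP headers = 1500; -M do forbids fragmentation
ping -c 3 -M do -s 1472 192.0.2.1
# If that fails while smaller probes pass, something in the path has a lower MTU
ping -c 3 -M do -s 1400 192.0.2.1
# On the host, the guest NIC, bridge, and uplink MTUs should agree
ip link show dev br0
ip link show dev enp3s0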
Checklists / step-by-step plan
Step-by-step: Build a bridged VM network on a single VLAN (boring and correct)
- Pick Pattern A (untagged) if you only need one VLAN.
- Configure netplan so the physical NIC has no IP and is enslaved to br0.
- Apply netplan with netplan try.
- Verify: ip -br addr shows IP on br0.
- Attach VM NIC to br0 (bridge type in libvirt).
- Verify: bridge link shows vnetX master br0.
- Test in VM: ping gateway, then public IP, then DNS.
- If failure: tcpdump on vnetX and uplink to find first missing frame.
Step-by-step: Add VLANs without making future-you miserable
- Choose Pattern B unless you have a strong reason for VLAN-aware bridging.
- Get the switchport configured as trunk with explicit allowed VLAN list.
- Create VLAN subinterfaces on the host (enp3s0.10, enp3s0.20).
- Create per-VLAN bridges (br10, br20). Put host management on one VLAN only.
- Attach VMs to the correct bridge for their VLAN. Do not “just use br0 for everything.”
- Document VLAN → bridge mapping in the repo that stores netplan configs.
- Test with reboot. Always.
Validation checklist (run after every change)
- Host: bridge membership correct (bridge link).
- Host: IP on the bridge (or on bridge VLAN interface, intentionally).
- Host: VLAN membership aligns with switchport expectations.
- Host: firewall forward chain permits required flows.
- Guest: correct route + DNS.
- Packet: ARP leaves VM, leaves host, and replies come back.
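That checklist is easy to turn into a small capture script whose output you attach to every change (the habit that pays off in mini-story #3 below). A minimal sketch with assumed table and chain names; run it with sudo and adjust to your environment:
#!/usr/bin/env bash
# validate-vmnet.sh: snapshot the facts that matter after a network change (sketch)
set -euo pipefail
out="/var/tmp/vmnet-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== bridge membership =="; bridge link
  echo "== vlan table =="; bridge vlan show
  echo "== addresses =="; ip -br addr
  echo "== routes =="; ip route
  echo "== nft forward chain =="; nft list chain inet filter forward
} > "$out" 2>&1
echo "wrote $out"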
Three corporate mini-stories from the networking trenches
Mini-story #1: The incident caused by a wrong assumption
In one shop, a virtualization cluster was “standardized” on a trunk uplink: VLAN 10 for management, VLAN 20 for workloads. The engineer who built the first host assumed the switchport had a native VLAN of 20 because “that’s what we usually do.” They built Pattern A: untagged br0, VMs attached, no VLAN tagging anywhere.
It worked in their rack. It failed in the next rack. Same server model, same netplan file, same hypervisor build. Half the VMs had internet; half couldn’t even ARP the gateway. The networking team swore nothing had changed. Then someone asked the only question that mattered: “Are the switchports actually identical?”
They weren’t. One rack had native VLAN 20. Another had native VLAN 10. A third had no native VLAN configured the same way because a different template had been used months earlier. Untagged traffic landed wherever the switch decided it belonged, which is a polite way of saying “you’re not in control.”
The fix was not “more retries” or “new VM images.” The fix was to stop depending on native VLAN behavior. They moved to Pattern B: explicit tagged VLAN subinterfaces and per-VLAN bridges. It took a maintenance window, but after that, racks stopped being snowflakes. The postmortem had one key line: assumptions are configuration, just undocumented.
Mini-story #2: The optimization that backfired
Another org wanted to reduce interface sprawl on their hosts. They had a bridge per VLAN, and someone decided that was “too many devices.” So they moved to a single VLAN-aware bridge with filtering, configured via a set of runtime scripts triggered at boot. It looked clean: one bridge, trunk uplink, VMs assigned VLANs dynamically.
Then the first real outage: after a kernel update and a reboot, a subset of hosts came up with missing VLAN entries on some vnet ports. No one noticed immediately because the hosts themselves were fine on the management VLAN. But tenant VMs on certain VLANs were isolated. The symptoms were classic: DHCP timeouts, ARP incomplete, no gateway reachability.
The root cause wasn’t Linux “forgetting VLANs.” It was their own scripting and ordering. systemd-networkd brought up the bridge, libvirt spawned VMs early, and the VLAN membership script ran after the VMs were already attached—without backfilling port VLAN rules reliably. Some ports had PVID set, some didn’t. A race condition became a network policy.
They rolled back to Pattern B for most clusters and kept Pattern C only where they had time to encode VLAN rules in a deterministic, versioned, boot-order-safe method. The optimization didn’t fail because VLAN-aware bridging is bad. It failed because the deployment method wasn’t as boring as it needed to be.
Mini-story #3: The boring but correct practice that saved the day
A financial services team ran KVM on Ubuntu with bridged networking and VLANs. Nothing glamorous. But they had one habit: after every network change, they ran a standard validation script that captured three things—bridge membership, VLAN table, and nftables forward policy—and stored the output with the change ticket.
One morning, a set of VMs lost outbound connectivity. The host looked “up.” The bridge was “up.” The switchport was “up.” This is where teams start rebooting things until something changes. They didn’t. They compared last known-good outputs to current outputs.
The diff was small and decisive: the forward chain policy had changed to drop, and the accept rule for br0 → uplink wasn’t present. It turned out a host hardening role had been updated and applied broadly; it was correct for standalone servers and incorrect for virtualization hosts forwarding bridged traffic.
The fix took minutes: add the missing forward accept rule (properly scoped), redeploy the hardening role with host-type awareness, and restore service. The boring practice wasn’t genius. It was simply evidence. Evidence beats panic every time.
FAQ
1) Should I use NetworkManager or systemd-networkd on Ubuntu 24.04 servers?
Use systemd-networkd (via netplan) for servers unless you have a specific reason to standardize on NetworkManager. Mixing them is how “ghost configs” happen.
2) My VM is supposedly attached to br0, but it still has a 192.168.122.x address. Why?
That’s libvirt NAT (the default network) handing out its own subnet. Check virsh domiflist. If the source is default and type is network, you’re not bridged.
3) Do I need STP on a Linux bridge?
Usually no for a single uplink, single bridge, no loops. STP can add forwarding delays and confusion. Enable it only when you know you have potential L2 loops and you want protection.
4) What’s the cleanest way to put VMs into different VLANs?
Per-VLAN bridges (Pattern B) are the cleanest operationally. VLAN-aware bridge filtering (Pattern C) is powerful but demands more disciplined configuration.
5) Can the guest do VLAN tagging itself?
Yes. You can present a trunk to a VM and let it create VLAN subinterfaces inside the guest. That’s appropriate for router/firewall appliances or Kubernetes nodes that manage their own VLANs. For ordinary VMs, keep VLAN logic on the host or upstream network.
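For completeness, inside such a guest the VLAN subinterface is just the standard Linux construct. A minimal sketch, assuming the guest sees the trunk as eth0, the VLAN is 20, and the address is a placeholder:
# Inside the guest: create and raise a VLAN 20 subinterface on the trunked NIC
sudo ip link add link eth0 name eth0.20 type vlan id 20
sudo ip link set eth0.20 up
sudo ip addr add 198.51.100.50/24 dev eth0.20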
6) Why does DHCP work but ARP to the gateway fails?
Because DHCP can succeed through a relay or a mis-tagged path in a way that doesn’t guarantee L2 adjacency to the gateway. Prove VLAN correctness with tcpdump and ARP state, not with “it got an IP.”
7) Is it safe to filter VM traffic with nftables on the host?
Yes, if you do it intentionally and you understand whether bridged frames traverse iptables/nftables via bridge-nf settings. The unsafe version is “policy drop” with missing forward accepts.
8) How do I know if the problem is the switchport or the host?
Capture on both sides of the host bridge: vnet interface and physical uplink. If packets leave vnet but not the uplink, it’s host-side. If they leave the uplink but replies never return, it’s upstream (switch/VLAN/ACL).
9) Should my host have an IP address on every VLAN my VMs use?
No. In a pure L2 bridged design, the host doesn’t need L3 presence on VM VLANs. Add host IPs only when you have a clear operational need (monitoring, routing, services) and you can secure it.
10) What about MTU and VLAN overhead?
802.1Q tagging adds overhead; mismatched MTUs can cause weird partial failures (especially with PMTUD blocked). If you run jumbo frames, verify MTU end-to-end across host NIC, bridge, VM NIC, and switchport.
Conclusion: next steps that won’t hurt later
If your Ubuntu 24.04 VM “has no internet,” resist the urge to reconfigure everything at once. Your job is to find the first broken hop, then fix the design so it stays fixed.
- Pick a pattern: A (untagged), B (per-VLAN bridges), or C (VLAN-aware bridge). Don’t blend them casually.
- Make the host configuration declarative: netplan + networkd, applied with netplan try.
- Prove bridge membership: vnet interface on the correct bridge; physical uplink attached.
- Prove VLAN reality: bridge VLAN table matches switchport mode and allowed VLANs.
- Prove forwarding isn’t blocked: nftables forward chain and bridge-nf settings align with your intent.
- Operationalize validation: keep a small set of commands you run after every change, and store the output.
When you do it this way, “VM has no internet” stops being a mystery and becomes what it should have been all along: a straightforward fault isolation exercise with a permanent fix at the end.