SWITCHPORT QUARTERLY / Issue 17, Vol. IV / 14 May 2026 / Field log: signed, sealed
Postmortem  ·  private VPC peering  ·  incident #2026-0473

The 3am packet that bit every other Tuesday.

For eleven weeks our payments cluster lost 47 minutes of useful throughput on a fortnightly cadence. Pings were fine. TLS handshakes were fine. Then, around 03:04 UTC, the first SYN of every fresh flow began arriving like a postcard sent the long way. This is the log of how we found it — and why the answer was a stale ARP entry on a partner ISP's distribution switch in Frankfurt.

Filed from FRA-12 cage 4B · Reading time: 22 minutes · Words: 5,840
11 wks: recurrence window
47 min: avg degradation
2,318: pcaps reviewed
03:04 UTC: consistent onset
14 d: cadence (fortnightly)
1 MAC: offending ARP entry
§ 01

The first hint, May 5 — a customer ticket that should have been a ping spike.

Field log
05·05·2026

The ticket came in at 09:12 local from a customer support escalation, not from our pager. That was the first clue I should have weighed more carefully — Prometheus had not lit a single rule. The merchant had complained that batch settlement files from his POS gateway timed out "around three in the morning, but only sometimes." Sometimes is the worst word in network engineering. Sometimes means the failure has a phase the dashboards can't see.

I pulled the customer's flows from our Grafana vpc-east-flows panel and the picture looked entirely banal: 99.6% success rate over thirty days, p99 RTT at 4.8 ms, no retransmit storms. The connection between his POS and our payments cluster crossed the private peering we operate with NorthRing Telecom, a regional ISP whose Frankfurt PoP sits one cage over from ours. Two physical fibers, LACP'd at 10G, eBGP for the public side, a static L2 peering for the private side. Stable since 2023.

The day before, May 4, had been a Monday. The customer's batch had run fine. So had every Monday batch. The Sunday batch had been fine. The Saturday batch had been fine. I scrolled the calendar back and noticed, with the slowed-down attention you get only on the third coffee, that the timeout events clustered: April 21, April 7, March 24, March 10. Tuesdays. Every other Tuesday. Eleven weeks of them, hidden by averages, hidden by the fact each event was small enough to look like a generic timeout.
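The calendar scroll can be checked mechanically. The following is a minimal sketch, assuming GNU date (for the -d flag); cadence is a hypothetical helper name, not something from our tooling. It prints the weekday of each date and the gap in days to the previous one.

```shell
#!/usr/bin/env bash
# Sketch: print weekday and gap in days for a list of incident dates.
# Assumes GNU date (the -d flag); cadence is a hypothetical helper name.
export LC_ALL=C   # keep weekday abbreviations in English

cadence() {
  local prev="" d epoch
  for d in "$@"; do
    epoch=$(date -u -d "$d" +%s)
    if [ -n "$prev" ]; then
      # Working in UTC keeps every day at exactly 86400 seconds.
      printf '%s %s gap=%dd\n' "$d" "$(date -u -d "$d" +%a)" "$(( (epoch - prev) / 86400 ))"
    else
      printf '%s %s gap=-\n' "$d" "$(date -u -d "$d" +%a)"
    fi
    prev=$epoch
  done
}

# The four dates named above, oldest first.
cadence 2026-03-10 2026-03-24 2026-04-07 2026-04-21
```

Every row after the first should read Tue with a 14-day gap, which is the pattern the averages were hiding.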

I ran a quick fping -l -p 200 10.42.7.5 against the customer's gateway from inside our cluster. Eleven thousand replies, zero loss, mean 4.81 ms, max 5.94 ms. The peering was not down. It was, on a particular schedule, getting confused for the first 0.8–1.4 seconds of the first new flow of the morning, and only the first one. The rest of the batch then queued behind a stack of retried SYNs and the gateway's idempotency window slammed shut at 60 seconds. By the time anyone noticed, the link was again indistinguishable from healthy.

This is the kind of bug I prefer to lose to. It isn't pretending to be a bigger problem. It isn't waving its hands. It does its damage, hides, and lets you blame the application. I closed my laptop and made a note in the incident log: see again on Tuesday May 19, 02:50 UTC, with a capture running.

§ 02

Reading tcpdump like a diary, not a search query.

Capture lab
19·05·2026

On Tuesday May 19 I came in at 02:30 UTC with a thermos and a Cat6 console cable. The capture I had planned was deliberately wide — broader than I usually allow myself, because I'd rather scroll a gigabyte of pcap than re-run a missed event. I started two tcpdumps, one on each side of the peering.

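# on the cluster side, 10.40.0.0/16 internal
# -G needs a strftime pattern in -w, or every rotation overwrites the same file;
# with -G, -W 4 stops the capture after four 30-minute files (2 h in total)
sudo tcpdump -i ens5 -nn -s 0 -G 1800 -W 4 -Z root \
  -w /var/cap/fra-05-19-%H%M.pcap \
  'host 10.42.7.5 or net 192.0.2.0/24 or arp'
# on the gateway facing the NorthRing peering, ens6 is the 10G LACP
# -s 96 keeps headers only; vlan 412 must precede the protocol tests
sudo tcpdump -i ens6 -nn -s 96 -w /var/cap/peer-05-19.pcap \
  '(vlan 412) and (tcp port 4443 or arp or icmp)'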

The capture filter on ens6 matters more than you might think. NorthRing tags the private peering on VLAN 412; without the vlan 412 predicate, BPF on Linux misses the dot1q-tagged ARP entirely, because the kernel had already de-tagged for the rest of the stack. This is the kind of thing you learn once, painfully, and never forget. I had learned it once, painfully, in 2019, in a different city, debugging a different problem.

   cluster                     peering                    partner POS
┌───────────┐               ┌───────────┐                ┌───────────┐
│ pay-gw-01 │  10G LACP →   │  7280SR3  │  ← VLAN 412 →  │ NR-DIST-2 │
│ 10.40.7.4 │ ←──────────── │  fra-12   │ ─────────────→ │ 192.0.2.1 │
└───────────┘               └───────────┘                └───────────┘
      │                           │                            │
      │ pcap on ens5              │                            │
      ▼                           ▼                            ▼
payments-gw-01.pcap     switch logs (syslog)         opaque, partner-side
                                                     this is where the bug lives

At 02:58:41 UTC the first sign appeared. A single ARP request from 192.0.2.1 asking for 192.0.2.5 — our side of the peering — broadcast on VLAN 412. Nothing unusual yet; ARP requests happen. What was unusual was that we had answered the same request from the same MAC fourteen times in the previous hour. Our gateway always replied within 40 microseconds. NorthRing's distribution switch, for some reason, was asking again every four to six minutes.
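The "every four to six minutes" figure is simple arithmetic over request timestamps. Here is a minimal sketch of that arithmetic; arp_gaps is a hypothetical helper, and the tshark invocation in the comment is an assumed way to feed it, not the exact command I ran that night.

```shell
#!/usr/bin/env bash
# Sketch: turn a column of epoch timestamps into gaps (minutes) between rows.
# arp_gaps is a hypothetical helper; input is one timestamp per line.
arp_gaps() {
  awk 'NR > 1 { printf "%.1f\n", ($1 - prev) / 60 } { prev = $1 }'
}

# Assumed usage against the peering-side capture from this section:
#   tshark -r /var/cap/peer-05-19.pcap \
#     -Y 'arp.opcode == 1 and arp.dst.proto_ipv4 == 192.0.2.5' \
#     -T fields -e frame.time_epoch | arp_gaps
```

A healthy neighbour would show gaps close to its ARP cache timeout; a run of short gaps from one requester is the kind of restlessness described above.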

Then, at 03:04:09 UTC, two ARP replies arrived almost on top of each other for the same address. One came from our gateway's ec:0d:9a:73:1c:bf. The other came from 00:50:56:91:a0:42 — a VMware OUI. We do not run VMware on this network. We have not run VMware on this network since 2022. There it was, in the trace, calmly replying as if it owned the address. The L2 was forked.

§ 03

The Wireshark filters that finally bit — and the ones that wasted my night.

Decoding
20·05·2026

Wireshark is a kind of dictionary you have to know how to read aloud. I spent the first two hours of May 20 going down filter rabbit holes that looked productive but weren't. Here, for the public record, is a selection: the ones that did not help alongside the five that did.

Filters tried during the May 19–20 capture review (selection)
Filter expression | Role | Hits | Useful?
------------------|------|------|--------
tcp.analysis.retransmission | noise sweep | 11,304 | No: overwhelmed by unrelated client churn.
tcp.flags.syn == 1 and tcp.flags.ack == 0 | flow opener | 2,981 | Partial: surfaced 03:04 SYNs but not their fate.
arp.duplicate-address-detected | L2 forensic | 2 | Yes. This is the hammer.
arp.opcode == 2 and not arp.src.hw_mac == ec:0d:9a:73:1c:bf | L2 forensic | 7 | Yes: exposed the ghost VMware MAC.
tcp.stream eq 4471 && tcp.flags.syn | flow trace | 3 | Yes: three retried SYNs spaced 1.0, 2.0, 4.0 s.
icmp.type == 3 and icmp.code == 1 | unreachable | 0 | No: no host-unreachable from upstream.
frame.time_delta_displayed > 0.5 | timing | 38 | Yes: clustered around 03:04:09–03:04:21.
vlan.id == 412 && arp | scoped L2 | 416 | Yes: refined the haystack to one VLAN.

The duplicate-address-detected expander told the story in one window. Wireshark had been quietly noticing, on every Tuesday capture, that 192.0.2.5 was being answered by two MACs within a few seconds of each other. We had two captures from previous Tuesdays — March 24 and April 7 — that I had filed and not opened. Both, when re-examined, contained the same two replies. The ghost MAC 00:50:56:91:a0:42 appeared, did its damage, and disappeared within a window of forty to sixty seconds. The distribution switch downstream, after this brief seizure, settled on our MAC again and the world looked normal.

tshark -r peer-05-19.pcap \
  -Y 'arp.opcode==2 and arp.src.proto_ipv4==192.0.2.5' \
  -T fields -e frame.time -e arp.src.hw_mac -e arp.src.proto_ipv4

# 03:04:09.412  ec:0d:9a:73:1c:bf  192.0.2.5   ← our gateway, expected
# 03:04:09.488  00:50:56:91:a0:42  192.0.2.5   ← who are you?
# 03:04:21.117  ec:0d:9a:73:1c:bf  192.0.2.5   ← gratuitous, ours
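
The same check can be made without opening Wireshark. What follows is a minimal sketch, not the tooling we ran: dup_claims is a hypothetical helper that reads "time mac ip" rows and prints any address claimed by more than one MAC. It assumes a single-token time column, so with real tshark output frame.time_epoch is a safer field than frame.time, which prints a date containing spaces.

```shell
#!/usr/bin/env bash
# Sketch: flag any IPv4 address claimed by more than one MAC in ARP replies.
# Input: whitespace-separated "time mac ip" rows. dup_claims is a hypothetical name.
dup_claims() {
  awk '
    { key = $3 SUBSEP $2 }
    !(key in seen) {                     # first time this ip/mac pair appears
      seen[key] = 1
      n[$3]++
      macs[$3] = (macs[$3] == "" ? $2 : macs[$3] "," $2)
    }
    END { for (ip in n) if (n[ip] > 1) print ip, macs[ip] }
  '
}

# The three rows from the tshark output above.
dup_claims <<'EOF'
03:04:09.412 ec:0d:9a:73:1c:bf 192.0.2.5
03:04:09.488 00:50:56:91:a0:42 192.0.2.5
03:04:21.117 ec:0d:9a:73:1c:bf 192.0.2.5
EOF
```

For these rows the helper prints 192.0.2.5 followed by both MACs in order of first appearance. Repeated replies from the same MAC, such as our gratuitous ARP, are counted once.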
§ 04

The cadence: why every other Tuesday, and why 03:04?

Pattern analysis
21·05·2026

By Thursday morning I had a packet-level diagnosis but no why. The why is always the harder thing. A timeline helped: I sat down with the incidents of the previous eleven weeks and the partner's public maintenance window page and worked through them in chronological order.

03·03·2026 · 03:04 UTC

First quiet incident — five SYN retries, no alarm.

Logged in retrospect. A single customer batch from Heidelberg Office Supply hit a 4-second handshake on its first flow; the rest of the night was clean. No ticket filed at the time.

Observation: Tuesday. NorthRing publishes a recurring change window every other Tuesday 03:00–04:00 UTC. Not visible to us until I went hunting.
17·03·2026 · 03:04 UTC

Second incident — two customers, both timed out within 90 seconds.

Both customers retried successfully on the next attempt. Total user-visible degradation: roughly 110 seconds. No SEV opened.

Observation: same minute, same vector. Fortnightly alignment now obvious in hindsight.
31·03·2026 · 03:04 UTC

Third — a quiet pattern hardens.

Identical timing. The merchant gateway operator noted in his own logs: "appears as if the route flipped for 8 seconds." He blamed his ISP. So would I have.

Observation: 8-second flap matches our gratuitous-ARP retransmit. L2 race hypothesis filed in private notes.
14·04·2026 · 03:04 UTC

Customer ticket finally lands.

The Heidelberg merchant's batch fails three times consecutively before succeeding. His support team escalates to ours. Ticket misrouted to the application team for six days.

Observation: misrouting cost us a full cycle. Pager should have fired on L2 anomaly metric — we had none.
28·04·2026 · 03:04 UTC

Two merchants drop in tandem — first real escalation.

The pattern is now undeniable. The fortnightly cadence, the 03:04 onset, the L2 fingerprint. I open the May 19 capture plan.

Observation: cause and effect locked. Need partner ISP visibility. Email to NorthRing NOC sent 03·05·2026.
19·05·2026 · 02:58 → 03:04 UTC

The capture night — duplicate ARP captured live.

Two MACs answering for one address, 76 ms apart. The ghost VMware OUI confirmed in three independent pcap segments. Cluster gateway issues its own gratuitous ARP at 03:04:21, twelve seconds later, and the L2 fork settles.

Observation: pattern is a fixed-time event on partner side. ARP cache the prime suspect.
21·05·2026 · 14:00 UTC

NorthRing concedes a maintenance script.

After the capture and a screen-share, NorthRing's NOC identifies a legacy Ansible playbook that, every other Tuesday at 03:04, flushes-and-rebuilds the ARP table on NR-DIST-2 in Frankfurt. The flush is fine. The rebuild reads from a stale inventory file.

Observation: the playbook re-inserts a static ARP entry from 2022 — the retired VMware host. Root cause.

Once you see it, you cannot unsee it. The fortnightly cadence wasn't mysterious: it was a cron job. The 03:04 onset wasn't poetic: it was the rebuild step at the tail end of a roughly four-minute maintenance script that starts at 03:00. The ghost MAC was a literal ghost: an Ansible inventory entry from a server that had been decommissioned more than three years earlier but never removed from a group_vars/static_arp.yml file. It returned, faithfully, every fourteen days, like a sleepwalker.

The fortnightly cadence wasn't mysterious. It was a cron job reading a stale inventory file from 2022.
— Postmortem summary, paragraph 14
§ 05

An ARP entry that refused to die — and what it taught us about state.

Root cause
22·05·2026

An ARP entry isn't supposed to have a life of its own. It is supposed to be transient, lazy, and replaced by whatever the wire most recently said. The whole point of the protocol is that the network is what it appears to be at any given moment, and the moment is short. We forget how fragile that contract is once a static override enters the picture.

NorthRing's playbook did not, as I'd first feared, push a broken configuration into running config. The damage was subtler. After the flush, the playbook walked the static_arp.yml manifest and inserted each entry into the running ARP table with a 240-second timeout. For 240 seconds, the distribution switch believed our peering address belonged to a server that had not existed for more than three years. Any frame destined for 192.0.2.5 during that window was forwarded toward a MAC that no port on the switch had seen in years, and the switch, ever obliging, flooded those frames out every port in VLAN 412 looking for a home.

That flooding wasn't free. It pushed the per-port input queue on NorthRing's downstream distribution port up to about 38% of its 1G capacity for the duration of the flood. Most of the time, the queue absorbed it. Sometimes — and here is where the "every other Tuesday" pattern took its specific shape — a coincident burst of legitimate batch traffic on the same VLAN tipped the queue over its early-drop threshold, and SYNs were dropped silently. Not RST'd. Not unreachable'd. Dropped, like postcards into a drawer.

The cure was both trivial and embarrassing. NorthRing rewrote the playbook to apply the static-ARP manifest as a diff against the live ARP table rather than as a blind replace. They also, finally, removed the eleven ghost entries. We watched the next window (Tuesday June 2, 03:04 UTC) together over a video call. The capture showed exactly one ARP request, exactly one reply from us, exactly zero ghost MACs. The fortnight after that, the same. The fortnight after that, the same.
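
I have not seen NorthRing's rewritten playbook, so what follows is only a sketch of the idea of checking a manifest against live state before trusting it. stale_entries is a hypothetical helper, and the two-column "ip mac" file format is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list manifest entries that the live ARP table does not confirm.
# Both inputs are files of "ip mac" rows; stale_entries is a hypothetical helper
# and this is not NorthRing's actual playbook logic.
stale_entries() {
  local manifest=$1 live=$2
  # First file loads live ip -> mac; second file prints rows that are absent or differ.
  awk 'FILENAME == ARGV[1] { live[$1] = tolower($2); next }
       !($1 in live)           { print $1, $2, "absent";   next }
       live[$1] != tolower($2) { print $1, $2, "mismatch" }' "$live" "$manifest"
}

# Demo with the two MACs from this incident.
tmp=$(mktemp -d)
printf '192.0.2.5 00:50:56:91:a0:42\n' > "$tmp/manifest"
printf '192.0.2.5 ec:0d:9a:73:1c:bf\n' > "$tmp/live"
stale_entries "$tmp/manifest" "$tmp/live"
rm -r "$tmp"
```

The demo prints the ghost entry tagged mismatch. A row that disagrees with the wire is a question for a human, not something to re-insert on a timer.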

If this story has a moral, it is the one every network engineer eventually carves into the inside of their forehead: infrastructure that survives its owner is infrastructure waiting to bite. The VMware host died in 2022. Its IP address survived in someone's YAML, and its MAC address survived in someone's arp -s, and on a fortnightly basis it remembered itself.

§ 06

Talking to the partner ISP — what we asked, what we got, what we paid for in coffee.

Vendor liaison
23·05·2026

A polite, evidence-laden e-mail to NorthRing's NOC at 03:00 UTC on Sunday May 3 was the highest-leverage action of this entire investigation. I attached three pcap excerpts, a timeline of incidents, and the exact filter that surfaced the ghost ARP. I did not speculate about cause. I described, in flat declarative prose, what I had observed.

I. letter

What I sent at 03:00 on Sunday.

One paragraph of context. Three pcap files, each under 4 MB, named for the date they were taken. One screenshot of the duplicate-address-detected expander. A single question: "Is there scheduled work on VLAN 412 fortnightly at 03:00 UTC?"

Reply time: 09 h 14 m
II. screen-share

What they showed me on Tuesday afternoon.

Their senior NOC engineer, Bartłomiej K., opened the Ansible playbook in front of me. Line 47 was a loop over static_arp.yml. He'd inherited the playbook from a colleague who'd left in late 2023. Nobody had touched static_arp.yml since.

Duration: 41 m
III. remediation

What we agreed to do, and by when.

Within 24 hours: ghost entries removed. Within seven days: playbook rewritten to diff. Within thirty days: shared dashboard for VLAN 412 anomalies. We met all three deadlines — they on day 2, day 4, day 21; we on day 9.

SLA hit: 100%
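#!/usr/bin/env bash
# Run on a payments gateway, watches for ghost MAC reappearance.
# Lives in /usr/local/sbin/arp-ghost-watch; runs from a systemd timer.
target="192.0.2.5"
expected_mac="ec:0d:9a:73:1c:bf"
ts="$(date -u +%FT%TZ)"
# No -f here: it stops at the first reply and would hide a second responder.
# MACs are matched by pattern and lowercased, because arping prints them
# in a layout (brackets, case) that differs between implementations.
seen="$(arping -c 3 -I ens6 "$target" \
  | grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | tr 'A-F' 'a-f' | sort -u | paste -sd, -)"
if [[ "$seen" != "$expected_mac" ]]; then
  logger -t arp-ghost-watch "GHOST $ts saw=${seen:-none} expected=$expected_mac"
  curl -sS -X POST "${ALERT_URL:?ALERT_URL is not set}" -d "ghost-arp $ts ${seen:-none}"
fi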

The watcher has fired exactly zero true positives since June 2. It has fired twice during planned maintenance on our side and both alerts were correctly suppressed by the silencing label. I don't trust the silence. I trust the watcher. I will keep it running for at least a year.

§ 07

Lessons, runbooks, and the things we changed in the dark.

Aftercare
09·06·2026

A postmortem isn't worth filing if it doesn't change the rooms it touches. Here are the seven things this incident changed, ordered by usefulness rather than chronology.

Concrete changes filed against this incident (ID #2026-0473)
Change | Role | Cost (h) | Owner
-------|------|----------|------
Add duplicate-address-detected as a Prom alert via a sidecar tshark. | detection | 6.0 | Maren A.
Quarterly review of static-ARP manifests on both sides of all peerings. | hygiene | 2.5 / q | NetOps
Move every "every other Tuesday" partner maintenance into shared calendar with annotations on our Grafana. | visibility | 3.5 | SRE liaison
Add gratuitous-ARP from our gateway every 60 s during 03:00–04:00 UTC on Tuesdays (temporary, until June 30 2026). | mitigation | 1.5 | Maren A.
Customer-facing status page now distinguishes peering events from backbone events. | comms | 4.0 | Status WG
Document this incident in the on-call runbook with the exact Wireshark filter that bit. | runbook | 2.0 | Maren A.
Push for a formal L2 health metric in the next NorthRing peering contract renewal. | commercial | 12.0 | Procurement
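
The first row, the sidecar alert, is the one most worth sketching. This is an illustration under assumptions rather than the deployed sidecar: to_metrics is a hypothetical helper, the metric name is the one used in this postmortem, and the textfile path in the comment is invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: count duplicate-address events per VLAN and emit Prometheus text format.
# Input: one VLAN ID per line, one line per flagged frame. to_metrics is a
# hypothetical helper; the metric name follows the one quoted in this postmortem.
to_metrics() {
  awk '
    $1 != "" { count[$1]++ }
    END {
      print "# TYPE duplicate_address_detected_total counter"
      for (v in count)
        printf "duplicate_address_detected_total{vlan=\"%s\"} %d\n", v, count[v]
    }
  ' | LC_ALL=C sort   # awk iteration order is unspecified; sort keeps output stable
}

# Assumed usage, once per one-minute capture slice (output path is hypothetical):
#   tshark -r slice.pcap -Y 'arp.duplicate-address-detected' \
#     -T fields -e vlan.id | to_metrics > /var/lib/node_exporter/textfile/arp_dup.prom
```

Two caveats on the sketch: it counts per slice, so a true counter would need to accumulate across runs, and an empty slice emits only the TYPE line, which leaves the series absent rather than zero.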

The thing I keep returning to, the thing I want every junior engineer at our cage to take from this story, is the cadence of the work itself. I did not discover the ghost ARP. I let the network discover it for me, by waiting until 02:30 UTC on a Tuesday with a thermos and two tcpdumps. The patient hours of capture were the entire investigation. Wireshark filters are tools, not searches; they help you read a thing you already have rather than fetch a thing you don't.

Frequently asked, often badly.

Why didn't you see this in Prometheus from day one?
We had no L2 metric. Our Prom rules covered ICMP loss, TCP retransmits as a ratio, BGP session state, and link utilization. The duplicate-MAC event was, by design, invisible to all of those. The mitigation: a sidecar tshark process exporting duplicate_address_detected_total per VLAN per minute, alerting on any non-zero value.
Couldn't you have just pinned the MAC with arp -s on your side?
We considered it for about ninety seconds. Pinning the MAC on our side only helps for traffic leaving us. The asymmetric direction — partner-to-us — still went through NorthRing's poisoned cache. The fix had to live on the partner's switch.
Why didn't the gratuitous ARP from your gateway clear the cache faster?
It did, eventually, within roughly twelve seconds. But the SYN drops happened in the first 0.8–1.4 seconds after the bad ARP reply was processed. By the time our gratuitous announcement settled the cache, the customer's SYN had already been queued and dropped, and the sender was waiting on its first SYN retransmit at 1.0 s.
What about BFD? Wouldn't BFD have caught this?
No. BFD watches the L3 reachability between two endpoints. Our peering's L3 was never down. The L2 was forked for 240 seconds and BFD packets between our gateways traversed it perfectly happily — they were unicast frames with the correct destination MAC the moment they entered the switch, and they came out the right port. The ghost only stole flows that depended on the broken ARP mapping for first-frame delivery.
Could a smaller MTU or jumbo frames have masked or worsened this?
Neither. MTU is irrelevant to ARP table state. We do run 1500-byte frames across the peering; the customer's batch payloads are large but TCP segments them cleanly. The damage was confined to the first three frames of each new flow, regardless of size.
How do you know the playbook ran at 03:00 if you don't have access to the partner's logs?
We don't, directly. But the timing was a tell: the script logged its own start to NorthRing's syslog at 03:00:02 UTC every fortnight, and the syslog timestamps later shared with us by Bartłomiej K. line up exactly with our captures. We are no longer guessing.
Will you publish the pcaps?
Redacted excerpts, yes. They will be linked in the colophon in October, under the title "FRA-12 / NorthRing peering, May 19 2026", once a vendor review clears. The full multi-gigabyte set will not be public; it contains customer flow metadata we cannot share.