March 26, 2026 · 6 min

Automating Network Config on Live Hardware

Patrick McClory

The destination was never in question. The uncertainty was entirely in the path: how Ansible, RouterOS, and a task sequence would negotiate the journey from current state to desired state on live hardware.

networking automation infrastructure operations platform-engineering


Automating network configuration on live hardware is a specific class of problem. Not because the concepts are hard. The Ansible roles, the RouterOS API, the data model for VLANs and interfaces. All of that is tractable. The specific challenge is that the hardware is live while you’re working on it, which means intermediate states exist on the device between where you started and where you’re trying to go.

This is the story of getting the Quorum Systems network from bootstrapped to fully automated, the three times I lost connectivity in the process, and why none of it was particularly stressful.

The destination was never in question. The uncertainty was entirely in the path.

The Safety Net Came First

Before the first Ansible run against live hardware, the recovery infrastructure was already in place. IPMI on the R630 running RouterOS. A known-good bootstrap network state, the minimum viable config from the on-site session, version-controlled and ready to be reapplied. If everything went sideways, the worst case was zeroing the config and coming back to a clean starting point.

Running RouterOS on x86 hardware is useful here. IPMI is standard infrastructure on enterprise servers, not a proprietary vendor tool. When you lose network connectivity to the router, and when you’re automating live network config you should plan for that, you have a reliable, familiar out-of-band path back in regardless of what state RouterOS is in.

The math going in: if the sequence works the way I think it might, I win. If not, I know exactly what to do. That’s not recklessness. That’s the mental model that becomes available when you’ve designed the safety infrastructure before you need it. The IPMI didn’t just give me a recovery path. It gave me the freedom to experiment aggressively without anxiety about the outcome.

You can move fast on live hardware when the blast radius is bounded. The safety net is what makes the speed possible.

What MikroTik Taught Me About Sequencing

The three connectivity losses weren’t surprises in one sense: I knew the risk was real. They were surprises in another: my mental model of how MikroTik would execute the changes was wrong in a specific and instructive way.

I was thinking about the before state and the after state. MikroTik was executing every state in between.

The api_modify module in the community.routeros collection applies changes to RouterOS per edit, not per transaction. When you’re moving interfaces around, reassigning logical interfaces, adjusting address assignments, or restructuring the routing table, each individual edit lands on the device immediately. The intermediate states between “where you started” and “where you’re trying to go” are real states that the router passes through. If an intermediate state breaks connectivity, you find out the hard way.

My original task sequencing was what I’d call arbitrarily logical. It made sense as a planning exercise, ordered by what felt like a reasonable progression from the perspective of the desired end state. What it didn’t account for was the specific ways RouterOS handles interface moves mid-sequence. Order matters: add first, then remove. The sequence of individual edits has to respect how the hardware thinks about state transitions, not just how you think about the before and after.
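A toy model makes the difference concrete. This is a sketch, not RouterOS: it assumes a hypothetical device that stays reachable as long as it holds at least one management address, and it walks two orderings of the same pair of edits one step at a time. The addresses, function names, and the reachability rule are all illustrative assumptions.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

State = FrozenSet[str]      # toy device state: the addresses it currently holds
Edit = Tuple[str, str]      # ("add" | "remove", address)

OLD_ADDR = "192.0.2.10/24"      # documentation-range placeholders, not real config
NEW_ADDR = "198.51.100.10/24"


def apply_edit(state: State, edit: Edit) -> State:
    """Return the state after a single edit lands; nothing is batched."""
    action, address = edit
    if action == "add":
        return state | {address}
    if action == "remove":
        return state - {address}
    raise ValueError(f"unknown action: {action}")


def first_breaking_edit(
    start: State,
    edits: Iterable[Edit],
    reachable: Callable[[State], bool],
) -> Optional[int]:
    """Check every intermediate state; return the index of the first edit
    that leaves the device unreachable, or None if every state is safe."""
    state = start
    for index, edit in enumerate(edits):
        state = apply_edit(state, edit)
        if not reachable(state):
            return index
    return None


def has_management_address(state: State) -> bool:
    # Assumed rule for this sketch: any remaining address keeps us connected.
    return len(state) > 0


start = frozenset({OLD_ADDR})
remove_first = [("remove", OLD_ADDR), ("add", NEW_ADDR)]
add_first = [("add", NEW_ADDR), ("remove", OLD_ADDR)]

print(first_breaking_edit(start, remove_first, has_management_address))  # 0
print(first_breaking_edit(start, add_first, has_management_address))     # None
```

Both orderings end in the same final state. Only one of them passes through a state with no way in.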

Each connectivity loss taught me something about where my model and MikroTik’s model diverged. The losses weren’t failures. They were the cost of building the right mental model fast, under real conditions, with a safety net that made the learning affordable. Icarus, with a parachute.

The Public IP Space Is Where It Gets Real

The highest-stakes work was everything around the public IP configuration. That’s the interface between the platform and the internet. If the public-facing config is wrong or goes through a bad intermediate state, connectivity drops and the recovery path is IPMI, which works but adds friction.

I could have done more to mitigate this. Switching between IPv4 and IPv6 for management access during the configuration work would have provided an alternative path if one address family went down during a sequence. That’s a legitimate technique for reducing the blast radius of public interface changes. I didn’t use it here, which meant the public interface work required more care and slower iteration.

The goal during this phase was specific: get the VPN up and stable. Once the VPN was running and I could trust the public interface wasn’t going to surprise me, everything else was intra-VLAN work. With the default VLAN always available as a fallback, the risk profile dropped sharply. The VPN milestone was the “we got there” marker. Not fully configured, but past the hardest part.

Confidence by Milestone

The process wasn’t “automate everything and see what breaks.” It was a continual cycle of adding confidence by hitting specific checkpoints.

Bootstrap MVP up and verified. First. Don’t move until that’s solid.

Public interface stable, IPMI confirmed, basic routing working. Second. This is where the connectivity losses happened, and where the iteration on task sequencing produced the refined approach.

VPN up and trusted. Third. This is the inflection point. Once you can reach the router through an encrypted tunnel over the public interface, you have a second recovery path that’s more convenient than IPMI for most scenarios.

Intra-VLAN config and switching. Fourth. At this point the risk profile is different. Mistakes here are recoverable without touching the path that gets you to the device in the first place.

Each milestone was a checkpoint where the current state got verified before moving to the next phase. Not because verification is bureaucratic overhead. Because building on an unverified foundation compounds. If the public interface has a subtle issue and you don’t catch it before layering the VPN on top, you’re now debugging two things instead of one. The sequence discipline keeps the problem space small.
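The checkpoint discipline can be sketched as a gated loop. This is a minimal illustration, not the actual playbook: the phase names mirror the milestones above, while `run_phases`, `step`, and the verification lambdas are hypothetical stand-ins for real apply and verify steps.

```python
from typing import Callable, List, Tuple

# (name, apply step, verification check); all three are hypothetical stand-ins
Phase = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_phases(phases: List[Phase]) -> List[str]:
    """Apply phases in order and stop at the first failed verification.

    Returns the names of the phases that were applied and verified.
    """
    verified: List[str] = []
    for name, apply, verify in phases:
        apply()
        if not verify():
            break  # stop here so later phases never build on an unverified one
        verified.append(name)
    return verified


applied: List[str] = []  # records which apply steps actually ran


def step(name: str) -> Callable[[], None]:
    """Build a fake apply step that only records that it ran."""
    return lambda: applied.append(name)


phases: List[Phase] = [
    ("bootstrap", step("bootstrap"), lambda: True),
    ("public-interface", step("public-interface"), lambda: True),
    ("vpn", step("vpn"), lambda: False),  # simulate a failed VPN check
    ("intra-vlan", step("intra-vlan"), lambda: True),
]

verified = run_phases(phases)
print(verified)  # ['bootstrap', 'public-interface']
print(applied)   # ['bootstrap', 'public-interface', 'vpn']
```

The intra-VLAN step never runs in this example, so the only thing left to debug is the VPN phase.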

What the Ansible Role Looks Like Now

The unified roles/network/ role that came out of this process reflects the lessons directly. The task sequencing respects how RouterOS handles state transitions: add before remove, verify connectivity at specific checkpoints in the sequence, structure the interface operations to minimize the time spent in intermediate states that could break access.

The declarative api_modify approach means the role expresses desired state rather than imperative steps. Run it twice and it converges rather than duplicating configuration. But the order in which desired state gets applied still matters, and the role encodes the order that actually works rather than the order that seemed logical before contact with the hardware.
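The two properties in that paragraph, convergence and ordering, can be shown with a small sketch. This is not how api_modify is implemented; it is a toy diff over sets of hypothetical entry names, assuming only that adds should be planned before removes.

```python
from typing import List, Set, Tuple

Edit = Tuple[str, str]  # ("add" | "remove", entry name)


def plan_edits(current: Set[str], desired: Set[str]) -> List[Edit]:
    """Diff current against desired and emit adds before removes."""
    adds = [("add", item) for item in sorted(desired - current)]
    removes = [("remove", item) for item in sorted(current - desired)]
    return adds + removes


def converge(current: Set[str], desired: Set[str]) -> Set[str]:
    """Apply the planned edits one at a time and return the resulting state."""
    state = set(current)
    for action, item in plan_edits(state, desired):
        if action == "add":
            state.add(item)
        else:
            state.discard(item)
    return state


# Hypothetical entry names, chosen only for illustration.
current = {"mgmt-old", "vlan10"}
desired = {"mgmt-new", "vlan10"}

first_plan = plan_edits(current, desired)
after_first_run = converge(current, desired)
second_plan = plan_edits(after_first_run, desired)

print(first_plan)   # [('add', 'mgmt-new'), ('remove', 'mgmt-old')]
print(second_plan)  # []
```

The second plan is empty, which is the convergence property. The first plan is where ordering still matters.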

This is documented in [ADR-0029] for anyone who wants the formal decision record. The post you’re reading is the story behind it.

The Actual Lesson

Automating live network config isn’t inherently dangerous. It’s a specific class of risk that becomes manageable when you’ve thought carefully about three things before you start.

The recovery path. Before the first automated run, not after the first connectivity loss. IPMI, a known-good state, a clear procedure back to it.

The blast radius of each phase. Public interface work is higher risk than intra-VLAN work. Sequence accordingly. Don’t touch the high-risk stuff until the low-risk stuff is verified and trusted.

The hardware’s model of state transitions, not just yours. Your mental model of before and after is not the same as how the device executes the path between them. The sequencing has to respect both.

Get those three things right and the connectivity losses, when they happen, are learning events. You fly toward the sun. You know exactly what to do when the wax melts. And eventually you find the sequence that doesn’t melt it at all.

Boring automation is still the goal. Getting there sometimes requires a few unplanned lessons in how the hardware thinks.
