// WRITING / TAGS / PLATFORM ENGINEERING

Building the layer between application teams and the infrastructure they depend on.

Platform engineering is the discipline of building internal platforms that make application teams faster, safer, and more autonomous. The work spans cluster lifecycle, automation, developer experience, observability, and the unglamorous operational substrate that everything else stands on.

Most of what we write about lives here, partly because it's where we spend most of our time and partly because the platform layer is where most of the interesting boundary problems show up. The platform sits at the seam between the people building product and the systems running it. Decisions made here shape how every team downstream can move.

These posts cover concrete tooling choices like kubeadm, GitOps, and Cilium, as well as the philosophical questions: what a platform actually is, how to know when you've built one, and how to keep it from becoming the thing your developers route around.

// POSTS 18 entries
  1. FIG. 01

    The Cloud Didn't Simplify Infrastructure. It Redistributed the Complexity.

    The cloud didn't make infrastructure simpler. It moved the complexity somewhere less visible and replaced some of it with operational surfaces you have to learn from scratch. The 'cloud is awesome' moment and the 'I have no idea how this actually works' moment are often the same capability viewed six months apart.

  2. FIG. 02

    GitOps Is Not Continuous Delivery: The Difference Matters

    GitOps and continuous delivery are not the same thing. Most teams conflate them in ways that create real operational problems. GitOps is a deployment and reconciliation model. Continuous delivery is a software delivery practice. They compose well but they're solving different problems, and treating them as synonyms produces systems where neither works as well as it should.

  3. FIG. 03

    Eight Weeks to Twelve Minutes

    We took an eight-week release cycle down to twelve minutes. The pipeline work was the easy part. What the acceleration revealed was that most of those eight weeks wasn't work. It was dwell time, and the processes that owned that dwell time had never had to justify themselves against a world that moved faster.

  4. FIG. 04

    Idempotency Is a Promise

    Tools offer idempotency as a good intention. Systems need it as a guarantee. Knowing when to step in and upgrade one to the other is what separates automation that works from automation that works until it doesn't.

  5. FIG. 05

    The Workaround Becomes the Curriculum

    When a team doesn't understand their tooling well enough to trust it, they build around it. The build-around becomes what everyone learns. Nobody ever learns the tooling. The distrust gets baked into onboarding. New engineers inherit the wrong mental model and build on top of it. The workaround defends itself.

  6. FIG. 06

    Why Your CI/CD Pipeline Has 47 Steps and Nobody Knows Why

    The pipeline doesn't have 47 steps because 47 things need to happen. It has 47 steps because trust eroded over time and every erosion event got a new step added on top. The steps aren't doing work. They're doing anxiety.

  7. FIG. 07

    The Overnight Evangelist

    The overnight evangelist is often the most motivated person in the room, responding to a real signal that something needs to change. The problem isn't that they found something. It's the leap from 'I got this working' to 'everyone must use this everywhere.' And the org's response is usually wrong in both directions.

  8. FIG. 08

    Why Infrastructure Is Always Somebody's Second Priority

    Infrastructure work has a visibility problem baked into the nature of the work itself. When it's working nobody notices. When it fails everyone notices. That asymmetry shapes every prioritization conversation infrastructure teams ever have, and it doesn't fix itself with better communication.

  9. FIG. 09

    Network Automation Is Harder Than Server Automation. Do It Anyway.

    Network automation is harder than server automation because the blast radius of a mistake is immediate and the feedback loop is brutal. That brutality is actually the advantage. It forces you to get the abstractions right in a way that server automation lets you defer.

  10. FIG. 10

    SOPS, Age, and the Regex You Think Is a Glob

    SOPS with age keys is the right answer for secrets in a GitOps repo, but two things will silently break you before you understand what's happening: path_regex is not a glob, and 'sops metadata not found' can mean at least three different things.

  11. FIG. 11

    The Operational Surface Is the Cost Nobody Counts

    Most architecture evaluations compare tools in isolation. The better question is: what's the right tool given the operational surface I'm already committed to? Adding a new tool has a real cost that almost never shows up in the analysis. Reusing what's already there has a real value that almost never gets counted.

  12. FIG. 12

    You Can't Outsource Understanding

    You can delegate the work. You can use managed services. You can hire people who know the thing you don't. What you can't do is outsource the comprehension. When something breaks at 2am, the understanding either exists or it doesn't.

  13. FIG. 13

    The Plan Is Not the Schedule

    Good planning isn't about staying on schedule. It's about making better decisions in flight, taking on deliberate technical debt with clear eyes, and arriving at the right destination even when the route changes.

  14. FIG. 14

    handle_absent_entries: remove Almost Deleted Everything

    The thing that makes declarative automation powerful is exactly the thing that makes it dangerous. I wrote a user management task with handle_absent_entries: remove, defined a partial list, and RouterOS refused to execute because it would have deleted the last user with full access permissions. The safety net caught it. The lesson is about knowing where aggressive automation ends and self-inflicted disaster begins.

  15. FIG. 15

    MikroTik Will Delete Everything. It's Still the Right Choice.

    The 24-hour activation window is real. The support response time on a Friday night is real. The disk wipe if you miss the window is real. MikroTik is still the right choice. All of these things are true at the same time.

  16. FIG. 16

    Automating Network Config on Live Hardware

    The destination was never in question. The uncertainty was entirely in the path: how Ansible, RouterOS, and a task sequence would negotiate the journey from current state to desired state on live hardware.

  17. FIG. 17

    What Done Looks Like in Infrastructure Work

    Infrastructure work has a 'done' problem that software delivery has mostly solved, and we haven't caught up. 'It's working' is not done. 'Nobody is complaining' is not done. And the cost of not defining done is that you never actually finish anything. You just stop actively working on it.

  18. FIG. 18

    Four Waves: How a Home Lab Grows Up

    A home lab isn't a static thing. It grows through distinct phases. Wave one is making something work. Wave two is making it more complicated. Wave three is adding rigor. Wave four is building a true datacenter corollary. Most people stop at wave two. Wave four is where the interesting work is.
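
Entry 10's claim that path_regex is not a glob is easy to see with a small sketch. This is only an illustration in Python, not SOPS itself: SOPS is written in Go and uses Go's regular expression engine, so Python's `re` and `fnmatch` modules stand in for the two interpretations here. The file path and both patterns are hypothetical, and the sketch assumes the pattern is applied as an unanchored search.

```python
import fnmatch
import re


def matches_as_glob(pattern: str, path: str) -> bool:
    """Interpret the pattern the way a shell-style glob would."""
    return fnmatch.fnmatchcase(path, pattern)


def matches_as_regex(pattern: str, path: str) -> bool:
    """Interpret the same pattern as an unanchored regular expression."""
    return re.search(pattern, path) is not None


path = "secrets/db.yaml"            # hypothetical file path
glob_style = "secrets/*.yaml"       # looks right if you assume glob rules
regex_style = r"secrets/.*\.yaml$"  # the same intent written as a regex

# As a glob, "*" means "any run of characters", so the path matches.
print(matches_as_glob(glob_style, path))    # True

# As a regex, "/*" means "zero or more slashes" and "." means "any one
# character", so the glob-looking pattern does not match this path.
print(matches_as_regex(glob_style, path))   # False

# Writing the intent as a real regex matches again.
print(matches_as_regex(regex_style, path))  # True
```

The failure mode is quiet because the glob-looking pattern is still a valid regex. It just describes a different set of paths than the one you had in mind.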