// THE LAB
A working reference. Not a sandbox.
We don't advise on things we aren't willing to build ourselves. This is a production platform: private LLM inference, GitOps-managed Kubernetes, distributed storage, full network automation. Running real workloads, managed entirely through code, every decision documented. The same platform we'd build on AWS, in a hybrid environment, or on bare metal. We run ours on bare metal. When a client asks whether this approach actually works, this is the answer.
Most infrastructure consultants show you slides. We show you a running system.
Platform engineering, private LLM inference, GitOps, distributed storage, network automation — all of it running under real conditions, managed entirely through code, every decision documented. The platform exists so we stay close enough to the work to know what's actually possible and so new patterns get tested in production before they show up in a client engagement.
We build this the same way we'd build it on AWS, in a hybrid environment, or in a datacenter. The architecture decisions, the failure modes, the operational discipline — those don't change based on where the compute lives.
Ours runs on bare metal. Not because it's required, and not to prove a point about difficulty. Running the full stack — from switch config to container runtime to application layer — is how we stay genuinely deep across all of it. You can't develop real intuition about EKS or GKE without understanding what the control plane is actually doing underneath. You can't advise on storage architecture without having debugged OSD placement at 2am. The bare metal practice is the mechanism. It's what makes the knowledge wide and the capability transferable, whether the work lands on managed cloud, hybrid, or hardware.
// PLATFORM STACK
APPS & SERVICES
Workflow automation: runs in-cluster under full operational control, owns inbound contact form processing, list signups, and operational glue between platform services
Full-featured web analytics: self-hosted, operator-controlled, first-class funnels and segments, custom dimensions for audience tracking, OIDC via LoginOIDC plugin
Newsletter and email list management: fully operator-controlled, list segmentation, double opt-in, integrated with the contact intake flow
Booking and scheduling: Cal.com cloud for discovery call scheduling; integrates with the rest of the intake flow via webhook
PLATFORM SERVICES
In-house Go service powering the LIVE status panel on this site: aggregates Gatus + cluster signals, public read-only API with rate limiting and CORS
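A minimal sketch of rate limiting for a public read-only API, written in Python for brevity. The page does not say which algorithm the Go service uses, so the token bucket here is an assumption, and the names `TokenBucket` and `allow` are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self):
        # Start full so a new client can burst immediately.
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now=None):
        """Return True if one request may proceed at time `now` (seconds)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a real service one bucket would typically be kept per client key (for example, per source address), and a rejected request would be answered with HTTP 429.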
BACKUP & RECOVERY
Kubernetes-native backup: namespace, PVC, and cluster state restore; quarterly drill requirement
Data mover backend for Velero: deduplicating PV data movement to Backblaze B2
Offsite object storage: zero egress fees, S3-compatible, bridge to future second-site Ceph RGW
LLM INFERENCE
LLM inference server: model management, GGUF quantization tiered by hardware phase, GPU offloading, OpenAI-compatible API
Web interface for Ollama: multi-user, conversation history, model switching, RAG attachment via ChromaDB
Vector store backing the Open WebUI RAG pipeline: embeddings storage and similarity search for document grounding
Postgres vector extension: used for relational vector workloads where SQL semantics matter alongside the embeddings
Exposes GPU resources to the Kubernetes scheduler: enables GPU resource requests in pod specs
OBSERVABILITY
Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics: 30-day local retention
Centralized log aggregation: single-binary mode, Ceph-backed storage, 30-day retention
Log shipping DaemonSet: scrapes all pod logs from node filesystem, forwards to Loki
Network flow observability: service-to-service traffic visibility and troubleshooting via Cilium
Status monitoring under full operator control: endpoint health checks, incident notification, status page upstream for the platform status microservice
Error monitoring with no external dependency: Sentry-compatible SDK API, exception tracking and grouping for platform-side and frontend code
SECRETS & REGISTRY
Secret backend: Raft HA, Kubernetes auth method, full audit log; independent of Ceph for resilience
Runtime secret injection from Vault into Kubernetes Secrets via ExternalSecret resources
Static secrets encrypted at rest in Git: value-level encryption with minimal diffs, age as the encryption backend, per-operator keypairs, CI key support
Private container registry: pull-through proxy cache, Helm chart repo, Trivy vulnerability scanning
GITOPS & AUTOMATION
GitOps controller: app-of-apps pattern, declarative cluster state reconciled from Git, SOPS decryption
Deployment packaging for all platform components: versioned charts, per-environment values files
Automated image version tracking: commits updated image tags back to Git for ArgoCD to reconcile
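The image tracking step above amounts to "find the newest acceptable tag, and commit only if it is newer than what Git already pins". A rough sketch of that decision follows, assuming plain `MAJOR.MINOR.PATCH` tags. The function names are hypothetical and this is not the API of any particular image updater.

```python
import re

# Plain semver only: pre-releases and floating tags such as "latest" are ignored.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag):
    """Return (major, minor, patch) for a plain semver tag, else None."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def newest_tag(tags):
    """Return the highest plain semver tag in `tags`, or None if there is none."""
    parsed = [(parse(tag), tag) for tag in tags if parse(tag) is not None]
    return max(parsed)[1] if parsed else None


def proposed_update(current, tags):
    """Return the tag to commit to Git, or None when no commit is needed."""
    candidate = newest_tag(tags)
    if candidate is None:
        return None
    current_key = parse(current)
    if current_key is not None and parse(candidate) <= current_key:
        return None
    return candidate
```

Comparing parsed tuples rather than strings matters: as strings, "1.9.0" sorts after "1.10.0".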
INGRESS & TRAFFIC
Reverse proxy and ingress controller: IngressRoute, Gateway API, middleware, TLS termination, www-to-apex redirect
Automated TLS certificate lifecycle: ACME dns-01 via Google Cloud DNS, wildcard support, ClusterIssuer scope
DNS record automation: publishes hostnames to Google Cloud DNS from Ingress and IngressRoute annotations
CLUSTER NETWORKING
CNI and cluster dataplane: kube-proxy replacement, eBPF-based service routing, network policy, LB-IPAM, L2 announcements; node networking model + IPv6 strategy
STORAGE
CLUSTER
Cluster bootstrap: HA control plane via ClusterConfiguration, stacked etcd, Ansible-orchestrated lifecycle, in-place upgrade strategy
Container runtime: systemd cgroup driver, overlayfs snapshotter, NVIDIA toolkit integration for GPU nodes
HARDWARE NETWORK
Switching fabric: QX-32-R core (L3 routing), TX-64-R leaf and admin; eAPI transport for declarative config; AAA authorization baseline
Edge routing platform: RouterOS v7 on x86, allowlist firewall, NAT masquerade, public IP distribution to ingress, RouterOS API for Ansible
Remote access VPN terminated on the edge router: operator-only access to management plane
PROVISIONING
Bare metal lifecycle: PXE boot, Ubuntu deployment, BMC/IPMI power control, node inventory; pinned to 3.6 with certbot-managed TLS
Node and network configuration: Ubuntu hardening, Arista EOS, MikroTik RouterOS, node labels
Node OS: cgroup v2 defaults align with containerd and kubeadm; kernel 6.8 for full Cilium eBPF support
This is older enterprise hardware. R630s, R730s, a Supermicro, a Quanta, all acquired used and put back to work. That's intentional. If you have hardware sitting in a datacenter that needs to earn its keep, this is what it looks like when it does. The architecture is the same one we'd build on brand new iron or in any cloud account. The substrate doesn't change what's possible.
Networking
Edge Router
Dell R630 · MikroTik RouterOS
4× 40Gb ports · WAN allowlist · NAT · WireGuard termination
Switching
Arista DCS-7050QX-32-R · DCS-7050TX-64-R ×2
core 32× 40Gb · leaf + admin 48× 10Gb each · L3 on the core
Bare Metal Automation
MAAS node
Dedicated provisioning host
PXE boot · DHCP · TFTP · cloud-init · BMC/IPMI control · isolated provisioning VLAN
Compute Nodes
Control plane + platform
3× Dell R630
control plane · platform services · Ceph OSDs
Worker nodes
3× Dell R730
general workloads · GPU compute · Ceph OSDs
Supermicro 6028U-X10DRU-i+
Kubernetes worker · Ceph OSDs
Quanta D51PJ-1ULH-2
Kubernetes worker · Ceph OSDs
Kubernetes overlay: pod CIDR 10.244.0.0/16 · service CIDR 10.96.0.0/12
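As a quick sanity check on the two ranges above, the sketch below uses Python's standard `ipaddress` module to confirm they do not overlap and to show how a proposed subnet could be tested against them. The candidate subnet in the usage note is made up for illustration and is not part of the lab's addressing plan.

```python
import ipaddress

# The two CIDRs stated on this page.
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")


def describe(net):
    """Summarize a network: first address, last address, and total size."""
    return {
        "first": str(net[0]),
        "last": str(net[-1]),
        "addresses": net.num_addresses,
    }


def conflicts(candidate, reserved):
    """Return the reserved networks that a candidate subnet overlaps."""
    return [str(net) for net in reserved if candidate.overlaps(net)]
```

For example, `conflicts(ipaddress.ip_network("10.100.0.0/24"), [POD_CIDR, SERVICE_CIDR])` reports a clash with the service range, because 10.96.0.0/12 spans 10.96.0.0 through 10.111.255.255.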
// DESIGN PRINCIPLES
Reproducible from zero
Every component can be rebuilt from code. No configuration exists only in a running system. No manual steps in the deployment path. If we can't automate it, we don't build it.
Declarative from the substrate up
Declarative state management doesn't start at the Kubernetes layer. It starts at the network. Arista switches and MikroTik routing are configured the same way as cluster workloads: desired state in Git, automation converges to it.
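The "desired state in Git, automation converges to it" loop can be sketched as a diff between two dictionaries. This is a simplified illustration, not the lab's actual tooling. The function names `plan` and `apply_plan` and the VLAN entries in the usage note are hypothetical.

```python
def plan(desired, actual):
    """Compute the actions needed to converge `actual` to `desired`."""
    actions = []
    # Present in Git but missing on the device: create.
    for key in sorted(desired.keys() - actual.keys()):
        actions.append(("create", key, desired[key]))
    # Present in both but different: update to the Git value.
    for key in sorted(desired.keys() & actual.keys()):
        if desired[key] != actual[key]:
            actions.append(("update", key, desired[key]))
    # Present on the device but not in Git: delete the drift.
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", key, None))
    return actions


def apply_plan(actual, actions):
    """Return the state that results from applying `actions` to `actual`."""
    result = dict(actual)
    for verb, key, value in actions:
        if verb == "delete":
            result.pop(key, None)
        else:
            result[key] = value
    return result
```

With `desired = {"vlan10": {"name": "mgmt"}, "vlan20": {"name": "storage"}}` and `actual = {"vlan10": {"name": "management"}, "vlan99": {"name": "stale"}}`, the plan creates `vlan20`, updates `vlan10`, and deletes `vlan99`. A second run against the converged state yields an empty plan, which is the idempotence a declarative workflow depends on.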
Right for the situation, not best-of-breed by default
Every decision here is documented: not just what was chosen, but what was considered and why. We use RouterOS instead of a Juniper SRX. We run older hardware. The reasoning is explicit and revisitable. Best practice is a starting point, not an answer.
Operationally obvious
The system should be understandable without a lookup table. VLANs follow a clear scheme. Addressing is predictable. Complexity that doesn't improve reliability or debuggability gets cut.
// ARCHITECTURE DECISION RECORDS
An ADR records a significant architectural decision: what was chosen, what was on the table, and the reasoning that won. Each ADR also documents the conditions under which the decision should be revisited. These are not decisions made once and forgotten. They are maintained positions with explicit triggers for when to reopen them. Best practice is a starting point. What is right for the situation is the answer.
DECISIONS: accepted and in force
DEPENDENCIES TRACKED: decisions that build on or inform each other
REVISIT CONDITIONS: documented triggers to reopen a decision
SUPERSEDED: revised as conditions changed
counts from docs/adrs — built 2026-05-17 02:53 UTC
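One way the four counts could be derived from ADR records is sketched below. The `ADR` fields, the status values, and the `summarize` function are assumptions for illustration. The real layout of docs/adrs is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    """A single decision record (hypothetical schema)."""
    number: int
    title: str
    status: str  # assumed values: "accepted" or "superseded"
    depends_on: list = field(default_factory=list)    # numbers of related ADRs
    revisit_when: list = field(default_factory=list)  # triggers to reopen the decision


def summarize(adrs):
    """Compute the four headline counts from a list of ADR records."""
    return {
        "decisions": sum(1 for adr in adrs if adr.status == "accepted"),
        "dependencies_tracked": sum(len(adr.depends_on) for adr in adrs),
        "revisit_conditions": sum(len(adr.revisit_when) for adr in adrs),
        "superseded": sum(1 for adr in adrs if adr.status == "superseded"),
    }
```

Keeping revisit triggers as structured data rather than free text is what makes them countable, and it makes a record with no trigger easy to flag in review.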
View ADRs in GitHub →
// FURTHER READING
The lab shows up across the writing. What we built, why we built it that way, what failed first, what we'd do differently.
If this is the kind of infrastructure you want to build, start a conversation.