// THE LAB

A working reference. Not a sandbox.

We don't advise on things we aren't willing to build ourselves. This is a production platform: private LLM inference, GitOps-managed Kubernetes, distributed storage, and full network automation. It runs real workloads, is managed entirely through code, and every decision is documented. It is the same platform we'd build on AWS, in a hybrid environment, or on bare metal. We run ours on bare metal. When a client asks whether any of this actually works, this is the answer.

qs platform status
apps · 0 / 5 · pending
platform · 0 / 3 · pending
cluster · 0 / 3 · pending
compute · 0 / 3 · pending
network + hardware · 0 / 8 · pending

Most infrastructure consultants show you slides. We show you a running system.

Platform engineering, private LLM inference, GitOps, distributed storage, network automation — all of it running under real conditions, managed entirely through code, every decision documented. The platform exists so we stay close enough to the work to know what's actually possible and so new patterns get tested in production before they show up in a client engagement.

We build this the same way we'd build it on AWS, in a hybrid environment, or in a datacenter. The architecture decisions, the failure modes, the operational discipline — those don't change based on where the compute lives.

Ours runs on bare metal. Not because it's required, and not to prove a point about difficulty. Running the full stack — from switch config to container runtime to application layer — is how we stay genuinely deep across all of it. You can't develop real intuition about EKS or GKE without understanding what the control plane is actually doing underneath. You can't advise on storage architecture without having debugged OSD placement at 2am. The bare metal practice is the mechanism. It's what makes the knowledge wide and the capability transferable, whether the work lands on managed cloud, hybrid, or hardware.

APPS & SERVICES

Workflow automation: runs in-cluster under full operational control, owns inbound contact form processing, list signups, and operational glue between platform services

Full-featured web analytics: self-hosted, operator-controlled, first-class funnels and segments, custom dimensions for audience tracking, OIDC via LoginOIDC plugin

Newsletter and email list management: fully operator-controlled, list segmentation, double opt-in, integrated with the contact intake flow

Booking and scheduling: Cal.com cloud for discovery call scheduling; integrates with the rest of the intake flow via webhook

PLATFORM SERVICES

In-house Go service powering the LIVE status panel on this site: aggregates Gatus + cluster signals, public read-only API with rate limiting and CORS
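
As a rough illustration of the aggregation idea only (the real service is written in Go and its internals are not shown here), the sketch below rolls individual health checks up into per-group "healthy / total" counts like the ones in the status panel. The field names `group`, `name`, and `healthy` and the sample checks are assumptions made for the example.

```python
from collections import defaultdict


def summarize_checks(checks):
    """Roll individual health checks up into per-group healthy/total counts."""
    totals = defaultdict(lambda: {"healthy": 0, "total": 0})
    for check in checks:
        group = totals[check["group"]]
        group["total"] += 1
        if check["healthy"]:
            group["healthy"] += 1
    return dict(totals)


# Hypothetical check results; real input would come from Gatus and cluster signals.
checks = [
    {"group": "apps", "name": "analytics", "healthy": True},
    {"group": "apps", "name": "newsletter", "healthy": False},
    {"group": "cluster", "name": "api-server", "healthy": True},
]
summary = summarize_checks(checks)
for group, counts in summary.items():
    print(f"{group}: {counts['healthy']} / {counts['total']}")
```

A public read-only endpoint would serve a summary like this rather than the raw checks, which keeps internal hostnames out of the response.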

BACKUP & RECOVERY

Kubernetes-native backup: namespace, PVC, and cluster state restore; quarterly drill requirement

Data mover backend for Velero: deduplicating PV data movement to Backblaze B2

Offsite object storage: zero egress fees, S3-compatible, bridge to future second-site Ceph RGW

LLM INFERENCE

LLM inference server: model management, GGUF quantization tiered by hardware phase, GPU offloading, OpenAI-compatible API

Web interface for Ollama: multi-user, conversation history, model switching, RAG attachment via ChromaDB

Vector store backing the Open WebUI RAG pipeline: embeddings storage and similarity search for document grounding

Postgres vector extension: used for relational vector workloads where SQL semantics matter alongside the embeddings

Exposes GPU resources to the Kubernetes scheduler: enables GPU resource requests in pod specs

OBSERVABILITY

Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics: 30-day local retention

Centralized log aggregation: single-binary mode, Ceph-backed storage, 30-day retention

Log shipping DaemonSet: scrapes all pod logs from node filesystem, forwards to Loki

Network flow observability: service-to-service traffic visibility and troubleshooting via Cilium

Status monitoring under full operator control: endpoint health checks, incident notification, status page; upstream source for the platform status microservice

Error monitoring with no external dependency: Sentry-compatible SDK API, exception tracking and grouping for platform-side and frontend code

SECRETS & REGISTRY

Secret backend: Raft HA, Kubernetes auth method, full audit log; independent of Ceph for resilience

Runtime secret injection from Vault into Kubernetes Secrets via ExternalSecret resources

Static secrets encrypted at rest in Git: value-level encryption with minimal diffs, age as the encryption backend, per-operator keypairs, CI key support

Private container registry: pull-through proxy cache, Helm chart repo, Trivy vulnerability scanning

GITOPS & AUTOMATION

GitOps controller: app-of-apps pattern, declarative cluster state reconciled from Git, SOPS decryption

Deployment packaging for all platform components: versioned charts, per-environment values files

Automated image version tracking: commits updated image tags back to Git for ArgoCD to reconcile
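
One common way to decide which tag to commit back is semantic-version ordering. The sketch below shows that selection step only, under the assumption that tags are plain `MAJOR.MINOR.PATCH` strings; the tool and update strategy used on this platform are not stated here, and `newest_tag` is a hypothetical helper.

```python
import re

# Accept plain semver tags such as "1.10.2" or "v1.10.2"; skip pre-releases and "latest".
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def newest_tag(tags):
    """Return the highest plain semver tag, or None if no tag qualifies."""
    versions = []
    for tag in tags:
        match = SEMVER.match(tag)
        if match:
            numeric = tuple(int(part) for part in match.groups())
            versions.append((numeric, tag))
    return max(versions)[1] if versions else None


candidate = newest_tag(["1.9.0", "1.10.2", "latest", "1.10.0-rc1"])
print(candidate)
```

Comparing numeric tuples rather than strings matters: as strings, "1.9.0" sorts after "1.10.2".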

INGRESS & TRAFFIC

Reverse proxy and ingress controller: IngressRoute, Gateway API, middleware, TLS termination, www-to-apex redirect

Automated TLS certificate lifecycle: ACME dns-01 via Google Cloud DNS, wildcard support, ClusterIssuer scope

DNS record automation: publishes hostnames to Google Cloud DNS from Ingress and IngressRoute annotations

CLUSTER NETWORKING

CNI and cluster dataplane: kube-proxy replacement, eBPF-based service routing, network policy, LB-IPAM, L2 announcements; node networking model + IPv6 strategy

STORAGE

Kubernetes operator managing Ceph lifecycle: OSD deployment, pool creation, StorageClass provisioning

3-way replication, host-level failure domain across 4 OSD hosts: block (RBD), object (RGW), and file (CephFS); also backs LLM model weights and object storage
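
The arithmetic behind 3-way replication with a host-level failure domain can be sketched as follows. The raw capacity figure is a made-up number for illustration, and the model deliberately ignores Ceph's full ratios and metadata overhead.

```python
def usable_capacity_tib(raw_tib, replicas=3):
    """Each object is stored `replicas` times, so usable space is raw / replicas."""
    return raw_tib / replicas


def can_restore_full_replication(osd_hosts, replicas=3, failed_hosts=1):
    """With a host-level failure domain, every replica must sit on a different host.

    Full replication can be restored only if enough hosts remain to hold
    one replica each.
    """
    return osd_hosts - failed_hosts >= replicas


# Hypothetical 48 TiB of raw disk across the OSD hosts.
print(usable_capacity_tib(48))
print(can_restore_full_replication(osd_hosts=4, failed_hosts=1))
print(can_restore_full_replication(osd_hosts=4, failed_hosts=2))
```

With four hosts, losing one still leaves three distinct hosts for three replicas. Losing two does not, so the pool would stay degraded until a host returns.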

CLUSTER

Cluster bootstrap: HA control plane via ClusterConfiguration, stacked etcd, Ansible-orchestrated lifecycle, in-place upgrade strategy

Container runtime: systemd cgroup driver, overlayfs snapshotter, NVIDIA toolkit integration for GPU nodes

HARDWARE NETWORK

Switching fabric: QX-32-R core (L3 routing), TX-64-R leaf and admin; eAPI transport for declarative config; AAA authorization baseline

Edge routing platform: RouterOS v7 on x86, allowlist firewall, NAT masquerade, public IP distribution to ingress, RouterOS API for Ansible

Remote access VPN terminated on the edge router: operator-only access to management plane

PROVISIONING

Bare metal lifecycle: PXE boot, Ubuntu deployment, BMC/IPMI power control, node inventory; pinned to MAAS 3.6 with certbot-managed TLS

Node and network configuration: Ubuntu hardening, Arista EOS, MikroTik RouterOS, node labels

Node OS: cgroup v2 defaults align with containerd and kubeadm; kernel 6.8 for full Cilium eBPF support

LOCATION: Los Angeles, CA · single half rack · colocated

This is older enterprise hardware. R630s, R730s, a Supermicro, a Quanta, all acquired used and put back to work. That's intentional. If you have hardware sitting in a datacenter that needs to earn its keep, this is what it looks like when it does. The architecture is the same one we'd build on brand new iron or in any cloud account. The substrate doesn't change what's possible.

Networking

Edge Router

Dell R630 · MikroTik RouterOS

4× 40Gb ports · WAN allowlist · NAT · WireGuard termination

Switching

Arista DCS-7050QX-32-R · DCS-7050TX-64-R ×2

core 32× 40Gb · leaf + admin 48× 10Gb each · L3 on the core

Bare Metal Automation

MAAS node

Dedicated provisioning host

PXE boot · DHCP · TFTP · cloud-init · BMC/IPMI control · isolated provisioning VLAN

Compute Nodes

Control plane + platform

3× Dell R630

control plane · platform services · Ceph OSDs

Worker nodes

3× Dell R730

general workloads · GPU compute · Ceph OSDs

Supermicro 6028U-X10DRU-i+

Kubernetes worker · Ceph OSDs

Quanta D51PJ-1ULH-2

Kubernetes worker · Ceph OSDs

VLAN · Name · Purpose
10 · Admin / Infrastructure · BMC, management, utility services
20 · Kubernetes Nodes · Node underlay, separate from pod/service CIDRs
30 · Ceph Public · Client-facing storage traffic
31 · Ceph Cluster · Replication, isolated from client access
40 · MAAS / PXE · Provisioning, DHCP and PXE isolated
50 · Public Ingress · External service exposure

Kubernetes overlay: pod CIDR 10.244.0.0/16 · service CIDR 10.96.0.0/12
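
Keeping the node underlay separate from the pod and service CIDRs is easy to verify mechanically. The sketch below uses Python's standard `ipaddress` module with the two overlay ranges listed above; the candidate subnets passed to `overlapping_ranges` are hypothetical, since the per-VLAN subnets are not listed here.

```python
import ipaddress

# Overlay ranges as listed above.
pod_cidr = ipaddress.ip_network("10.244.0.0/16")
service_cidr = ipaddress.ip_network("10.96.0.0/12")
reserved = [pod_cidr, service_cidr]


def overlapping_ranges(candidate, reserved_networks):
    """Return the reserved networks that a candidate subnet would collide with."""
    network = ipaddress.ip_network(candidate)
    return [str(r) for r in reserved_networks if network.overlaps(r)]


# The two overlay ranges must not overlap each other.
print(pod_cidr.overlaps(service_cidr))

# Hypothetical underlay candidates: the first is safe, the second collides.
print(overlapping_ranges("10.20.0.0/24", reserved))
print(overlapping_ranges("10.100.0.0/24", reserved))
```

A check like this fits naturally in CI, so an addressing change that collides with the overlay fails before it reaches a switch or a node.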

Reproducible from zero

Every component can be rebuilt from code. No configuration exists only in a running system. No manual steps in the deployment path. If we can't automate it, we don't build it.

Declarative from the substrate up

Declarative state management doesn't start at the Kubernetes layer. It starts at the network. Arista switches and MikroTik routing are configured the same way as cluster workloads: desired state in Git, automation converges to it.
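
The convergence idea is the same at every layer: compare desired state with what is actually running and derive the changes. A minimal sketch, assuming state can be modeled as a mapping of VLAN ID to name; the names and the legacy VLAN 99 are invented for the example, and real tooling would render device configuration rather than tuples.

```python
def plan_changes(desired, actual):
    """Compute the actions needed to move `actual` toward `desired`."""
    actions = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            actions.append(("add", key, value))
        elif actual[key] != value:
            actions.append(("update", key, value))
    # Anything present on the device but absent from Git is drift to remove.
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("remove", key, None))
    return actions


desired = {10: "admin", 20: "k8s-nodes", 30: "ceph-public"}  # state in Git
actual = {10: "admin", 20: "nodes", 99: "legacy"}            # state on the device
plan = plan_changes(desired, actual)
for action in plan:
    print(action)
```

Running the same function when nothing has drifted yields an empty plan, which is what makes repeated convergence runs safe.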

Right for the situation, not best-of-breed by default

Every decision here is documented: not just what was chosen, but what was considered and why. We use RouterOS instead of a Juniper SRX. We run older hardware. The reasoning is explicit and revisitable. Best practice is a starting point, not an answer.

Operationally obvious

The system should be understandable without a lookup table. VLANs follow a clear scheme. Addressing is predictable. Complexity that doesn't improve reliability or debuggability gets cut.

An ADR records a significant architectural decision: what was chosen, what was on the table, and the reasoning that won. Each ADR also documents the conditions under which the decision should be revisited. These are not decisions made once and forgotten. They are maintained positions with explicit triggers for when to reopen them. Best practice is a starting point. What is right for the situation is the answer.
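
One way to picture an ADR as data is sketched below, along with the kind of tallies a build step could derive from a directory of them. The field names and the two sample records are assumptions for illustration, not the actual contents or schema of docs/adrs.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    number: int
    title: str
    status: str  # "accepted" or "superseded"
    chosen: str
    alternatives: list = field(default_factory=list)
    revisit_when: list = field(default_factory=list)  # triggers to reopen
    depends_on: list = field(default_factory=list)    # other ADR numbers


def count_adrs(adrs):
    """Derive summary counts from a list of ADR records."""
    return {
        "decisions": sum(a.status == "accepted" for a in adrs),
        "dependencies": sum(len(a.depends_on) for a in adrs),
        "revisit_conditions": sum(len(a.revisit_when) for a in adrs),
        "superseded": sum(a.status == "superseded" for a in adrs),
    }


# Hypothetical records, loosely modeled on decisions mentioned on this page.
adrs = [
    ADR(1, "Edge routing platform", "accepted", "RouterOS",
        alternatives=["Juniper SRX"],
        revisit_when=["throughput requirements exceed the edge router"]),
    ADR(2, "Firewall policy model", "accepted", "allowlist",
        revisit_when=["second site added"], depends_on=[1]),
]
counts = count_adrs(adrs)
print(counts)
```

Storing revisit conditions as data rather than prose is what lets them be counted and reviewed on a schedule.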

0 · DECISIONS · accepted and in force
0 · DEPENDENCIES TRACKED · decisions that build on or inform each other
0 · REVISIT CONDITIONS · documented triggers to reopen a decision
0 · SUPERSEDED · revised as conditions changed

counts from docs/adrs — built 2026-05-17 02:53 UTC

View ADRs in GitHub →

The lab shows up across the writing. What we built, why we built it that way, what failed first, what we'd do differently.

If this is the kind of infrastructure you want to build, start a conversation.