// THE LAB

A working reference. Not a sandbox.

We don't advise on things we aren't willing to build ourselves. This is a production platform: private LLM inference, GitOps-managed Kubernetes, distributed storage, full network automation. Running real workloads, managed entirely through code, every decision documented. The same platform we'd build on AWS, in a hybrid environment, or on bare metal. We run ours on bare metal. When a client asks whether any of this actually works in practice, this is the answer.

qs platform status

apps · 0 / 5 · pending
platform · 0 / 5 · pending
cluster · 0 / 3 · pending
compute · 0 / 4 · pending
network + hardware · 0 / 8 · pending

qs platform status --all

Every check below is pending: no data from collector.

// apps

web · quorumsystems.io · pending
authentik · authentik · pending
vault · vault · pending · gated: Phase 6 (ESO first)
n8n · n8n · pending
posthog · posthog · pending · gated: Phase 8

// platform

orchestration
gitops · ArgoCD · pending · gated: Ceph → Gitea → ArgoCD
tls · cert-manager · pending · gated: cert-manager install
registry · gitea · pending · gated: Phase 3
secrets · vault + ESO · pending · gated: Phase 6
observability · prometheus + grafana · pending · gated: kube-prometheus-stack

// cluster

control plane
cluster · apiserver · pending · gated: kubeadm bootstrap
etcd · etcd · pending · gated: kubeadm bootstrap
storage · Rook-Ceph · pending · gated: kubeadm bootstrap

// compute

kubernetes
nodes · k8s nodes · pending · gated: kubeadm bootstrap

gpu
gpu_r730_01 · r730-01 (RTX 3090) · pending · gated: GPU Operator (Phase 6)
gpu_r730_02 · r730-02 (Tesla M10) · pending · gated: GPU Operator (Phase 6)
gpu_r730_03 · r730-03 (RTX 3090) · pending · gated: GPU Operator (Phase 6)

// network + hardware

routing
router · gateway · pending

switching
switch_core · core-switch-01 · pending
switch_admin · admin-switch-01 · pending
switch_leaf · leaf-switch-01 · pending

provisioning
maas · MAAS · pending

storage
storage_01 · storage-01 · pending

utility
utility_01 · utility-01 · pending
utility_02 · utility-02 · pending

Most infrastructure consultants show you slides. We show you a running system.

Platform engineering, private LLM inference, GitOps, distributed storage, network automation — all of it running under real conditions, managed entirely through code, every decision documented. The platform exists so we stay close enough to the work to know what's actually possible and so new patterns get tested in production before they show up in a client engagement.

We build this the same way we'd build it on AWS, in a hybrid environment, or in a datacenter. The architecture decisions, the failure modes, the operational discipline — those don't change based on where the compute lives.

Ours runs on bare metal. Not because it's required, and not to prove a point about difficulty. Running the full stack — from switch config to container runtime to application layer — is how we stay genuinely deep across all of it. You can't develop real intuition about EKS or GKE without understanding what the control plane is actually doing underneath. You can't advise on storage architecture without having debugged OSD placement at 2am. The bare metal practice is the mechanism. It's what makes the knowledge wide and the capability transferable, whether the work lands on managed cloud, hybrid, or hardware.

// PLATFORM STACK

APPS & SERVICES

n8n

Workflow automation: runs in-cluster under full operational control, owns inbound contact form processing, list signups, and operational glue between platform services

Matomo

Full-featured web analytics: self-hosted, operator-controlled, first-class funnels and segments, custom dimensions for audience tracking, OIDC via LoginOIDC plugin

Listmonk

Newsletter and email list management: fully operator-controlled, list segmentation, double opt-in, integrated with the contact intake flow

Cal.com

Booking and scheduling: Cal.com cloud for discovery call scheduling; integrates with the rest of the intake flow via webhook

PLATFORM SERVICES

Platform Status microservice

In-house Go service powering the LIVE status panel on this site: aggregates Gatus + cluster signals, public read-only API with rate limiting and CORS
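As a rough illustration of the aggregation step, here is a minimal sketch of how per-check signals could roll up into the "healthy / total" group lines shown in the status panel. It is written in Python for brevity, not in the Go the service is built in, and the `Check` type, the `summarize` function, and the status strings are all hypothetical names, not taken from the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    group: str   # panel group, e.g. "apps" or "cluster"
    name: str    # check name, e.g. "vault"
    status: str  # assumed values: "up", "down", "pending" (no data yet)


def summarize(checks):
    """Roll checks up to {group: (healthy, total, label)} in first-seen order."""
    by_group = {}
    for check in checks:
        by_group.setdefault(check.group, []).append(check.status)

    summary = {}
    for group, statuses in by_group.items():
        healthy = sum(1 for s in statuses if s == "up")
        if any(s == "down" for s in statuses):
            label = "degraded"   # at least one check is failing
        elif healthy == len(statuses):
            label = "ok"         # every check reports up
        else:
            label = "pending"    # nothing failing, but data is missing
        summary[group] = (healthy, len(statuses), label)
    return summary


if __name__ == "__main__":
    checks = [
        Check("apps", "web", "pending"),
        Check("apps", "vault", "pending"),
        Check("cluster", "etcd", "up"),
    ]
    for group, (healthy, total, label) in summarize(checks).items():
        print(f"{group} · {healthy} / {total} · {label}")
```

Treating "no data" as its own state rather than as a failure is a design choice in this sketch: it keeps a component that has not been deployed yet from reading as an outage.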

BACKUP & RECOVERY

Velero

Kubernetes-native backup: namespace, PVC, and cluster state restore; quarterly drill requirement

Kopia

Data mover backend for Velero: deduplicating PV data movement to Backblaze B2

Backblaze B2

Offsite object storage: zero egress fees, S3-compatible, bridge to future second-site Ceph RGW

LLM INFERENCE

Ollama

LLM inference server: model management, GGUF quantization tiered by hardware phase, GPU offloading, OpenAI-compatible API

Open WebUI

Web interface for Ollama: multi-user, conversation history, model switching, RAG attachment via ChromaDB

ChromaDB

Vector store backing the Open WebUI RAG pipeline: embeddings storage and similarity search for document grounding

pgvector

Postgres vector extension: used for relational vector workloads where SQL semantics matter alongside the embeddings

NVIDIA device plugin

Exposes GPU resources to the Kubernetes scheduler: enables GPU resource requests in pod specs

OBSERVABILITY

kube-prometheus-stack

Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics: 30-day local retention

Loki

Centralized log aggregation: single-binary mode, Ceph-backed storage, 30-day retention

Promtail

Log shipping DaemonSet: scrapes all pod logs from node filesystem, forwards to Loki

Hubble

Network flow observability: service-to-service traffic visibility and troubleshooting via Cilium

Gatus

Status monitoring under full operator control: endpoint health checks, incident notification, status page upstream for the platform status microservice

GlitchTip

Error monitoring with no external dependency: Sentry-compatible SDK API, exception tracking and grouping for platform-side and frontend code

SECRETS & REGISTRY

Vault

Secret backend: Raft HA, Kubernetes auth method, full audit log; independent of Ceph for resilience

External Secrets Operator

Runtime secret injection from Vault into Kubernetes Secrets via ExternalSecret resources

SOPS + age

Static secrets encrypted at rest in Git: value-level encryption with minimal diffs, age as the encryption backend, per-operator keypairs, CI key support

Harbor

Private container registry: pull-through proxy cache, Helm chart repo, Trivy vulnerability scanning

GITOPS & AUTOMATION

ArgoCD

GitOps controller: app-of-apps pattern, declarative cluster state reconciled from Git, SOPS decryption

Helm

Deployment packaging for all platform components: versioned charts, per-environment values files

ArgoCD Image Updater

Automated image version tracking: commits updated image tags back to Git for ArgoCD to reconcile

INGRESS & TRAFFIC

Traefik

Reverse proxy and ingress controller: IngressRoute, Gateway API, middleware, TLS termination, www-to-apex redirect

cert-manager

Automated TLS certificate lifecycle: ACME dns-01 via Google Cloud DNS, wildcard support, ClusterIssuer scope

External DNS

DNS record automation: publishes hostnames to Google Cloud DNS from Ingress and IngressRoute annotations

CLUSTER NETWORKING

Cilium

CNI and cluster dataplane: kube-proxy replacement, eBPF-based service routing, network policy, LB-IPAM, L2 announcements; node networking model + IPv6 strategy

STORAGE

Rook

Kubernetes operator managing Ceph lifecycle: OSD deployment, pool creation, StorageClass provisioning

Ceph

3-way replication, host-level failure domain across 4 OSD hosts: block (RBD), object (RGW), and file (CephFS); also backs LLM model weights and object storage
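The arithmetic behind 3-way replication with a host-level failure domain can be sketched as follows. The per-host raw capacity is a made-up figure for illustration, and the sketch ignores Ceph's nearfull/full ratios and uneven OSD sizes, so treat the numbers as upper bounds rather than a sizing guide.

```python
def usable_capacity_tib(hosts, raw_per_host_tib, replicas=3):
    """Upper bound on usable capacity when each replica must sit on a different host."""
    if hosts < replicas:
        raise ValueError("a host-level failure domain needs at least `replicas` hosts")
    return hosts * raw_per_host_tib / replicas


def self_heal_capacity_tib(hosts, raw_per_host_tib, replicas=3):
    """Data that still fits at full replication after one host is lost."""
    return usable_capacity_tib(hosts - 1, raw_per_host_tib, replicas)


if __name__ == "__main__":
    raw = 12  # TiB of raw capacity per OSD host (hypothetical)
    print(usable_capacity_tib(4, raw))     # 16.0
    print(self_heal_capacity_tib(4, raw))  # 12.0
```

With exactly three hosts, `self_heal_capacity_tib` raises, because losing one host leaves nowhere to place the third replica. Under these assumptions, a fourth OSD host is what lets the cluster return to full replication after a host failure, provided stored data stays under the self-heal figure.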

CLUSTER

kubeadm

Cluster bootstrap: HA control plane via ClusterConfiguration, stacked etcd, Ansible-orchestrated lifecycle, in-place upgrade strategy

containerd

Container runtime: systemd cgroup driver, overlayfs snapshotter, NVIDIA toolkit integration for GPU nodes

HARDWARE NETWORK

Arista EOS · DCS-7050 series

Switching fabric: QX-32-R core (L3 routing), TX-64-R leaf and admin; eAPI transport for declarative config; AAA authorization baseline

MikroTik RouterOS · Dell R630 edge

Edge routing platform: RouterOS v7 on x86, allowlist firewall, NAT masquerade, public IP distribution to ingress, RouterOS API for Ansible

WireGuard

Remote access VPN terminated on the edge router: operator-only access to management plane

PROVISIONING

MAAS

Bare metal lifecycle: PXE boot, Ubuntu deployment, BMC/IPMI power control, node inventory; pinned to 3.6 with certbot-managed TLS

Ansible

Node and network configuration: Ubuntu hardening, Arista EOS, MikroTik RouterOS, node labels

Ubuntu 24.04 LTS

Node OS: cgroup v2 defaults align with containerd and kubeadm; kernel 6.8 for full Cilium eBPF support

LOCATION Los Angeles, CA · single half rack · colocated

This is older enterprise hardware. R630s, R730s, a Supermicro, a Quanta, all acquired used and put back to work. That's intentional. If you have hardware sitting in a datacenter that needs to earn its keep, this is what it looks like when it does. The architecture is the same one we'd build on brand new iron or in any cloud account. The substrate doesn't change what's possible.

Networking

Edge Router

Dell R630 · MikroTik RouterOS

4× 40Gb ports · WAN allowlist · NAT · WireGuard termination

Switching

Arista DCS-7050QX-32-R · DCS-7050TX-64-R ×2

core 32× 40Gb · leaf + admin 48× 10Gb each · L3 on the core

Bare Metal Automation

MAAS node

Dedicated provisioning host

PXE boot · DHCP · TFTP · cloud-init · BMC/IPMI control · isolated provisioning VLAN

Compute Nodes

Control plane + platform

3× Dell R630

control plane · platform services · Ceph OSDs

Worker nodes

3× Dell R730

general workloads · GPU compute · Ceph OSDs

Supermicro 6028U-X10DRU-i+

Kubernetes worker · Ceph OSDs

Quanta D51PJ-1ULH-2

Kubernetes worker · Ceph OSDs

VLAN · Name · Subnet · Purpose

10 · Admin / Infrastructure · 10.2.10.0/24 · BMC, management, utility services
20 · Kubernetes Nodes · 10.2.20.0/24 · Node underlay, separate from pod/service CIDRs
30 · Ceph Public · 10.2.30.0/24 · Client-facing storage traffic
31 · Ceph Cluster · 10.2.31.0/24 · Replication, isolated from client access
40 · MAAS / PXE · 10.2.40.0/24 · Provisioning, DHCP and PXE isolated
50 · Public Ingress · 10.2.50.0/24 · External service exposure

Kubernetes overlay: pod CIDR 10.244.0.0/16 · service CIDR 10.96.0.0/12
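The addressing plan above lends itself to a mechanical check. The sketch below uses Python's standard `ipaddress` module to confirm that each subnet follows the `10.2.<vlan>.0/24` pattern (a pattern inferred from the listed subnets, not stated as a rule in the source) and that no node-level subnet overlaps the pod or service CIDRs. The `check_plan` function is a hypothetical helper, not part of the platform's tooling.

```python
import ipaddress

# VLAN ID -> (name, subnet), copied from the addressing plan above.
VLANS = {
    10: ("Admin / Infrastructure", "10.2.10.0/24"),
    20: ("Kubernetes Nodes", "10.2.20.0/24"),
    30: ("Ceph Public", "10.2.30.0/24"),
    31: ("Ceph Cluster", "10.2.31.0/24"),
    40: ("MAAS / PXE", "10.2.40.0/24"),
    50: ("Public Ingress", "10.2.50.0/24"),
}
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")


def check_plan(vlans):
    """Return a list of human-readable problems; an empty list means the plan is clean."""
    problems = []
    seen = []
    for vid, (name, cidr) in sorted(vlans.items()):
        net = ipaddress.ip_network(cidr)
        # Assumed convention: the third octet mirrors the VLAN ID.
        if str(net) != f"10.2.{vid}.0/24":
            problems.append(f"VLAN {vid} ({name}): {net} breaks the 10.2.<vlan>.0/24 scheme")
        for label, reserved in (("pod", POD_CIDR), ("service", SERVICE_CIDR)):
            if net.overlaps(reserved):
                problems.append(f"VLAN {vid} ({name}): {net} overlaps the {label} CIDR")
        for other_vid, other in seen:
            if net.overlaps(other):
                problems.append(f"VLAN {vid} ({name}): {net} overlaps VLAN {other_vid}")
        seen.append((vid, net))
    return problems


if __name__ == "__main__":
    print(check_plan(VLANS))  # → []
```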

// DESIGN PRINCIPLES

Reproducible from zero

Every component can be rebuilt from code. No configuration exists only in a running system. No manual steps in the deployment path. If we can't automate it, we don't build it.

Declarative from the substrate up

Declarative state management doesn't start at the Kubernetes layer. It starts at the network. Arista switches and MikroTik routing are configured the same way as cluster workloads: desired state in Git, automation converges to it.
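The "desired state in Git, automation converges to it" loop reduces to a diff between two states followed by the actions that close the gap. The following is a minimal sketch of that pattern only. It does not reflect how the Arista or MikroTik automation is actually implemented, and `plan_changes`, `apply_changes`, and the VLAN names are hypothetical.

```python
def plan_changes(desired, actual):
    """Diff two {key: value} states into the actions needed to converge."""
    actions = []
    for key in sorted(desired.keys() - actual.keys()):
        actions.append(("create", key, desired[key]))
    for key in sorted(desired.keys() & actual.keys()):
        if desired[key] != actual[key]:
            actions.append(("update", key, desired[key]))
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", key, None))
    return actions


def apply_changes(actual, actions):
    """Apply planned actions to a copy of the actual state and return it."""
    result = dict(actual)
    for op, key, value in actions:
        if op == "delete":
            result.pop(key)
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    desired = {10: "admin", 20: "k8s-nodes", 30: "ceph-public"}  # from Git
    actual = {10: "admin", 20: "nodes", 99: "scratch"}           # from the device
    actions = plan_changes(desired, actual)
    converged = apply_changes(actual, actions)
    print(actions)
    print(plan_changes(desired, converged))  # → []
```

The second call returning an empty plan is the property that matters: once the device matches Git, running the automation again changes nothing.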

Right for the situation, not best-of-breed by default

Every decision here is documented: not just what was chosen, but what was considered and why. We use RouterOS instead of a Juniper SRX. We run older hardware. The reasoning is explicit and revisitable. Best practice is a starting point, not an answer.

Operationally obvious

The system should be understandable without a lookup table. VLANs follow a clear scheme. Addressing is predictable. Complexity that doesn't improve reliability or debuggability gets cut.

// ARCHITECTURE DECISION RECORDS

An ADR records a significant architectural decision: what was chosen, what was on the table, and the reasoning that won. Each ADR also documents the conditions under which the decision should be revisited. These are not decisions made once and forgotten. They are maintained positions with explicit triggers for when to reopen them. Best practice is a starting point. What is right for the situation is the answer.

DECISIONS

accepted and in force

DEPENDENCIES TRACKED

decisions that build on or inform each other

REVISIT CONDITIONS

documented triggers to reopen a decision

SUPERSEDED

revised as conditions changed

counts from docs/adrs — built 2026-05-25 03:25 UTC

View ADRs in GitHub →
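To make the shape of an ADR concrete, here is a minimal sketch of a record type carrying the fields described above, plus a function that derives the four counters from a set of records. The field names, the `counts` function, and both example records are hypothetical. In particular, the revisit trigger is a placeholder, not a condition taken from the real ADRs.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    number: int
    title: str
    status: str    # assumed values: "accepted" or "superseded"
    chosen: str
    considered: list = field(default_factory=list)    # options that lost
    depends_on: list = field(default_factory=list)    # numbers of related ADRs
    revisit_when: list = field(default_factory=list)  # triggers to reopen


def counts(adrs):
    """Derive the summary counters from a collection of ADR records."""
    return {
        "decisions": sum(1 for a in adrs if a.status == "accepted"),
        "dependencies_tracked": sum(len(a.depends_on) for a in adrs),
        "revisit_conditions": sum(len(a.revisit_when) for a in adrs),
        "superseded": sum(1 for a in adrs if a.status == "superseded"),
    }


if __name__ == "__main__":
    adrs = [
        ADR(
            number=1,
            title="Edge router platform",
            status="accepted",
            chosen="MikroTik RouterOS",
            considered=["Juniper SRX"],
            revisit_when=["<placeholder trigger>"],
        ),
        ADR(
            number=2,
            title="<placeholder decision>",
            status="superseded",
            chosen="<placeholder>",
            depends_on=[1],
        ),
    ]
    print(counts(adrs))
```

Keeping revisit triggers as structured data, rather than prose buried in the record, is what makes a counter like "revisit conditions" cheap to compute at build time.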

// FURTHER READING

The lab shows up across the writing. What we built, why we built it that way, what failed first, what we'd do differently.

Infrastructure · Networking · Platform Engineering · Automation · Operations · AI & LLMs

If this is the kind of infrastructure you want to build, start a conversation.