What You Will Learn
- Problem and solution patterns in self-hosted runner environments
- How to prevent resource contention on a shared Docker daemon
- The port-collision problem caused by the birthday paradox
- How to investigate when “CI that was working suddenly breaks”
Introduction
In Part 2, I wrote about automating E2E tests using WebAuthn and Mailpit. The tests themselves worked fine; the problem was the CI infrastructure.
Saru has 4 frontends x 4 backend APIs. E2E tests run independently for each portal, plus there are cross-portal tests (integration tests between portals). Running these in parallel means 7+ jobs executing simultaneously.
Initially, I used GitHub-hosted runners, but E2E tests require a database, Keycloak, and a mail server—lots of Docker containers. GitHub-hosted runners are slow to set up each time and cost-inefficient for parallel execution.
So I migrated to self-hosted runners. The decision was correct, but a flood of new problems emerged.
This article chronicles the problems encountered in a self-hosted CI environment and their solutions, in chronological order. I hope it helps anyone adopting a similar configuration.
1. Docker Desktop/WSL2 Was Too Unstable
The Initial Setup
I initially ran runners on Docker Desktop (WSL2 backend) on Windows. The setup was simple:
Windows Host
└─ WSL2
└─ Docker Desktop
└─ GitHub Actions Runner x N
The problem: “containers randomly die.” During E2E tests, the PostgreSQL container would suddenly vanish, or Keycloak would become unresponsive. docker inspect showed Exit Code: 137 (SIGKILL).
Tracing the cause led to Docker Desktop/WSL2’s virtualization layer:
Container → Docker Engine → WSL2 → Hyper-V → Windows
WSL2 itself runs as a lightweight Hyper-V VM, with Docker stacked on top of it. When memory pressure rises, the Linux kernel inside WSL2 invokes the OOM killer, which kills Docker containers indiscriminately.
Migration to Hyper-V VM
The solution was to bypass WSL2 and create an Ubuntu VM directly on Hyper-V:
Container → Docker Engine → Ubuntu VM → Hyper-V → Windows
| Item | Value |
|---|---|
| VM Name | saru-ci-runner |
| OS | Ubuntu 24.04 |
| vCPU | 16 |
| Memory | 64GB |
| Disk | 200GB |
| Network | External Switch (bridged) |
A dedicated Hyper-V VM removes the WSL2-specific layer and has far more predictable memory management. WSL2 shares memory dynamically with the host (by default capped at 50% of host RAM or 8GB, whichever is smaller), while the Hyper-V VM gets a fixed allocation, greatly reducing the risk of OOM killer strikes.
On top of this, I deployed 15 GitHub Actions Runners as systemd services:
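A minimal sketch of that setup, assuming GitHub's standard actions-runner package with its config.sh and svc.sh helpers; the repository URL, install paths, and labels are placeholders:

```bash
#!/usr/bin/env bash
# Register each runner and install it as a systemd service via the
# runner package's svc.sh helper. REPO_URL and REG_TOKEN are placeholders.
set -euo pipefail

REPO_URL="https://github.com/example/saru"
for n in $(seq 1 15); do
  cd "/opt/actions-runner-${n}"
  ./config.sh --unattended \
    --url "${REPO_URL}" \
    --token "${REG_TOKEN}" \
    --name "saru-hyperv-${n}" \
    --labels "self-hosted,saru-e2e"
  sudo ./svc.sh install   # creates a systemd unit for this runner
  sudo ./svc.sh start
done
```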
15 runners sharing a single Docker daemon. This “sharing” would later cause many problems.
2. Port Collisions via the Birthday Paradox
Problem
E2E tests have each job start its own PostgreSQL, Keycloak, frontend, and backend. To avoid port collisions, I calculated port numbers from the GitHub Actions RUN_ID:
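Roughly, it looked like this (a reconstruction; the base port values are illustrative):

```bash
# Derive a per-job offset from the run ID and build every service port
# from it. GITHUB_RUN_ID is effectively random, so two concurrent jobs
# can land on the same offset.
PORT_OFFSET=$(( GITHUB_RUN_ID % 3000 ))      # 3000 possible offsets
FRONTEND_PORT=$(( 20000 + PORT_OFFSET ))
POSTGRES_PORT=$(( 30000 + PORT_OFFSET ))
KEYCLOAK_PORT=$(( 34000 + PORT_OFFSET ))
```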
This looks fine, but since 15 runners share a single Docker daemon, ports from simultaneously running jobs can collide.
This has the same structure as the birthday paradox. With 3000 possible port offsets and 5 concurrent jobs, the collision probability is about 0.33% (1 - 3000!/(3000^5 × 2995!)). Seems trivial, but when CI runs dozens of times per day, collisions happen several times a week. And when ports collide, you get the cryptic error “container started but service unreachable.”
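For reference, the same figure falls out of a one-liner (5 jobs drawing from 3000 offsets):

```bash
awk 'BEGIN { p = 1; for (i = 0; i < 5; i++) p *= (3000 - i) / 3000; printf "collision probability: %.4f\n", 1 - p }'
# -> collision probability: 0.0033
```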
Solution: RUNNER_NAME-based allocation
Instead of the effectively random RUN_ID offset, I switched to deterministic port assignment derived from the runner name:
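A sketch of the deterministic scheme, with constants chosen to reproduce the ranges in the table below (variable names and per-service offsets are illustrative):

```bash
# RUNNER_NAME is set by GitHub Actions, e.g. "saru-hyperv-7" -> runner 7.
RUNNER_NUM="${RUNNER_NAME##*-}"

# Each runner owns a disjoint block of ports.
FRONTEND_BASE=$(( 20000 + RUNNER_NUM * 200 ))    # 20200, 20400, ... 23000
INFRA_BASE=$((    30000 + RUNNER_NUM * 1000 ))   # 31000, 32000, ... 45000

POSTGRES_PORT=$(( INFRA_BASE ))                  # infra services use INFRA_BASE..+510
KEYCLOAK_PORT=$(( INFRA_BASE + 100 ))            # illustrative offset within the block
PORTAL_PORT=$(( FRONTEND_BASE ))                 # portals use FRONTEND_BASE..+113
```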
The key insight: “each runner executes only one job at a time.” Since the runner number uniquely determines the port block, collisions are impossible by design:
| Runner | Frontend Range | Infra Range |
|---|---|---|
| saru-hyperv-1 | 20200–20313 | 31000–31510 |
| saru-hyperv-2 | 20400–20513 | 32000–32510 |
| … | … | … |
| saru-hyperv-15 | 23000–23113 | 45000–45510 |
All assigned ranges were confirmed to stay below the 65535 port limit.
3. docker system prune Kills Other Jobs’ Containers
Problem
I had Docker cleanup at the end of each CI job:
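Reconstructed, the step amounted to this:

```bash
# End-of-job cleanup (the problematic version): prune removes ALL stopped
# containers, unused networks, and dangling images on the shared daemon,
# regardless of which job created them.
docker system prune --force --volumes
```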
docker system prune deletes all stopped containers. Since 15 runners share a single Docker daemon, one job’s cleanup can destroy containers another job is actively using.
The timing right after container startup is especially tricky. If prune runs after docker compose up (or docker run) has created a container but before its health check passes, the still-starting container can be judged "stopped" and deleted.
Solution: Targeted cleanup
Container deletion targets only the containers tied to the job's own RUN_ID, and name patterns exclude the persistent containers (saru-postgres-integ, etc.):
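A sketch of scoped cleanup, assuming job containers embed the run ID in their names (the saru-e2e-&lt;RUN_ID&gt; prefix is a hypothetical convention):

```bash
# Delete only this job's containers; never touch persistent ones.
JOB_PREFIX="saru-e2e-${GITHUB_RUN_ID}"

docker ps -aq --filter "name=${JOB_PREFIX}" | while read -r id; do
  name="$(docker inspect --format '{{.Name}}' "${id}")"
  case "${name}" in
    */saru-postgres-integ|*/saru-mailpit)   # persistent infrastructure: skip
      continue ;;
  esac
  docker rm -f "${id}"
done
```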
Lesson: In shared Docker daemon environments, docker system prune is a banned weapon. Always use scoped deletion.
4. PostgreSQL Silently Crashes from Shared Memory Exhaustion
Problem
PostgreSQL containers in CI started crashing immediately after startup. pg_isready succeeds momentarily, but the following psql command returns “container is not running”:
✓ pg_isready -U test → success
✗ psql -U test -c "CREATE DATABASE ..." → container is not running
This is extremely confusing. PostgreSQL internally restarts after initdb, so if pg_isready succeeds right before that restart, the process no longer exists when the next command runs.
But the real cause was different. Docker’s default /dev/shm size (64MB) is insufficient for PostgreSQL.
Solution
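For example (the image tag and credentials are placeholders):

```bash
# 256MB of /dev/shm instead of Docker's 64MB default.
docker run -d \
  --name "saru-postgres-${GITHUB_RUN_ID}" \
  --shm-size=256m \
  -e POSTGRES_USER=test \
  -e POSTGRES_PASSWORD=test \
  -p "${POSTGRES_PORT}:5432" \
  postgres:16-alpine
```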
Specifying --shm-size=256m gives PostgreSQL adequate shared memory. Parallel tests (-parallel 2 or higher) in particular drive up shared-memory usage, and the 64MB default is not enough.
This was an intermittent issue, only reproducing under high test load. Root cause identification took a full day.
Whether OOM Killer is the cause can be verified with docker inspect:
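For example:

```bash
# OOMKilled=true or ExitCode=137 means the kernel killed the process.
docker inspect \
  --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' \
  "saru-postgres-${GITHUB_RUN_ID}"
```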
5. The Persistent PostgreSQL Container Pattern
Problem
Initially, each job started and stopped its own PostgreSQL container. However:
- Container startup takes 10–15 seconds each time
- Ports are not released immediately on stop, causing collisions on next startup
- Container lifecycle management becomes complex (forgotten stops, zombie containers, etc.)
Solution: Persistent Container + Per-Job Database
Combined with the --shm-size=256m from section 4, I switched to keeping the container running permanently and creating/dropping temporary databases per job:
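A sketch of the pattern; the container name, credentials, port, and database naming convention are illustrative:

```bash
# Started once on the VM, not per job: a single long-lived PostgreSQL.
docker run -d \
  --name saru-postgres-integ \
  --restart=unless-stopped \
  --shm-size=256m \
  -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test \
  -p 35432:5432 \
  postgres:16-alpine

# Per job: create a throwaway database named after the run ID...
DB_NAME="ci_${GITHUB_RUN_ID}"
docker exec saru-postgres-integ \
  psql -h 127.0.0.1 -U test -c "CREATE DATABASE ${DB_NAME};"

# ...and drop it in an always-run cleanup step at the end of the job.
docker exec saru-postgres-integ \
  psql -h 127.0.0.1 -U test -c "DROP DATABASE IF EXISTS ${DB_NAME};"
```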
Three key points:
- --restart=unless-stopped: Container auto-recovers even when the VM restarts
- Job ID as database name: Prevents interference between concurrent jobs
- Stale database cleanup: Periodically removes garbage left by failed jobs
6. Why Force TCP Connections in psql
Problem
When executing psql inside a PostgreSQL container without specifying the connection method, Unix sockets are used by default. However, PostgreSQL has an initdb → restart cycle on first startup, during which the Unix socket briefly disappears:
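The fix is a single flag (container and user names are illustrative):

```bash
# -h 127.0.0.1 forces a TCP connection instead of the default Unix socket.
docker exec saru-postgres-integ \
  psql -h 127.0.0.1 -U test -c "SELECT 1;"
```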
Adding -h 127.0.0.1 forces TCP connection, making connection failure errors clearer (a definitive “connection refused” rather than an ambiguous “socket file not found”).
Lesson: Always add -h 127.0.0.1 to psql calls in CI scripts.
7. Docker Network Pool Exhaustion
Problem
One day, all E2E jobs suddenly started failing. Error message:
Error response from daemon: could not find an available,
non-overlapping IPv4 address pool among the defaults to
assign to the network
Docker Compose creates a bridge network per project. Docker carves these out of its default address pools (172.17.0.0/16 through 172.31.0.0/16 as /16 subnets, plus 192.168.0.0/16 split into /20s), which limits the number of available networks to roughly 30. When 15 runners simultaneously run E2E tests and each job creates multiple networks, this pool runs dry.
Solution
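A sketch, assuming job networks share a recognizable name prefix (the saru-e2e prefix is hypothetical):

```bash
# Remove leftover job networks that have no containers attached.
docker network ls --format '{{.Name}}' --filter 'name=saru-e2e' \
  | while read -r net; do
      attached="$(docker network inspect --format '{{len .Containers}}' "${net}")"
      if [ "${attached}" -eq 0 ]; then
        docker network rm "${net}" || true   # tolerate races with other jobs
      fi
    done
```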
Delete unused old networks at the start of each job. Instead of docker network prune, filter by name and only remove those with “0 connected containers.”
8. Preventing OTP Contention
Problem
E2E tests include OTP (one-time password) authentication. All E2E jobs share the same Mailpit instance and query Mailpit's API to retrieve OTP codes from the delivered emails.
The problem: when multiple jobs log in simultaneously with the same email address (e.g., system-admin@saru.local), multiple OTP emails for the same recipient arrive in Mailpit. Timestamp filtering helps somewhat, but millisecond-level contention cannot be fully prevented.
Solution: Per-Job Email Addresses
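A sketch of the convention (the account naming scheme is an assumption):

```bash
# One pre-seeded system admin account per runner; since a runner executes
# only one job at a time, OTP mails in the shared Mailpit never collide.
RUNNER_NUM="${RUNNER_NAME##*-}"
export SYSTEM_ADMIN_EMAIL="system-admin-${RUNNER_NUM}@saru.local"
```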
Each job uses a different system admin account (email address), eliminating OTP email retrieval contention. Multiple system admin accounts are registered in the backend seed data.
9. The hashFiles Syntax Gotcha
Problem
I got stuck trying to use dynamic paths with GitHub Actions’ hashFiles():
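Reconstructed, the failing attempt looked something like this (cache paths and matrix values are illustrative):

```yaml
# Does NOT work: ${{ }} cannot be nested inside another expression,
# so the dynamic portal name never reaches hashFiles().
- uses: actions/cache@v4
  with:
    path: apps/${{ matrix.portal }}/node_modules
    key: ${{ runner.os }}-${{ hashFiles('apps/${{ matrix.portal }}/package-lock.json') }}
```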
hashFiles() receives its argument as written: a ${{ }} placed inside the string literal is never expanded (expressions cannot be nested), so the pattern is taken literally and the dynamic path is never evaluated.
Solution
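The working form builds the path with format() inside the same expression (paths and matrix values remain illustrative):

```yaml
- uses: actions/cache@v4
  with:
    path: apps/${{ matrix.portal }}/node_modules
    key: ${{ runner.os }}-${{ hashFiles(format('apps/{0}/package-lock.json', matrix.portal)) }}
```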
Use format() to build the path first, then pass the result to hashFiles(). This is not in the GitHub Actions documentation—I found it in a community discussion (#25718).
10. Migration Round-Trip Testing
CI tests a “migration round-trip” every time:
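A sketch of the round-trip, assuming the migrations are wrapped in make targets and run against the per-job database (the target names and connection string are assumptions):

```bash
# up -> down -> up again; any failure means a down migration is broken
# or the schema cannot be rebuilt from scratch.
export DATABASE_URL="postgres://test:test@127.0.0.1:35432/ci_${GITHUB_RUN_ID}?sslmode=disable"

make migrate-up        # apply all migrations
make migrate-down-all  # roll every migration back
make migrate-up        # apply again from a clean slate
```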
This ensures:
- Down migrations are not broken
- The “Up works once, but Up→Down→Up fails” failure mode is detected
- Safety for production rollbacks is guaranteed
11. Automatic Diagnostics on Failure
To make root cause identification easier when CI fails, diagnostic information is collected in if: failure() steps:
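A sketch of such steps (the exact information collected will vary per project):

```yaml
- name: Collect diagnostics
  if: failure()
  run: |
    mkdir -p ci-diagnostics
    docker ps -a > ci-diagnostics/containers.txt
    for id in $(docker ps -aq); do
      docker logs --tail 200 "$id" > "ci-diagnostics/logs-${id}.txt" 2>&1 || true
    done
    docker network ls > ci-diagnostics/networks.txt
    df -h  > ci-diagnostics/disk.txt
    free -h > ci-diagnostics/memory.txt

- name: Upload diagnostics
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: ci-diagnostics-${{ github.run_id }}
    path: ci-diagnostics/
```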
Just having this breaks the loop of “CI failed → check logs → can’t find the cause → re-run and pray.”
12. Disk Management
Since 15 runners share the same 200GB disk, disk management is critical:
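For example, a guard at the start of each job (the threshold is illustrative):

```bash
# Fail fast with a clear message when the shared disk is nearly full,
# instead of letting a build die halfway through.
USED_PCT="$(df --output=pcent / | tail -1 | tr -dc '0-9')"
echo "Disk usage: ${USED_PCT}%"
if [ "${USED_PCT}" -ge 90 ]; then
  echo "::error::Shared runner disk is at ${USED_PCT}%, clean up before re-running"
  exit 1
fi
```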
Each job consumes about 2GB on node_modules installs, frontend builds, Playwright browser cache, and so on. Eight concurrent jobs means 16GB, and Docker images and build cache come on top, so 200GB fills up surprisingly fast.
A periodic cleanup script is also prepared:
| Target | Retention |
|---|---|
| CI artifacts | 7 days |
| Runner _temp | 1 day |
| .turbo cache | 7 days |
| node_modules | 3 days |
| Go build cache | 14 days |
| Docker images | 30 days |
| Playwright browsers | Keep (essential for E2E) |
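A sketch of such a cron-driven cleanup, using the retention periods above (all paths are illustrative for this VM layout):

```bash
#!/usr/bin/env bash
# Periodic cleanup (run from cron); retention mirrors the table above.

find /srv/ci-artifacts                 -type f -mtime +7  -delete   # CI artifacts: 7 days
find /opt/actions-runner-*/_work/_temp -mindepth 1 -mtime +1 \
  -exec rm -rf {} + 2>/dev/null                                     # runner _temp: 1 day
find "${HOME}/.cache/go-build"         -type f -mtime +14 -delete   # Go build cache: 14 days
docker image prune --all --force --filter "until=720h"              # images unused for ~30 days
# Playwright browsers (~/.cache/ms-playwright) are deliberately kept.
```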
Summary: Lessons Learned from Self-Hosted CI
| Lesson | Details |
|---|---|
| Shared resources collide | Docker daemon, ports, networks, disk |
| Determinism over randomness | Port assignment: RUNNER_NUM-based over RUN_ID % N |
| prune is a banned weapon | docker system prune is forbidden in shared environments |
| Persistent container + temporary DB | Simplifies container lifecycle management |
| Force TCP | psql with -h 127.0.0.1 avoids Unix socket traps |
| Leave diagnostics on failure | Automate root cause identification with if: failure() |
| Don’t trust Docker defaults | Explicitly specify --shm-size, max_connections |
Self-hosted CI brings flexibility and speed that GitHub-hosted cannot match. But it also means accepting the complexity of “managing infrastructure yourself.”
In solo development, when CI breaks, you are the only one who can fix it. That is precisely why it mattered to chase down the “why” behind every problem and to build mechanisms that prevent recurrence. Every solution presented here was born from an actual incident.
Series Articles
- Part 1: Fighting Unmaintainable Complexity with Automation
- Part 2: Automating WebAuthn Tests in CI
- Part 3: Next.js x Go Monorepo Architecture
- Part 4: Multi-Tenant Isolation with PostgreSQL RLS
- Part 5: Multi-Portal Authentication Pitfalls
- Part 6: Developing a 200K-Line SaaS Alone with Claude Code
- Part 7: Landmines and Solutions in Self-Hosted CI/CD (this article)
- Part 8: Turning Solo Development into Team Development with Claude Code Agent Teams