What You Will Learn

  • Problem and solution patterns in self-hosted runner environments
  • How to prevent resource contention on a shared Docker daemon
  • How the birthday paradox broke RUN_ID-based port assignment
  • How to investigate when “CI that was working suddenly breaks”

Introduction

In Part 2, I wrote about automating E2E tests with WebAuthn and Mailpit. The tests themselves worked fine; the problem was the CI infrastructure.

Saru has 4 frontends x 4 backend APIs. E2E tests run independently for each portal, plus there are cross-portal tests (integration tests between portals). Running these in parallel means 7+ jobs executing simultaneously.

Initially, I used GitHub-hosted runners, but E2E tests require a database, Keycloak, and a mail server—lots of Docker containers. GitHub-hosted runners are slow to set up each time and cost-inefficient for parallel execution.

So I migrated to self-hosted runners. The decision was correct, but a flood of new problems emerged.

This article chronicles the problems encountered in a self-hosted CI environment and their solutions, in chronological order. I hope it helps anyone adopting a similar configuration.

1. Docker Desktop/WSL2 Was Too Unstable

The Initial Setup

I initially ran runners on Docker Desktop (WSL2 backend) on Windows. The setup was simple:

Windows Host
  └─ WSL2
      └─ Docker Desktop
          └─ GitHub Actions Runner x N

The problem: “containers randomly die.” During E2E tests, the PostgreSQL container would suddenly vanish, or Keycloak would become unresponsive. docker inspect showed Exit Code: 137 (SIGKILL).
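
A quick way to see how a vanished container died (the container name here is illustrative):

# 137 = 128 + 9, i.e. the process was killed with SIGKILL
docker inspect saru-postgres-e2e \
  --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'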

Tracing the cause led to Docker Desktop/WSL2’s virtualization layer:

Container → Docker Engine → WSL2 → Hyper-V → Windows

WSL2 itself runs as a lightweight Hyper-V VM, with Docker stacked on top. When memory pressure rises, the Linux OOM Killer inside WSL2 fires and kills Docker containers indiscriminately.

Migration to Hyper-V VM

The solution was to bypass WSL2 and create an Ubuntu VM directly on Hyper-V:

Container → Docker Engine → Ubuntu VM → Hyper-V → Windows

Item      Value
VM Name   saru-ci-runner
OS        Ubuntu 24.04
vCPU      16
Memory    64GB
Disk      200GB
Network   External Switch (bridged)

A dedicated Hyper-V VM removes the Docker Desktop/WSL2 integration layer and has more predictable memory management. WSL2 allocates memory dynamically, shared with the host (by default 50% of host RAM or 8GB, whichever is smaller), while the Hyper-V VM gets a fixed allocation, reducing the risk of OOM Killer strikes.

On top of this, I deployed 15 GitHub Actions Runners as systemd services:

# saru-hyperv-1 through saru-hyperv-15
for i in $(seq 1 15); do
  sudo systemctl status actions.runner.ko-chan-saru.saru-hyperv-$i
done
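
For reference, each runner was registered with the standard config.sh/svc.sh tooling that ships with the Actions runner, roughly like this (the URL, token, and labels are placeholders):

# Register runner N and install it as a systemd service (run inside each runner's directory)
./config.sh --unattended \
  --url https://github.com/OWNER/REPO \
  --token "${REG_TOKEN}" \
  --name "saru-hyperv-${i}" \
  --labels self-hosted,linux,x64
sudo ./svc.sh install && sudo ./svc.sh start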

Fifteen runners now shared a single Docker daemon, and that sharing would later cause many of the problems below.

2. Port Collisions via the Birthday Paradox

Problem

E2E tests have each job start its own PostgreSQL, Keycloak, frontend, and backend. To avoid port collisions, I calculated port numbers from the GitHub Actions RUN_ID:

# Initial implementation (problematic)
PORT_OFFSET=$(( RUN_ID % 3000 ))
POSTGRES_PORT=$(( 10000 + PORT_OFFSET ))

This looks fine, but since 15 runners share a single Docker daemon, ports from simultaneously running jobs can collide.

This has the same structure as the birthday paradox. With 3000 possible port offsets and 5 concurrent jobs, the collision probability is about 0.33% (1 - 3000!/(3000^5 × 2995!)). Seems trivial, but when CI runs dozens of times per day, collisions happen several times a week. And when ports collide, you get the cryptic error “container started but service unreachable.”
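
You can sanity-check that figure with a one-liner (5 jobs drawing offsets from 3000 slots):

# P(collision) = 1 - (2999*2998*2997*2996)/3000^4, roughly 0.33%
awk 'BEGIN { p = 1; for (i = 1; i < 5; i++) p *= (3000 - i) / 3000; printf "%.2f%%\n", (1 - p) * 100 }'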

Solution: RUNNER_NAME-based allocation

Instead of the random RUN_ID, I switched to deterministic port assignment from the runner name:

# Extract runner number from name (e.g., saru-hyperv-7 → 7)
if [[ "${RUNNER_NAME}" =~ saru-hyperv-([0-9]+) ]]; then
  RUNNER_NUM=${BASH_REMATCH[1]}
else
  RUNNER_NUM=$(( (RUN_ID % 15) + 1 ))
fi

# Allocate 200-port blocks per runner
RUNNER_BLOCK=$((RUNNER_NUM * 200))
PORTAL_OFFSET=$((PORTAL_INDEX * 10))

# Frontend/Backend: 20000 + (RUNNER_NUM × 200) + (PORTAL_INDEX × 10) + {0,1,2,3}
BASE_PORT=20000
OFFSET=$((RUNNER_BLOCK + PORTAL_OFFSET))
PORTAL_PORT=$((BASE_PORT + OFFSET + 1))
API_PORT=$((BASE_PORT + OFFSET + 2))

# Infra: 30000 + (RUNNER_NUM × 1000) + {0,100,200,...} + PORTAL_INDEX
BASE_INFRA_PORT=30000
INFRA_RUNNER_BLOCK=$((RUNNER_NUM * 1000))
POSTGRES_PORT=$((BASE_INFRA_PORT + INFRA_RUNNER_BLOCK + PORTAL_INDEX))
KEYCLOAK_PORT=$((BASE_INFRA_PORT + INFRA_RUNNER_BLOCK + 100 + PORTAL_INDEX))

The key insight: “each runner executes only one job at a time.” Since the runner number uniquely determines the port block, collisions are impossible by design:

Runner           Frontend Range   Infra Range
saru-hyperv-1    20200–20313      31000–31510
saru-hyperv-2    20400–20513      32000–32510
saru-hyperv-15   23000–23113      45000–45510

All ports confirmed to fit within 65535.
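
The upper bound is easy to check: even the very top of runner 15's blocks stays far below 65535.

# Top of runner 15's 200-port frontend block and 1000-port infra block
echo $(( 20000 + 15 * 200 + 199 ))    # 23199
echo $(( 30000 + 15 * 1000 + 999 ))   # 45999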

3. docker system prune Kills Other Jobs’ Containers

Problem

I had Docker cleanup at the end of each CI job:

# ⚠️ This was the problem
- name: Cleanup
  run: docker system prune -f

docker system prune deletes all stopped containers. Since 15 runners share a single Docker daemon, one job’s cleanup can destroy containers another job is actively using.

The trickiest window is right after container startup: if another job's prune runs after docker compose up (or docker run) has created a container but before its health check passes, the still-starting container can be judged “stopped” and deleted.

Solution: Targeted cleanup

# ✅ Safe cleanup
# docker system prune and docker container prune are FORBIDDEN
# They destroy concurrent jobs' containers
# Only remove dangling images older than 24 hours
docker image prune -f --filter "until=24h" 2>/dev/null || true

Container deletion targets only those tied to your own RUN_ID. Name patterns filter out persistent containers (saru-postgres-integ, etc.):

# Delete only this job's containers (protect persistent ones)
RUN_ID="${{ github.run_id }}"
for container in $(docker ps -a --format '{{.Names}}' \
  | { grep "^saru-" | grep -v "saru-postgres-integ\|saru-keycloak-dev\|saru-mailpit-dev" || true; }); do
  if [[ "$container" =~ "${RUN_ID}" ]]; then
    docker rm -f "$container" 2>/dev/null || true
  fi
done

Lesson: on a shared Docker daemon, docker system prune is a forbidden move. Always scope deletion to your own job's resources.

4. PostgreSQL Silently Crashes from Shared Memory Exhaustion

Problem

PostgreSQL containers in CI started crashing right after startup: pg_isready succeeds for a moment, but the psql command that follows fails with “container is not running”:

✓ pg_isready -U test → success
✗ psql -U test -c "CREATE DATABASE ..." → container is not running

This is extremely confusing. My first suspect was the first-boot restart: PostgreSQL restarts internally after initdb, so if pg_isready happens to succeed just before that restart, the server process is gone by the time the next command runs.

But the real cause was different. Docker’s default /dev/shm size (64MB) is insufficient for PostgreSQL.
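
The 64MB default is easy to confirm with a throwaway container:

# /dev/shm inside a fresh container is a 64MB tmpfs by default
docker run --rm postgres:16-alpine df -h /dev/shm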

Solution

# --shm-size=256m is the critical flag here
docker run -d \
  --name saru-postgres-ci \
  --shm-size=256m \
  --restart=unless-stopped \
  -e POSTGRES_USER=test \
  -e POSTGRES_PASSWORD=test \
  -p 15432:5432 \
  --health-cmd "pg_isready -U test" \
  postgres:16-alpine \
  postgres -c max_connections=200

Specifying --shm-size=256m gives PostgreSQL enough shared memory. Parallel tests (-parallel 2 or higher) are especially hungry for shared memory, and the 64MB default is not enough.

This was an intermittent issue that only reproduced under heavy test load; identifying the root cause took a full day.

Whether OOM Killer is the cause can be verified with docker inspect:

1
2
docker inspect "${CONTAINER}" --format '{{.State.OOMKilled}}'
# true means memory exhaustion was the cause

5. The Persistent PostgreSQL Container Pattern

Problem

Initially, each job started and stopped its own PostgreSQL container. However:

  • Container startup takes 10–15 seconds each time
  • Ports are not released immediately on stop, causing collisions on next startup
  • Container lifecycle management becomes complex (forgotten stops, zombie containers, etc.)

Solution: Persistent Container + Per-Job Database

Combined with the --shm-size=256m from section 4, I switched to keeping the container running permanently and creating/dropping temporary databases per job:

# Start PostgreSQL container if not running (first time only)
- name: Start persistent PostgreSQL
  run: |
    POSTGRES_CONTAINER="saru-postgres-integ"
    if ! docker ps --format '{{.Names}}' | grep -qx "${POSTGRES_CONTAINER}"; then
      docker run -d \
        --name "${POSTGRES_CONTAINER}" \
        --shm-size=256m \
        --restart=unless-stopped \
        -e POSTGRES_USER=test \
        -e POSTGRES_PASSWORD=test \
        -p 15432:5432 \
        postgres:16-alpine
    fi

    # Wait until PostgreSQL accepts TCP connections (first boot runs initdb)
    until docker exec "${POSTGRES_CONTAINER}" pg_isready -U test -h 127.0.0.1 >/dev/null 2>&1; do
      sleep 1
    done

    # Create database for this job
    DB_NAME="integ_${{ github.run_id }}"
    docker exec "${POSTGRES_CONTAINER}" \
      psql -U test -h 127.0.0.1 -c "CREATE DATABASE \"${DB_NAME}\" OWNER test;"

# Delete only the database at job end
- name: Cleanup database
  if: always()
  run: |
    docker exec saru-postgres-integ \
      psql -U test -h 127.0.0.1 \
      -c "DROP DATABASE IF EXISTS \"integ_${{ github.run_id }}\";"

    # Also clean up stale databases from past failed jobs.
    # DROP DATABASE is refused (and skipped here) for any database another
    # job is still connected to, so in-use databases survive this sweep.
    STALE_DBS=$(docker exec saru-postgres-integ psql -U test \
      -d postgres -h 127.0.0.1 -tAc \
      "SELECT datname FROM pg_database WHERE datname LIKE 'integ_%';")
    for DB in $STALE_DBS; do
      docker exec saru-postgres-integ psql -U test \
        -d postgres -h 127.0.0.1 \
        -c "DROP DATABASE IF EXISTS \"${DB}\";" || true
    done

Three key points:

  1. --restart=unless-stopped: Container auto-recovers even when VM restarts
  2. Job ID as database name: Prevents interference between concurrent jobs
  3. Stale database cleanup: Periodically removes garbage left by failed jobs
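
To make point 2 concrete, a test job simply points its connection string at its own database on the shared instance (the DSN shape below is illustrative; credentials and port come from the earlier examples):

# Each job talks only to its own database on the persistent container
DB_NAME="integ_${GITHUB_RUN_ID}"
export DATABASE_URL="postgres://test:test@127.0.0.1:15432/${DB_NAME}?sslmode=disable"
go run ./cmd/migrate -action up   # then run the tests against the same DSN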

6. Why Force TCP Connections in psql

Problem

When executing psql inside a PostgreSQL container without specifying the connection method, Unix sockets are used by default. However, PostgreSQL has an initdb → restart cycle on first startup, during which the Unix socket briefly disappears:

# ⚠️ Unix socket (default): may fail during restart
docker exec postgres psql -U test -c "SELECT 1"

# ✅ TCP connection: retries work during restart
docker exec postgres psql -U test -h 127.0.0.1 -c "SELECT 1"

Adding -h 127.0.0.1 forces TCP connection, making connection failure errors clearer (a definitive “connection refused” rather than an ambiguous “socket file not found”).
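
Combined with a short retry loop, the TCP form makes a dependable readiness gate (a sketch; the container name and user come from the earlier examples):

# Keep retrying over TCP until the real server answers; during first-boot init
# the temporary server only listens on the Unix socket, so TCP stays refused
for i in $(seq 1 30); do
  if docker exec saru-postgres-ci psql -U test -h 127.0.0.1 -c "SELECT 1" >/dev/null 2>&1; then
    break
  fi
  sleep 1
done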

Lesson: Always add -h 127.0.0.1 to psql calls in CI scripts.

7. Docker Network Pool Exhaustion

Problem

One day, all E2E jobs suddenly started failing. Error message:

Error response from daemon: could not find an available,
non-overlapping IPv4 address pool among the defaults to
assign to the network

Docker Compose creates a bridge network per project. Docker carves these networks out of its default address pools (172.17.0.0/16 through 172.31.0.0/16 as /16 blocks, plus 192.168.0.0/16 in /20 blocks), which caps the total at roughly 30 networks. When 15 runners run E2E tests simultaneously and each job creates multiple networks, the pool runs dry.
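
A quick way to see how close the daemon is to that ceiling:

# Count the bridge networks currently allocated (the default pools max out around 30)
docker network ls --filter driver=bridge -q | wc -l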

Solution

# Delete old networks (protect current RUN_ID's)
RUN_ID="${{ github.run_id }}"
for net in $(docker network ls --format '{{.Name}}' | { grep "^saru-ci-" || true; }); do
  if [[ ! "$net" =~ "${RUN_ID}" ]]; then
    containers=$(docker network inspect "$net" \
      --format '{{len .Containers}}' 2>/dev/null || echo "in-use")
    if [ "$containers" = "0" ]; then
      docker network rm "$net" 2>/dev/null || true
    fi
  fi
done

Delete unused old networks at the start of each job. Instead of docker network prune, filter by name and only remove those with “0 connected containers.”

8. Preventing OTP Contention

Problem

E2E tests include OTP (one-time password) authentication. Each E2E job shares the same Mailpit instance, searching Mailpit’s API for emails to retrieve OTP codes.

The problem: when multiple jobs log in simultaneously with the same email address (e.g., system-admin@saru.local), multiple OTP emails for the same recipient arrive in Mailpit. Timestamp filtering helps somewhat, but millisecond-level contention cannot be fully prevented.

Solution: Per-Job Email Addresses

matrix:
  portal:
    - name: system-auth
      system_email: "system-admin@saru.local"
    - name: system-entities
      system_email: "system-entities@saru.local"
    - name: system-products
      system_email: "system-products@saru.local"
    - name: system-misc
      system_email: "system-misc@saru.local"

Each job uses a different system admin account (email address), eliminating OTP email retrieval contention. Multiple system admin accounts are registered in the backend seed data.
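
With unique recipients, each job can also scope its Mailpit query to its own address. A rough sketch against Mailpit's search API (the port, query syntax, and JSON field names are assumptions; check them against your Mailpit version):

# Grab the ID of the newest message addressed to this job's account,
# then parse the OTP out of that message's body as in Part 2
EMAIL="system-entities@saru.local"
curl -s -G "http://localhost:8025/api/v1/search" \
  --data-urlencode "query=to:${EMAIL}" | jq -r '.messages[0].ID'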

9. The hashFiles Syntax Gotcha

Problem

I got stuck trying to use dynamic paths with GitHub Actions’ hashFiles():

# ⚠️ This doesn't work (expressions have no + string-concatenation operator)
key: ${{ runner.os }}-turbo-${{ hashFiles('apps/' + matrix.app + '/**') }}

GitHub Actions expressions have no string-concatenation operator, so you cannot build the glob pattern inline with +.

Solution

# ✅ Use the format() helper
key: ${{ runner.os }}-turbo-${{ matrix.app }}-${{ hashFiles(format('apps/{0}/**', matrix.app), 'packages/**', 'pnpm-lock.yaml', 'turbo.json') }}

Use format() to build the path first, then pass the result to hashFiles(). This is not in the GitHub Actions documentation—I found it in a community discussion (#25718).

10. Migration Round-Trip Testing

CI tests a “migration round-trip” every time:

# Up → Down 1 step → Up again
DATABASE_URL="..." go run ./cmd/migrate -action up
DATABASE_URL="..." go run ./cmd/migrate -action down -steps 1
DATABASE_URL="..." go run ./cmd/migrate -action up

This ensures:

  • Down migrations are not broken
  • The failure mode where you can never run Up again after an Up → Down is detected
  • Safety for production rollbacks is guaranteed

11. Automatic Diagnostics on Failure

To make root cause identification easier when CI fails, diagnostic information is collected in if: failure() steps:

- name: Diagnose PostgreSQL on failure
  if: failure()
  run: |
    # Container state
    docker inspect "${POSTGRES_CONTAINER}" --format '{{.State.Status}}'

    # Active connections
    docker exec "${POSTGRES_CONTAINER}" psql -U test -h 127.0.0.1 \
      -c "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"

    # Container logs (last 30 lines)
    docker logs "${POSTGRES_CONTAINER}" --tail 30

    # Host memory status
    free -h

Just having this breaks the loop of “CI failed → check logs → can’t find the cause → re-run and pray.”

12. Disk Management

Since 15 runners share the same 200GB disk, disk management is critical:

strategy:
  max-parallel: 2  # Limit concurrency to prevent disk exhaustion

Each job consumes roughly 2GB for node_modules installation, frontend builds, Playwright browser cache, and so on. Eight concurrent jobs already means 16GB, and on top of that come Docker images and build caches, so 200GB fills up quickly.

A periodic cleanup script is also prepared:

Target               Retention
CI artifacts         7 days
Runner _temp         1 day
.turbo cache         7 days
node_modules         3 days
Go build cache       14 days
Docker images        30 days
Playwright browsers  Keep (essential for E2E)
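
The script itself is a handful of find/prune calls driven by cron. A minimal sketch (the paths are illustrative for this VM; the retention values match the table above):

#!/usr/bin/env bash
# Periodic disk cleanup on the runner VM
set -euo pipefail

# CI artifacts: keep 7 days
find /srv/ci-artifacts -mindepth 1 -mtime +7 -exec rm -rf {} + 2>/dev/null || true

# Runner _temp: keep 1 day (one _work/_temp directory per runner)
for d in /home/runner/saru-hyperv-*/_work/_temp; do
  find "$d" -mindepth 1 -mtime +1 -exec rm -rf {} + 2>/dev/null || true
done

# Unused Docker images created more than 30 days ago
docker image prune -af --filter "until=720h"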

Summary: Lessons Learned from Self-Hosted CI

Lesson                                Details
Shared resources collide              Docker daemon, ports, networks, disk
Determinism over randomness           Assign ports by RUNNER_NUM, not RUN_ID % N
prune is a forbidden move             docker system prune is banned on a shared daemon
Persistent container + temporary DB   Simplifies container lifecycle management
Force TCP                             psql with -h 127.0.0.1 avoids Unix socket traps
Leave diagnostics on failure          Automate root cause identification with if: failure()
Don't trust Docker defaults           Explicitly specify --shm-size, max_connections

Self-hosted CI brings flexibility and speed that GitHub-hosted cannot match. But it also means accepting the complexity of “managing infrastructure yourself.”

In solo development, when CI breaks, you are the only one who can fix it. That is precisely why it was important to pursue the “why” when problems occur and build systems that prevent recurrence. Every solution presented here was born from an actual incident.


Series Articles