Building ALICE Infra (Part 1): How I Started, and Why Docker Swarm

Huy Tran Nhat
February 25, 2026
3 min read

ALICE aims to be a digital teammate: an AI application designed to turn internal workflows into seamless, cross-functional collaboration. (You can learn more about its origin and vision in this blog post.) To make it reliable, I needed an infrastructure that is simple, safe, and repeatable.

Where I started

I started with one clear pain: I did not want to deploy manually anymore.

Manual deploys are slow and risky:

  • steps get missed
  • releases are inconsistent
  • rollback is stressful

So my goal from day one was:

  1. Auto-deploy (no SSH, no manual steps)
  2. Add reliability early: restart, rollback, scaling
  3. Do it without heavy tooling (small team, simple ops)

That’s why I began by containerizing everything, then adding a lightweight orchestration layer.
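Containerizing a service starts with a Dockerfile. Here is a minimal sketch — the base image, port, and entrypoint are illustrative assumptions, not ALICE's actual setup:

```dockerfile
# Minimal service image sketch (base image, port, and command are placeholders)
FROM node:20-alpine

WORKDIR /app

# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the application code
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```

The same image then runs unchanged on a laptop and in the cluster, which is what makes the orchestration layer on top of it worthwhile.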


What is Docker Swarm? (simple explanation)

Docker Swarm is Docker’s built-in orchestration tool. It lets you run containers across multiple machines as one cluster.

In practice, Swarm gave me:

  • Manager node: schedules services and maintains cluster state
  • Worker node: runs containers
  • Services: define what to run (image, env, replicas)
  • Rolling updates: safer releases with lower downtime
  • Overlay network: service-to-service communication inside the cluster
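The concepts above map directly onto a Swarm stack file. A hedged sketch (the service name, image, and network are placeholders, not ALICE's real config):

```yaml
version: "3.8"

services:
  api:
    image: registry.example.com/alice-api:1.2.3   # placeholder image
    deploy:
      replicas: 2                 # Swarm keeps this many tasks running
      update_config:
        parallelism: 1            # roll one replica at a time
        order: start-first        # start the new task before stopping the old
        failure_action: rollback  # revert automatically if the update fails
      restart_policy:
        condition: on-failure     # restart crashed containers
    networks:
      - backend

networks:
  backend:
    driver: overlay               # service-to-service traffic inside the cluster
```

Everything in the bullet list — replicas, restarts, rolling updates, the overlay network — lives in this one declarative file.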

[Diagram: Docker Swarm concepts]


Why I chose Swarm for ALICE

I didn’t choose Swarm because it’s trendy. I chose it because it matched ALICE’s stage:

  • Docker-first workflow (easy local → prod parity)
  • Less complexity than Kubernetes
  • Still gives the essentials: health, restart, scale, rolling deploy
  • Fits a small setup: 2 nodes (1 manager + 1 worker)

For ALICE's current stage, this was the right balance of speed and stability.
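Standing up that two-node cluster takes only a couple of commands. A sketch — the IP address is a placeholder, and the worker token is whatever `docker swarm init` prints:

```shell
# On the manager node (address is a placeholder)
docker swarm init --advertise-addr 10.0.1.10

# On the worker node, using the join token printed by the init command
docker swarm join --token <worker-token> 10.0.1.10:2377

# Back on the manager: confirm both nodes are in the cluster
docker node ls
```

That low setup cost is a big part of why Swarm fits a small team: there is no separate control plane to install or operate.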


The first system design decision

The core design decision was:

  • Package ALICE into Docker images
  • Run them as Swarm services
  • Keep staging and production separated (different stacks/networks)
  • Keep compute private (I’ll explain in Part 2)

This made deployments predictable: define the desired state, then apply it.
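"Define the desired state, then apply it" in practice means `docker stack deploy`. An illustrative sketch (stack and file names are assumptions, not ALICE's real ones):

```shell
# Each environment is its own stack, with its own services and networks
docker stack deploy -c stack.staging.yml alice-staging
docker stack deploy -c stack.production.yml alice-prod

# Re-running the same command with an updated file applies the new
# desired state — Swarm reconciles it as a rolling update
docker stack services alice-prod
```

Because the command is idempotent, the deploy pipeline never needs to reason about the cluster's current state; it only has to apply the file.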

[Diagram: High-level ALICE infra (simplified overview)]


What was interesting (early lessons)

A few things surprised me:

  • Simplicity is a feature. Fewer moving parts = fewer incidents.
  • Networking matters more than CPU for app reliability.
  • The biggest win is not scaling up. It’s deploying safely, every time.

Closing

In Part 2, I’ll share the production design on AWS: VPC, public ALB (TLS), private Swarm cluster, NAT, and the CodeBuild → ECR → deploy workflow.

Try ALICE now: https://app.heyalice.net/

Contact us at: contact@atware.asia