Building ALICE Infra (Part 1): How I Started, and Why Docker Swarm

Huy Tran Nhat
February 25, 2026
3 min read

ALICE aims to be a digital teammate: an AI application designed to turn internal workflows into seamless, cross-functional collaboration. (You can learn more about its origin and vision in this blog post.) To make it reliable, I needed an infrastructure that is simple, safe, and repeatable.

Where I started

I started with one clear pain: I did not want to deploy manually anymore.

Manual deploys are slow and risky:

  • steps get missed
  • releases are inconsistent
  • rollback is stressful

So my goal from day one was:

  1. Auto-deploy (no SSH, no manual steps)
  2. Add reliability early: restart, rollback, scaling
  3. Do it without heavy tooling (small team, simple ops)

That’s why I began by containerizing everything, then adding a lightweight orchestration layer.
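Containerizing a service starts with a Dockerfile. Here is a minimal sketch — the base image, port, and entrypoint are illustrative assumptions, not ALICE's actual setup:

```dockerfile
# Minimal service image sketch (base image, port, and command are placeholders)
FROM node:20-alpine

WORKDIR /app

# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the application code
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```

The same image then runs unchanged on a laptop and in the cluster, which is what makes the orchestration layer on top of it worthwhile.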


What is Docker Swarm? (simple explanation)

Docker Swarm is Docker’s built-in orchestration tool. It lets you run containers across multiple machines as one cluster.

In practice, Swarm gave me:

  • Manager node: schedules services and maintains cluster state
  • Worker node: runs containers
  • Services: define what to run (image, env, replicas)
  • Rolling updates: safer releases with lower downtime
  • Overlay network: service-to-service communication inside the cluster
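The concepts above map directly onto a Swarm stack file. A hedged sketch (the service name, image, and network are placeholders, not ALICE's real config):

```yaml
version: "3.8"

services:
  api:
    image: registry.example.com/alice-api:1.2.3   # placeholder image
    deploy:
      replicas: 2                 # Swarm keeps this many tasks running
      update_config:
        parallelism: 1            # roll one replica at a time
        order: start-first        # start the new task before stopping the old
        failure_action: rollback  # revert automatically if the update fails
      restart_policy:
        condition: on-failure     # restart crashed containers
    networks:
      - backend

networks:
  backend:
    driver: overlay               # service-to-service traffic inside the cluster
```

Everything in the bullet list — replicas, restarts, rolling updates, the overlay network — lives in this one declarative file.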

[Diagram: Docker Swarm concepts]


Why I chose Swarm for ALICE

I didn’t choose Swarm because it’s trendy. I chose it because it matched ALICE’s stage:

  • Docker-first workflow (easy local → prod parity)
  • Less complexity than Kubernetes
  • Still gives the essentials: health, restart, scale, rolling deploy
  • Fits a small setup: 2 nodes (1 manager + 1 worker)

For ALICE's current stage, this was the right balance of speed and stability.
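Standing up that two-node cluster takes only a couple of commands. A sketch — the IP address is a placeholder, and the worker token is whatever `docker swarm init` prints:

```shell
# On the manager node (address is a placeholder)
docker swarm init --advertise-addr 10.0.1.10

# On the worker node, using the join token printed by the init command
docker swarm join --token <worker-token> 10.0.1.10:2377

# Back on the manager: confirm both nodes are in the cluster
docker node ls
```

That low setup cost is a big part of why Swarm fits a small team: there is no separate control plane to install or operate.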


The first system design decision

The core design decision was:

  • Package ALICE into Docker images
  • Run them as Swarm services
  • Keep staging and production separated (different stacks/networks)
  • Keep compute private (I’ll explain in Part 2)

This made deployments predictable: define the desired state, then apply it.
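"Define the desired state, then apply it" in practice means `docker stack deploy`. An illustrative sketch (stack and file names are assumptions, not ALICE's real ones):

```shell
# Each environment is its own stack, with its own services and networks
docker stack deploy -c stack.staging.yml alice-staging
docker stack deploy -c stack.production.yml alice-prod

# Re-running the same command with an updated file applies the new
# desired state — Swarm reconciles it as a rolling update
docker stack services alice-prod
```

Because the command is idempotent, the deploy pipeline never needs to reason about the cluster's current state; it only has to apply the file.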

[Diagram: High-level ALICE infra (simplified overview)]


What was interesting (early lessons)

A few things surprised me:

  • Simplicity is a feature. Fewer moving parts = fewer incidents.
  • Networking matters more than CPU for app reliability.
  • The biggest win is not scaling up. It’s deploying safely, every time.

Closing

In Part 2, I’ll share the production design on AWS: VPC, public ALB (TLS), private Swarm cluster, NAT, and the CodeBuild → ECR → deploy workflow.

Try ALICE now: https://app.heyalice.net/

Contact us at: contact@atware.asia