Dagger CI on Fly for $3.50/mo

Dmitry Ilyevsky

Founder & CTO
Engineering
February 14, 2024
5 min read

Sooner or later, every developer who needs to deploy their software to production will face the challenges of managing a CI/CD pipeline—well, at least the ones who want to make sure the product is still working after deployment. Many of us have been there: the time from commit to successful deploy (or test failure) increases while productivity takes a dive and costs grow. On top of that, you’re often bogged down by the explosion in complexity of managing an amalgamation of Bash scripts and YAML cobbled together to define your CI/CD "business logic."

At Apoxy, we started to feel the pain even at the MVP stage. Our GitHub Actions pipelines got slower and slower as we dashed towards our launch, and speeding them up meant either paying for bigger runners or bringing our own compute. After briefly thinking through our options and bumping into annoying environment-consistency issues while trying to scale GitHub Actions, we decided to write our CI/CD pipeline using Dagger and run it on Fly.io. The results were so surprisingly good we had to share them.

What is Dagger?

Architecturally, Dagger follows a client/server design. The client SDK defines your pipeline graph and talks via API to the Dagger Engine, the server-side component that executes pipeline steps and imports/exports artifacts. This architecture means the Dagger Engine can run anywhere. By default, the Dagger CLI spins up the engine inside a local Docker container and connects to it via a Unix socket, but there is also the option of using a TCP socket so that the engine can run remotely.
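
To make the split concrete, here is a minimal sketch of what the client side looks like with the Go SDK (the base image and command are placeholders, not our actual pipeline):

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "dagger.io/dagger"
)

func main() {
    ctx := context.Background()

    // Connect to a Dagger Engine. By default this starts one in a local Docker
    // container; _EXPERIMENTAL_DAGGER_RUNNER_HOST can point it elsewhere.
    client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Define a trivial pipeline step; the graph only executes when a result
    // (here, Stdout) is requested.
    out, err := client.Container().
        From("golang:1.21").
        WithExec([]string{"go", "version"}).
        Stdout(ctx)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out)
}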

What is Fly?

Fly.io transforms containers into micro-VMs that run on hardware in 30+ regions on six continents. Basically, you can hand them a Docker container (like a Dagger Engine container) and they’ll autoscale it behind a global anycast load balancer that supports TCP. They even scale-to-zero so that those idle hours are free. Docker + TCP + scale-to-zero ended up being the winning combination for us.

How We Did It

Our new CI/CD pipeline is still triggered by GitHub Actions: the GHA workflow fires on a merge to the main branch. The pipeline then checks out the source tree, runs a quick setup to download the Dagger CLI and the Go toolchain, compiles the pipeline definition, and runs Dagger, which connects to a remote Dagger Engine running on a Fly Machine that is spun up on demand and executes all the pipeline steps. Simple, right? Here’s what that looks like:

Yes, our Dagger pipeline has taken on a lot more complexity, but because it is written in a Proper Programming Language we are much more comfortable implementing the entirety of our build and deployment logic in one place. Our pipeline runs Bazel inside a container with shared caches attached and then exports OCI tarballs for subsequent publishing to our Docker registry, along with other things that would get you in trouble with the operations team if you tried to do them in Bash.

// Mount the source tree and shared caches into the builder container, run
// Bazel, and copy the resulting OCI tarball to /output for export.
builder = builder.
    WithDirectory("/workspace", client.Host().Directory("."),
        dagger.ContainerWithDirectoryOpts{
            Exclude: []string{"bazel-*", ".gitignore", ".git/"},
        }).
    WithWorkdir("/workspace").
    // Persistent cache volumes keep Bazel and Go dependencies warm across runs.
    WithMountedCache("/root/.cache/", client.CacheVolume("cache-"+hostPlatform)).
    WithMountedCache("/tmp/zig-cache", client.CacheVolume("zig-cache-"+hostPlatform)).
    WithDirectory("/output", client.Directory()).
    WithExec([]string{
        "./bazel.sh",
        "build",
        "--repository_cache=/root/.cache/bazel/repository",
        "--disk_cache=/root/.cache/bazel/disk",
        "--jobs=32",
        "--config=amd64",
        "//cmd/apiserver:oci_tarball",
    }).
    WithExec([]string{"cp", "-L", "bazel-bin/cmd/apiserver/oci_tarball/tarball.tar", "/output/"})

// Import the Bazel-built tarball as a container image and publish it to our
// registry under a canary tag derived from the git ref and commit SHA.
canaryTag := fmt.Sprintf("%s-%s", ref, sha)
digest, err := client.Container(dagger.ContainerOpts{Platform: "linux/amd64"}).
    Import(builder.File("/output/tarball.tar")).
    WithRegistryAuth("gcr.io", "_json_key", client.SetSecret("gcr-password", string(gcrCreds))).
    Publish(ctx, apiServerImageBase+":"+canaryTag)
if err != nil {
    log.Fatal(err)
}

Then we run a set of integration tests (also from Dagger) before triggering an actual production deploy.
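
The integration-test stage follows the same pattern and continues from the publish step above; here is a rough sketch (the exposed port, service alias, and test path are illustrative, not our exact code):

// Run the canary image as a service and point the integration tests at it.
apiserver := client.Container(dagger.ContainerOpts{Platform: "linux/amd64"}).
    From(apiServerImageBase + ":" + canaryTag).
    WithExposedPort(8443).
    AsService()

_, err = client.Container().
    From("golang:1.21").
    WithDirectory("/workspace", client.Host().Directory(".")).
    WithWorkdir("/workspace").
    // The tests reach the canary via the "apiserver" hostname.
    WithServiceBinding("apiserver", apiserver).
    WithExec([]string{"go", "test", "./test/integration/..."}).
    Sync(ctx)
if err != nil {
    log.Fatal(err)
}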

To spin up our remote Fly workers on demand, we set up our app config (fly.toml) with a TCP service on port 8080 and configure auto start/stop:

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  [[services.ports]]
    port = 8080
  [services.concurrency]
    type = "connections"
    hard_limit = 3
    soft_limit = 1

Our top-level GHA pipeline is now super simple and looks like this:

steps:
  - uses: actions/checkout@v3
  - name: Setup Go
    uses: actions/setup-go@v4
    with:
      go-version: '>=1.21.5'
  - name: Install Dagger CLI
    run: cd /usr/local && { curl -L https://dl.dagger.io/dagger/install.sh | sh; cd -; }
  - name: Run Dagger pipeline
    run: dagger run --progress=plain go run ci/main.go
    env:
      # DO NOT USE: high chance of p0wnage!
      _EXPERIMENTAL_DAGGER_RUNNER_HOST: tcp://amazing-dagger-worker.fly.dev:8080

The _EXPERIMENTAL_DAGGER_RUNNER_HOST environment variable is a super-secret flag to enable Dagger to use a remote socket.

Easy, right? Wrong!

As you can tell from the comment above, there was one tiny problem when setting up the remote Fly worker, namely: how do we make this secure? By default, the socket used by the Dagger client to communicate with an engine is entirely unsecured. This is not an issue for local Dagger Engine deployments, which simply follow the default Docker security model, but exposing that same unencrypted socket over the public Internet is an obvious security hole.

After quickly iterating over a few options and discovering that the Dagger Engine process takes --tlscert, --tlskey, and --tlscacert flags to enable mTLS, we reached for a familiar tool: Envoy Proxy.

The plan was to create mTLS certificate pairs for both the client and server sides of the TCP connection, then configure the Dagger Engine with the requisite certificates (attached via Fly Secrets) using the corresponding flags:

[experimental]
  exec = [
    "/usr/local/bin/dagger-engine", "--debug",
    "--config", "/etc/dagger/engine.toml",
    "--addr", "tcp://0.0.0.0:8080",
    "--tlscert", "/etc/dagger-certs/server.crt",
    "", "/etc/dagger-certs/server.key",
    "--tlscacert", "/etc/dagger-certs/ca.crt",
  ]

[[files]]
  guest_path = "/etc/dagger-certs/ca.crt"
  secret_name = "DAGGER_CA"

[[files]]
  guest_path = "/etc/dagger-certs/server.crt"
  secret_name = "SERVER_CERT"

[[files]]
  guest_path = "/etc/dagger-certs/server.key"
  secret_name = "SERVER_KEY"
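
Generating the certificate pairs themselves is a handful of OpenSSL commands (the full incantations live in our repo). A rough sketch, with file names matching the config above; a production setup would want proper SANs and shorter lifetimes:

# self-signed CA
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -subj "/CN=dagger-ca" -keyout ca.key -out ca.crt

# server key + CSR, signed by the CA
openssl req -newkey rsa:4096 -nodes \
  -subj "/CN=amazing-dagger-worker.fly.dev" -keyout server.key -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -out server.crt

# client key + CSR, signed by the same CA
openssl req -newkey rsa:4096 -nodes \
  -subj "/CN=dagger-client" -keyout client.key -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -out client.crt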

Next, on the client side, we use Envoy Proxy to accept an unencrypted TCP connection on loopback and proxy it over the public Internet as an mTLS-enabled TCP connection.

The final challenge - how to run Envoy Proxy within the GHA pipeline. Easy! Use the amazing func-e tool from the Tetrate folks. It makes running Envoy so easy that even founders who are semi-permanently behind on sleep can do it:

- name: Install func-e CLI
  run: curl https://func-e.io/install.sh | bash -s -- -b /usr/local/bin
- name: Run Envoy
  run: func-e run -c ci/dagger-client.yaml &
…
- name: Run Dagger pipeline
  run: dagger run --progress=plain go run ci/main.go
  env:
    _EXPERIMENTAL_DAGGER_RUNNER_HOST: tcp://localhost:9080

Notice the runner host is no longer pointing at the remote host but rather localhost, and the dagger-client.yaml configures Envoy for mTLS:

  clusters:
  - name: dagger_tls
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: dagger_tls
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: amazing-dagger-worker.fly.dev
                port_value: 8080
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_certificates:
          - certificate_chain: {"filename": "secrets/client.crt"}
            private_key: {"filename": "secrets/client.key"}
          validation_context:
            trusted_ca:
              filename: "secrets/ca.crt"
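
The listener half of that same file is what binds localhost:9080 and forwards into the dagger_tls cluster; a minimal sketch (the listener name and stat prefix are placeholders):

  listeners:
  - name: dagger_local
    address:
      socket_address: { address: 127.0.0.1, port_value: 9080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: dagger_client
          cluster: dagger_tls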

So yes, there is still a little bit of YAML - you just can’t get away from it! At least this one is static…

Conclusions

So, we did it—the pipeline now runs in seconds, compared to the 20 minutes it took before when the whole run was inside GHA, and it does so at minimal cost. We have yet to see the final numbers, but our scrupulous calculations project about $3.50 in server/storage costs per month, since Fly charges per second (!) and no compute is used when the pipeline isn’t active. Now let’s do a quick retro:

What worked well:

  1. Fast Main CI/CD Pipeline: Thanks to Dagger's built-in (BuildKit) caching, we are not re-pulling Go/Bazel dependencies on every build. The same goes for container layers. You can also pay for Dagger Cloud to have your caches centralized in its remote cache service.
  2. Cost-Effective: Fly spins up workers on demand, when there are incoming connections, and shuts everything down when the pipeline isn’t running; it also offers simple persistent disks for our caches. Thanks to the greatly reduced runtime duration, we are able to comfortably fit into GitHub Actions' free tier again.
  3. Code Defined in Pure Go: 99% of our pipeline logic is defined in pure Go now. We use Go throughout our monorepo for most other things, so we can easily integrate CI/CD with our libraries to perform integration tests and other tasks that would otherwise require the creation of specialized tooling. You can do the same if you’re using Node, Python, or Elixir.
  4. Locally Runnable Pipeline Code: The pipeline code is now easy to run locally, increasing iteration speed and reducing the number of times you push those awkward tiny commits attempting to improve (or more likely fix) your CI/CD pipeline.

What could be improved:

  1. Managing mTLS: Managing mTLS by hand is tedious and error-prone. Long expiration durations probably will not meet many security certifications’ requirements. It doesn’t have to be like that, though—we are working on automating this and other challenges of running secure, production-grade APIs that will work with any stack. So, if you have this problem, you should talk to us!

You can find a simple version of our CI/CD pipeline in our CLI repo, including all the required setup configs and the right incantations of OpenSSL commands to produce TLS certificates.