Extremely fast Kubernetes jobs orchestration based on Lambdas

The combination of Kubernetes and Helm is widely used for service orchestration, provisioning, and release management. The node-and-pod model is flexible and efficient for scheduling predictable workloads.

However, the downside of this flexibility is that it involves two layers of orchestration: one for virtual machines (nodes) and another for containers (pods). This hurts scaling bootstrap time, since the controller has to bring up a node first and only then schedule pods onto it.

When cold-start time is crucial and jobs are spawned in unpredictable quantities, which makes it hard to optimize node size, using a vendor-locked solution like AWS Lambda directly may seem like an option. However, this approach complicates release management: deployments now need to happen in two different places, which can lead to inconsistencies.

Ideally, we want the fast start-up of Lambda while keeping release management inside Kubernetes. Let's first look at what Kubernetes itself offers.

Jobs

The way Kubernetes abstracts this kind of activity is the Job resource. It doesn't provide many tools to control the run flow of Jobs, but it allows controllers to extend it and address the issue.

💡
Kueue is an example of a very feature-rich controller: it extends Kubernetes with resources for queues, concurrency limits, resource limitations per abstract group, and more.
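For illustration, the queueing side is expressed with Kueue resources such as ClusterQueue and LocalQueue. A minimal sketch of such a setup (queue names and quotas here are assumptions, not taken from the original setup):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: jobs
spec:
  namespaceSelector: {} # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 16
            - name: memory
              nominalQuota: 64Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: service-queue
  namespace: service
spec:
  clusterQueue: jobs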

Given a well-set-up Kubernetes cluster, horizontal node autoscaling is most likely in use, so we end up with the following scaling process:

  1. A service creates a Job resource through the Kubernetes API, setting the meta fields Kueue operates with; the Job is created but not started (see the manifest sketch after this list).
  2. Kueue picks jobs according to concurrency rules and tries to schedule the workload as pod(s).
  3. The pods don't have enough resources to start, so they trigger horizontal autoscaling, which brings up a Kubernetes node.
  4. When the node is up and in sync, the pods initialize and eventually run.
  5. After a certain timeout with no jobs, the node is decommissioned.
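As a sketch of what step 1 produces, the Job below carries the Kueue queue-name label and is created suspended, so Kueue decides when it actually starts. The queue name, image, and resource requests are assumptions:

apiVersion: batch/v1
kind: Job
metadata:
  name: service-job
  namespace: service
  labels:
    kueue.x-k8s.io/queue-name: service-queue # LocalQueue that Kueue schedules from
spec:
  suspend: true # stays suspended until Kueue admits it
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: service:latest # assumed image
          resources:
            requests:
              cpu: "1"
              memory: 1Gi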

Now, from a theoretical and design standpoint, this solution is very general and flexible. It also leverages an impressive orchestration engine used for workflows far beyond jobs. However, due to its abstract and multi-layered nature, it is sluggish: from the moment you enqueue a job at step 1, it takes at least a minute just to start the container, and in certain edge cases it can easily take several minutes. Oh no!

Fargate

Before we discuss the ultimate solution, it's important to introduce Fargate. Fargate is an alternative compute engine designed specifically for this kind of problem. It integrates seamlessly with Kubernetes and lets you designate Fargate as the executor of a pod using labels. Unlike classic orchestration, Fargate aims to minimize dependent entities: instead of scaling nodes that run pods, it provides something akin to a dedicated node for each pod.
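For reference, the selection is typically expressed as a Fargate profile. The sketch below uses eksctl syntax; the cluster name, namespace, and label are assumptions:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
fargateProfiles:
  - name: jobs
    selectors:
      # pods in this namespace carrying this label are scheduled onto Fargate
      - namespace: service
        labels:
          compute: fargate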

So, where does this leave us?

  1. A service creates a Job resource in Kubernetes using the specific meta fields Kueue operates with; the Job is not initially started.
  2. Kueue selects jobs based on concurrency rules and attempts to schedule the workflow as pod(s).
  3. The pod is created directly on its own virtual node provided by Fargate.
  4. When the job is finished, the Fargate node is decommissioned along with the pod.

Eliminating an unnecessary step streamlines orchestration: a specialized service now provides on-demand resources. Does it work quickly? Not quite. While it's an improvement over the classic scheduler, it's also noticeably more expensive and still takes approximately 60 seconds to get started. Not good enough.

AWS Lambda

Lambda functions in AWS are ideal for rapid scaling. They have a cold boot time of up to 10 seconds and a warm boot time of less than a second, which makes them the best choice for extreme peaks. However, to achieve that performance, they are inherently incompatible with Kubernetes orchestration.

Or are they?

Introducing AWS Controllers for Kubernetes (ACK), a set of controllers that map managed AWS services onto Kubernetes resource objects. This enables managing those resources via Helm releases. Let's break down what this means for us in practical terms:

1. ACK continually tries to synchronize your Kubernetes Resource definition with the actual cloud service entries. This encompasses the complete cycle of creating, modifying, and removing resources.

2. Helm manages resources associated with a microservice within the context of versioned releases. As a result, within a release, we can define ACK Resources, and any changes to these resources will prompt the controller to ensure that the cloud environment reflects the changes.

3. Kubernetes Secrets remain part of the design, but at runtime they can only be accessed through the Kubernetes API. Since Lambda has no native concept of Kubernetes Secrets, that is the approach used here.

4. At the core of a Helm release is a snapshot of the application prebuilt as a container image. This same image can be utilized as a Lambda runtime by adjusting the ENTRYPOINT and CMD.

Helm Chart

We are going to provision an SQS queue, a Lambda function, and the binding between the two using the ACK and Helm integration:

Queues

Note that we create two queues in order to implement a DLQ (dead-letter queue).

apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: queue
  namespace: service
  annotations:
    services.k8s.aws/region: {{ .Values.region }}
spec:
  visibilityTimeout: "180"
  queueName: Service
  redrivePolicy: |
    {
      "deadLetterTargetArn": "arn:aws:sqs:{{ .Values.region }}:{{ .Values.account }}:ServiceDead",
      "maxReceiveCount": "5"
    }
  policy: |
    {
      "Version": "2012-10-17",
      "Id": "Service",
      "Statement": [
        {
          "Sid": "__sns",
          "Effect": "Allow",
          "Principal": {
            "Service": "sns.amazonaws.com"
          },
          "Action": [
            "sqs:SendMessage",
            "sqs:SendMessageBatch"
          ],
          "Resource": "*"
        },
        {   
          "Sid": "__service",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::{{ .Values.account }}:user/service"
          },
          "Action": "sqs:SendMessage",
          "Resource": "*"
        }
      ]
    }
---
apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: queue-dead
  namespace: service
  annotations:
    services.k8s.aws/region: {{ .Values.region }}
spec:
  queueName: ServiceDead
  policy: |
    {
      "Version": "2012-10-17",
      "Id": "Service",
      "Statement": [
        {
          "Sid": "SNS",
          "Effect": "Allow",
          "Principal": {
            "Service": "sns.amazonaws.com"
          },
          "Action": [
            "sqs:SendMessage",
            "sqs:SendMessageBatch"
          ],
          "Resource": "*"
        },
        {   
          "Sid": "__service",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::{{ .Values.account }}:user/service"
          },
          "Action": "sqs:SendMessage",
          "Resource": "*"
        }
      ]
    }

Lambda

apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: service
  namespace: service
  annotations:
    services.k8s.aws/region: {{ .Values.region  }}
spec:
  name: Service
  packageType: Image
  code:
    imageURI: {{ .Values.image }}
  role: arn:aws:iam::{{ .Values.account }}:role/{{ lower .Values.stage }}.ack-lambda-exec
  description: Service The Great
  timeout: 180
  memorySize: {{ .Values.memory }}
  imageConfig:
    command:
      - app/lambdas/service.handler
    entryPoint:
      - /lambda_entry.sh
  vpcConfig:
    securityGroupIDs:
      {{- range .Values.security_groups }}
      - {{ . | quote }}
      {{- end }}
    subnetIDs:
      {{- range .Values.subnets }}
      - {{ . | quote }}
      {{- end }}
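The entryPoint above references /lambda_entry.sh, which is not shown in this post. A minimal sketch of such a script, assuming the image also installs the AWS Lambda Runtime Interface Client for Node.js (aws-lambda-ric), could look like this:

#!/bin/sh
# The handler path ("app/lambdas/service.handler") arrives via CMD as "$@".
# aws-lambda-ric bridges a plain container image to the Lambda runtime API.
exec npx aws-lambda-ric "$@"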

Binding

apiVersion: lambda.services.k8s.aws/v1alpha1
kind: EventSourceMapping
metadata:
  name: service
  namespace: service
  annotations:
    services.k8s.aws/region: {{ .Values.region }}
spec:
  scalingConfig:
    maximumConcurrency: {{ .Values.maximumConcurrency }}
  eventSourceARN: arn:aws:sqs:{{ .Values.region }}:{{ .Values.account }}:Service
  functionName: Service
  batchSize: 1
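With the queue, the function, and the mapping in place, enqueueing a job boils down to a single SendMessage call from the service. A minimal sketch using @aws-sdk/client-sqs; the SERVICE_QUEUE_URL environment variable is an assumption:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export const enqueueJob = async (payload: unknown) => {
  // With batchSize: 1 in the EventSourceMapping, each message becomes one Lambda invocation.
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.SERVICE_QUEUE_URL, // URL of the "Service" queue provisioned above
      MessageBody: JSON.stringify(payload),
    }),
  );
};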

Lambda Service Container

We will use a container-image Lambda to make the deployment primitive compatible with Kubernetes. In this example we use a JS wrapper (pick any language from the documentation).

Handler

We will need the type definitions for Lambda event payloads to type our handler:

npm install @types/aws-lambda --save-dev

Then we need an app/lambdas/service.ts (matching the handler path configured in the Helm chart) with the following content:

import type { SQSEvent } from "aws-lambda";

import { getSecrets } from "../services/secrets.js";

export const handler = async (event: SQSEvent) => {
  const secrets = await getSecrets();

  // Do stuff here
};
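The getSecrets() helper is not shown here. A minimal sketch that reads a Kubernetes Secret over the API at runtime follows; the environment variables, the secret name, and the assumption that the cluster CA is trusted (for example via NODE_EXTRA_CA_CERTS) are all hypothetical:

// app/services/secrets.ts (hypothetical): fetch a Secret from the Kubernetes API.
const K8S_API = process.env.K8S_API_URL!;   // e.g. the EKS API server endpoint
const K8S_TOKEN = process.env.K8S_TOKEN!;   // ServiceAccount token with read access
const SECRET_PATH = "/api/v1/namespaces/service/secrets/service"; // assumed secret name

export const getSecrets = async (): Promise<Record<string, string>> => {
  const res = await fetch(`${K8S_API}${SECRET_PATH}`, {
    headers: { Authorization: `Bearer ${K8S_TOKEN}` },
  });
  if (!res.ok) throw new Error(`Failed to read secret: ${res.status}`);
  const { data = {} } = (await res.json()) as { data?: Record<string, string> };
  // Secret values arrive base64-encoded; decode them into plain strings.
  return Object.fromEntries(
    Object.entries(data).map(([key, value]) => [key, Buffer.from(value, "base64").toString("utf8")]),
  );
};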

Dockerfile

The compatibility layer is added in just one line:

FROM node:18-bullseye-slim

# Build your service here

# The Datadog Lambda extension is copied into /opt, where Lambda discovers extensions
COPY --from=public.ecr.aws/datadog/lambda-extension:latest /opt/. /opt/

Deployment

The built image is pushed to ECR, from where AWS Lambda pulls it during deployment.
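A typical push, substituting your own account ID and region, looks roughly like this (the tag should match what .Values.image points to in the Helm chart):

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
docker build -t <account>.dkr.ecr.us-east-1.amazonaws.com/service:1.0.0 .
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/service:1.0.0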

Summary

Using a Helm chart and a Docker container, we now have a deliverable that is automatically provisioned by the operator controller. We get versioning, Kubernetes compatibility, and the incredible speed of Lambda.

Yay :).