
Multi-Cloud IAF Backend CI/CD Pipeline — Setup & Usage Guide

Purpose: This guide helps any team set up and use the GitHub Actions CI/CD pipeline to build a Docker image, push it to a cloud container registry, deploy infrastructure services, and deploy the application to a managed Kubernetes cluster — across Azure (AKS), AWS (EKS), and GCP (GKE).


Table of Contents

  1. What This Pipeline Does
  2. How It Triggers
  3. Prerequisites
  4. Configure Self-Hosted Runner on a VM
  5. Customize Branch Name and Runner Labels in Pipeline
  6. GitHub Secrets & Variables Setup
  7. Pipeline env: Block — How Variables Are Mapped
  8. Dockerfile Setup
  9. Workflow File Setup
  10. Full Pipeline YAML — by Hyperscaler
  11. How Each Job Works
  12. Kubernetes Resources Created
  13. First vs Repeated Deployments
  14. Infrastructure Services Reference
  15. Troubleshooting
  16. Quick Checklist

1. What This Pipeline Does

Commit to GitHub  ──────────────────────────────────────────────────────────┐
    ↓                                                                        │
[GitHub Actions Triggered by push / workflow_dispatch]                       │
    ↓                                                                        │
Deploy Infrastructure Services                                               │
   (Elasticsearch, OpenTelemetry Collector, Phoenix, Grafana, Redis)        │
    ↓                                                                        │
Collect Infrastructure Service IPs                                           │
    ↓                                                                        │
Build Docker Image (tagged with Git commit SHA)                              │
    ↓                                                                        │
Push Image to Cloud Container Registry                                       │
    │                                                                        │
    ├── Azure  → Azure Container Registry (ACR)                             │
    ├── AWS    → Amazon Elastic Container Registry (ECR)                    │
    └── GCP    → Google Artifact Registry (GAR)                             │
    ↓                                                                        │
Deploy Application to Kubernetes Cluster                                     │
    │                                                                        │
    ├── Azure  → AKS (Azure Kubernetes Service)                             │
    ├── AWS    → EKS (Amazon Elastic Kubernetes Service)                    │
    └── GCP    → GKE (Google Kubernetes Engine)                             │
    │                                                                        │
    ├── First time?       → Create Deployment + LoadBalancer Service        │
    └── Already exists?  → Update image only ───────────────────────────────┘

Infrastructure services (Elasticsearch, OpenTelemetry Collector, Phoenix, Grafana, Redis) are deployed once and reused on subsequent runs — the pipeline checks if they already exist before applying.
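
The "deploy once, reuse afterwards" check described above can be sketched as a small shell helper. This is an illustrative sketch only: apply_if_absent is a hypothetical function name, and the resource kind, name, namespace, and manifest path in the example call are taken from the Azure template later in this guide.

```bash
#!/usr/bin/env bash
# Sketch of the "apply only if the resource is missing" check.
# apply_if_absent is a hypothetical helper name, not part of the pipeline.

# Usage: apply_if_absent <kind> <name> <namespace> <manifest>
apply_if_absent() {
  local kind="$1" name="$2" ns="$3" manifest="$4"
  if kubectl get "$kind" "$name" -n "$ns" > /dev/null 2>&1; then
    echo "INFO: $kind/$name already exists in $ns, skipping."
  else
    echo "INFO: $kind/$name not found in $ns, applying $manifest."
    kubectl apply -f "$manifest"
  fi
}

# Example call (requires a configured kubectl context):
#   apply_if_absent deployment redis iafv4 k8s/infra/azure/iaf-redis.yaml
```

One consequence of an existence check like this: edits to a manifest are not re-applied while the resource already exists in the cluster.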


2. How It Triggers

Trigger | When
Auto | Push to your configured branch (e.g., main, main-copy)
Manual | GitHub → Actions → Select workflow → Run workflow

Update the branch name in the workflow YAML to match your deployment branch before using.


3. Prerequisites

Make sure all of these exist before setting up:

Common (All Hyperscalers)

Requirement | Details
GitHub Repository | With Actions enabled
Self-hosted Runner | Linux/X64 machine registered in GitHub
Runner Tools | docker, cloud CLI, kubectl
Dockerfile | Present at the root of your repository
Kubernetes Cluster | Already provisioned and running
Container Registry | Already created in your cloud provider

Azure (AKS) Specific

Requirement | Details
Azure Subscription | With AKS and ACR access
Azure Container Registry (ACR) | Already created
AKS Cluster | Already provisioned
Service Principal | With Contributor or AcrPush + AKS access
kubelogin | Installed on the runner

AWS (EKS) Specific

Requirement | Details
AWS Account | With ECR and EKS access permissions
ECR Repository | Already created in AWS
EKS Cluster | Already provisioned and running
IAM User/Role | With AmazonEC2ContainerRegistryFullAccess + EKS deploy permissions
envsubst, nslookup | Installed on runner

GCP (GKE) Specific

Requirement | Details
GCP Project | With GKE and Artifact Registry enabled
GKE Cluster | Already provisioned and running
Artifact Registry Repository | Already created
Service Account | With roles/container.developer + roles/artifactregistry.writer
gcloud CLI | Installed on runner

Install Tools on the Runner (Ubuntu/Debian)

Common tools (all hyperscalers):

# Docker
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
# Log out and back in (or restart the runner service) so the new group membership takes effect

# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Azure-specific tools:

# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# kubelogin (az aks install-cli installs both kubectl and kubelogin)
sudo az aks install-cli

AWS-specific tools:

# AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt-get install -y unzip
unzip awscliv2.zip && sudo ./aws/install

# envsubst and nslookup
sudo apt-get install -y gettext dnsutils

GCP-specific tools:

# gcloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
source ~/.bashrc

# gke-gcloud-auth-plugin
gcloud components install gke-gcloud-auth-plugin
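
After installing, a quick check along the following lines can confirm that the tools are on the runner's PATH. This is a sketch, not part of the pipeline: check_tools is a hypothetical helper name, and the tool list in the example should be adjusted to your cloud.

```bash
#!/usr/bin/env bash
# Report which of the given commands are missing from PATH.
# check_tools is a hypothetical helper name used only for this example.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "ok:      $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  # Non-zero exit status if at least one tool was not found.
  return "$missing"
}

# Example: common tools plus the Azure-specific ones.
#   check_tools docker kubectl az kubelogin
```

Run it as the same user that runs the GitHub runner, since PATH can differ between users.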

4. Configure Self-Hosted Runner on a VM

A self-hosted runner is a machine (VM or physical server) that runs your GitHub Actions jobs. This pipeline requires a runner with the labels self-hosted, Linux, X64, and a custom label of your choice (e.g., lafbe, cli1, gke-runner).

Step 1 — Prepare Your VM

Use any Linux VM (AWS EC2, Azure VM, GCP Compute Engine, on-prem, etc.). Recommended specs:

Resource | Minimum
OS | Ubuntu 20.04 / 22.04 (64-bit)
CPU | 2 vCPUs
RAM | 4 GB
Disk | 30 GB

Step 2 — Register the Runner in GitHub

  1. Go to your GitHub Repository.
  2. Click Settings → Actions → Runners.
  3. Click New self-hosted runner.
  4. Select Linux as the operating system and X64 as the architecture.
  5. Run the commands shown by GitHub on your VM:
# Create a folder for the runner
mkdir actions-runner && cd actions-runner

# Download the latest runner package (use the exact URL GitHub shows you)
curl -o actions-runner-linux-x64.tar.gz -L https://github.com/actions/runner/releases/download/v<version>/actions-runner-linux-x64-<version>.tar.gz

# Extract the package
tar xzf ./actions-runner-linux-x64.tar.gz

# Configure the runner (use the exact token GitHub shows you)
./config.sh --url https://github.com/<your-org>/<your-repo> --token <YOUR_TOKEN>

Step 3 — Set Runner Labels

During the ./config.sh configuration step, GitHub will ask:

Enter the name of the runner group to add this runner to: [press Enter for Default]
Enter the name of runner: [your-runner-name]
This runner will have the following labels: 'self-hosted', 'Linux', 'X64'
Enter any additional labels (ex. label-1,label-2): [press Enter to skip]

When it asks for additional labels, enter a custom label that matches your workflow (e.g., lafbe for AKS, cli1 for EKS, gke-runner for GKE):

lafbe

This allows the pipeline to target this runner with:

runs-on: [self-hosted, Linux, X64, lafbe]

If you already registered the runner without the custom label, add it from: GitHub → Settings → Actions → Runners → Click your runner → Edit labels

Step 4 — Start the Runner

Option A: Run manually (for testing)

./run.sh

You will see:

√ Connected to GitHub
Listening for Jobs

Option B: Run as a system service (recommended for production)

# Install as a service
sudo ./svc.sh install

# Start the service
sudo ./svc.sh start

# Check service status
sudo ./svc.sh status

Running as a service ensures the runner starts automatically after a VM reboot.

Step 5 — Verify Runner is Online

Go to GitHub → Settings → Actions → Runners.

You should see your runner listed with status Idle (green dot):

✔ my-runner    Idle    self-hosted, Linux, X64, <your-label>

📷 Screenshot: Self-hosted Runner Online


5. Customize Branch Name and Runner Labels in Pipeline

How to Change the Branch Name

Open your workflow YAML file and find this section at the top:

on:
  push:
    branches:
      - main-copy        # ← CHANGE THIS to your branch name

Example: Multiple branches:

on:
  push:
    branches:
      - main
      - staging
      - production

How to Change the Self-Hosted Runner Labels

The runner labels are defined in the runs-on: field of each job:

jobs:
  deployInfra:
    runs-on: [self-hosted, Linux, X64, lafbe]   # ← CHANGE to your runner label

  buildImage:
    runs-on: [self-hosted, Linux, X64, lafbe]   # ← CHANGE to your runner label

  deploy:
    runs-on: [self-hosted, Linux, X64, lafbe]   # ← CHANGE to your runner label

The labels must exactly match what is registered on your self-hosted runner.

Label | Meaning
self-hosted | Use a self-hosted runner (not GitHub-hosted)
Linux | Runner OS is Linux
X64 | Runner architecture is 64-bit
lafbe / cli1 / gke-runner | Custom label to target your specific runner

All jobs must use the same runs-on value, otherwise they may run on different machines and lose state.
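
A small check like the one below can catch mismatched runs-on values before a push. It is a sketch under simple assumptions: check_runs_on is a hypothetical helper name, and it expects each runs-on value to be written on a single line, as in the examples above.

```bash
#!/usr/bin/env bash
# Verify that every job in a workflow file uses the same runs-on value.
# check_runs_on is a hypothetical helper name used only for this example.
check_runs_on() {
  local file="$1" values count
  # Keep only the value after "runs-on:", dropping trailing comments and spaces.
  values=$(grep -E '^[[:space:]]*runs-on:' "$file" \
    | sed -E 's/^[[:space:]]*runs-on:[[:space:]]*//; s/[[:space:]]*#.*$//; s/[[:space:]]+$//' \
    | sort -u)
  # Count distinct non-empty values.
  count=$(printf '%s\n' "$values" | grep -c . || true)
  if [ "$count" -eq 1 ]; then
    echo "OK: all jobs use runs-on: $values"
  else
    echo "WARNING: found $count distinct runs-on values:"
    printf '%s\n' "$values"
    return 1
  fi
}

# Example:
#   check_runs_on .github/workflows/deploy-aks.yml
```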


6. GitHub Secrets & Variables Setup

What are GitHub Secrets and GitHub Variables?

GitHub provides two built-in mechanisms to pass configuration and credentials into your pipeline without hardcoding them in YAML files.


GitHub Secrets

A GitHub Secret is an encrypted, sensitive value stored securely at the repository (or organization) level. It is designed for credentials, tokens, keys, and any information that must never be exposed publicly.

How it works:

  • You create a secret once in the GitHub UI.
  • GitHub encrypts it immediately — even repository admins cannot read it back after saving.
  • The pipeline reads it at runtime using ${{ secrets.SECRET_NAME }}.
  • In logs, GitHub automatically masks the value and replaces it with ***.

Example use cases:

  • Cloud credentials (AZURE_CLIENT_SECRET, AWS_SECRET_ACCESS_KEY, GCP_SA_KEY)
  • API keys and passwords
  • Full content of .env files

# How secrets are used inside a pipeline step
- name: Login to Azure
  run: |
    az login --service-principal \
      --username "${{ secrets.AZURE_CLIENT_ID }}" \
      --password "${{ secrets.AZURE_CLIENT_SECRET }}" \
      --tenant   "${{ secrets.AZURE_TENANT_ID }}"

GitHub Variables

A GitHub Variable is a plain-text, non-sensitive configuration value stored at the repository (or organization) level. It is meant for values that are safe to expose but that you still don't want to hardcode in YAML files, so you can change them in one place without editing the workflow.

How it works:

  • You create a variable once in the GitHub UI.
  • The value is stored as plain text — it is not encrypted and not masked in logs.
  • The pipeline reads it at runtime using ${{ vars.VARIABLE_NAME }}.

Example use cases:

  • Registry URLs (AZURE_CONTAINER_REGISTRY, ECR_REGISTRY)
  • Cluster names, resource groups, namespaces
  • Deployment names, container names
# How variables are used inside the env: block
env:
  CLUSTER_NAME: ${{ vars.CLUSTER_NAME }}
  NAMESPACE:    ${{ vars.NAMESPACE }}

Key Differences at a Glance

Feature | GitHub Secrets | GitHub Variables
Purpose | Sensitive credentials & keys | Non-sensitive configuration values
Encryption | Yes, encrypted at rest | No, stored as plain text
Visible in logs | No, masked as *** | Yes, appears in plain text
Readable by admins | No, cannot be read back after saving | Yes, visible in GitHub UI
YAML syntax | ${{ secrets.NAME }} | ${{ vars.NAME }}
GitHub UI location | Settings → Secrets and variables → Actions → Secrets tab | Settings → Secrets and variables → Actions → Variables tab
Examples | Client secret, SA key, .env file | ACR name, cluster name, namespace

How this pipeline uses both

┌─────────────────────────────────────────────────────────────┐
│                    GitHub Repository                        │
│                                                             │
│  ┌──────────────────────┐   ┌─────────────────────────┐    │
│  │   GitHub Variables   │   │    GitHub Secrets        │    │
│  │   (non-sensitive)    │   │    (sensitive)           │    │
│  │                      │   │                          │    │
│  │  CLUSTER_NAME        │   │  AZURE_CLIENT_SECRET     │    │
│  │  NAMESPACE           │   │  AWS_SECRET_ACCESS_KEY   │    │
│  │  ECR_REGISTRY  ...   │   │  GCP_SA_KEY  ...         │    │
│  └──────────┬───────────┘   └────────────┬────────────┘    │
│             │                            │                  │
│             └──────────────┬─────────────┘                  │
│                            ▼                               │
│                  ┌──────────────────┐                      │
│                  │  env: block in   │                      │
│                  │  workflow YAML   │                      │
│                  │  (central mapper)│                      │
│                  └────────┬─────────┘                      │
│                           ▼                               │
│              ${{ env.* }} used throughout pipeline        │
└─────────────────────────────────────────────────────────────┘

Rule of thumb: If you would be uncomfortable posting the value in a public chat message — use a Secret. If it's just a name or a URL — use a Variable.


The pipeline templates split configuration into two categories so that the workflow files themselves contain no sensitive values:

Category | GitHub UI Path | Used for | Syntax in YAML
Variables (non-sensitive) | Repo → Settings → Secrets and variables → Actions → Variables tab → New repository variable | Registry URLs, cluster names, namespaces, deploy names | ${{ vars.VAR_NAME }}
Secrets (sensitive) | Repo → Settings → Secrets and variables → Actions → Secrets tab → New repository secret | Passwords, tokens, keys, .env file content | ${{ secrets.SECRET_NAME }}

No credentials are hardcoded in the pipeline YAML: sensitive values come from GitHub Secrets and environment-specific names from GitHub Variables, so the workflow files are safe to commit to a public repository.

📷 Screenshot: GitHub Variables and Secrets Setup


Azure (AKS)

GitHub Variables (non-sensitive)

Go to: Repo → Settings → Secrets and variables → Actions → Variables tab → New repository variable

Variable Name | Description | Example Value
AZURE_CONTAINER_REGISTRY | ACR login server URL | myregistry.azurecr.io
RESOURCE_GROUP | Azure Resource Group containing AKS | my-resource-group
CLUSTER_NAME | AKS cluster name | aks-my-cluster
NAMESPACE | Kubernetes namespace for the application | my-app-namespace
DEPLOY_NAME | Name of the Kubernetes Deployment object | my-deployment
CONTAINER_NAME | Name of the container inside the pod | my-container
SHORT_NAME | Short label used for ACR image tagging | my-app

GitHub Secrets (sensitive)

Go to: Repo → Settings → Secrets and variables → Actions → Secrets tab → New repository secret

Secret Name | What to Put | Example
AZURE_CLIENT_ID | Service Principal Application (Client) ID | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_CLIENT_SECRET | Service Principal Client Secret | your-client-secret
AZURE_TENANT_ID | Azure Active Directory Tenant ID | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_SUBSCRIPTION_ID | Azure Subscription ID | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
APP_ENV_FILE | Full content of your application .env file | (see below)

AWS (EKS)

GitHub Variables (non-sensitive)

Variable Name | Description | Example Value
ECR_REGISTRY | ECR registry URL | 123456789012.dkr.ecr.us-east-1.amazonaws.com
ECR_REPOSITORY | ECR repository name | my-app-repo
EKS_CLUSTER_NAME | EKS cluster name | eks-my-cluster
NAMESPACE | Kubernetes namespace for the application | my-app-namespace
DEPLOY_NAME | Name of the Kubernetes Deployment object | my-deployment
CONTAINER_NAME | Name of the container inside the pod | my-container
SHORT_NAME | Short label used for ECR image tagging | my-app

GitHub Secrets (sensitive)

Secret Name | What to Put | Example
AWS_ACCESS_KEY_ID | IAM user Access Key ID | AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY | IAM user Secret Access Key | wJalrXUtnFEMI/K7MDENG/...
AWS_SESSION_TOKEN | Session token (only if using temporary/assumed-role credentials) | AQoDYXdzEJ...
AWS_REGION | AWS region where ECR and EKS are hosted | us-east-1
APP_ENV_FILE | Full content of your application .env file | (see below)

GCP (GKE)

GitHub Variables (non-sensitive)

Variable Name | Description | Example Value
GCP_PROJECT_ID | GCP Project ID | my-project-123456
GKE_CLUSTER_NAME | GKE cluster name | gke-my-cluster
GKE_ZONE | GKE cluster zone or region | us-central1-a
ARTIFACT_REGISTRY | Artifact Registry path | us-central1-docker.pkg.dev/my-project/my-repo
NAMESPACE | Kubernetes namespace for the application | my-app-namespace
DEPLOY_NAME | Name of the Kubernetes Deployment object | my-deployment
CONTAINER_NAME | Name of the container inside the pod | my-container
SHORT_NAME | Short label used for Artifact Registry image tagging | my-app

GitHub Secrets (sensitive)

Secret Name | What to Put | Example
GCP_SA_KEY | Base64-encoded Service Account JSON key | (retrieve from Vault; see below)
APP_ENV_FILE_GCP | Full content of your application .env file | (see below)

How to retrieve GCP_SA_KEY from HashiCorp Vault

The GCP Service Account JSON key is stored securely in the team's HashiCorp Vault. Follow these steps to retrieve it and add it as a GitHub secret.

Step 1 — Install Vault

Install the Vault software from your Company Portal (search for "Vault" or "HashiCorp Vault" in the portal and follow the installer).

Step 2 — Open a Command Prompt in the Vault folder and set environment variables

set VAULT_ADDR=https://<your-vault-server-address>
set VAULT_NAMESPACE=<your-vault-namespace>

Replace <your-vault-server-address> and <your-vault-namespace> with the values provided by your DevOps/platform team.

Step 3 — Authenticate using AppRole

vault write auth/approle/login role_id="<your-role-id>" secret_id="<your-secret-id>"

Replace <your-role-id> and <your-secret-id> with the AppRole credentials provided by your platform team. A token will be returned in the output.

Step 4 — Login with the token

Copy the token value from the output of Step 3, then run:

vault login <paste-your-token-here>

Step 5 — Retrieve the GCP Service Account JSON key

vault kv get <your-secret-path>/serviceAccountJson

Replace <your-secret-path> with the Vault path provided by your team (e.g., SA-vppcgcpaaa2022).

The JSON key content will be printed in the terminal.

Step 6 — Base64-encode the JSON key

Copy the JSON output and encode it:

On Linux (on macOS, omit -w 0 if the bundled base64 rejects it; its output is not wrapped by default):

echo '<paste-json-here>' | base64 -w 0

On Windows (PowerShell):

[Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes((Get-Content -Raw "service-account.json")))

Alternatively, save the JSON to a file first, then encode the file.
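
The file-based route can be sketched as follows, together with a round-trip check that the encoded string decodes back to the original key. The helper names encode_key and verify_key are hypothetical, and service-account.json is assumed to be the file you saved from Step 5.

```bash
#!/usr/bin/env bash
# encode_key prints the base64 form of a file on a single line.
# Stripping newlines with tr avoids relying on the -w flag, which not
# every base64 implementation accepts.
encode_key() {
  base64 < "$1" | tr -d '\n'
}

# verify_key checks that an encoded string decodes back to the given file.
# It returns 0 when the decoded bytes match the file exactly.
verify_key() {
  local file="$1" encoded="$2"
  printf '%s' "$encoded" | base64 --decode | cmp -s - "$file"
}

# Example:
#   ENCODED=$(encode_key service-account.json)
#   verify_key service-account.json "$ENCODED" && echo "round trip OK"
```

Encoding from a file also sidesteps shell quoting problems that can arise when pasting JSON containing backslashes into an echo command.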

Step 7 — Add the base64 value as a GitHub Secret

  1. Go to your GitHub repository → Settings → Secrets and variables → Actions.
  2. Click New repository secret.
  3. Name: GCP_SA_KEY
  4. Value: paste the base64-encoded string from Step 6.
  5. Click Add secret.

How to set the environment file secret

Copy the entire content of your backend .env file and paste it as the secret value:

DATABASE_URL=postgres://user:password@host:5432/dbname
SECRET_KEY=your-secret-key
DEBUG=False
ALLOWED_HOSTS=*
REDIS_URL=redis://localhost:6379

The pipeline reads this secret, writes it to a file, and converts it into a Kubernetes secret named app-env. Your pods then receive all these values as environment variables automatically.
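
The conversion described above can be sketched in two steps: clean the .env content, then apply it as a Secret. This is an illustrative sketch, not the pipeline's exact code. The function names normalize_env and create_app_env_secret are hypothetical; the secret name app-env comes from the text, and the file name and namespace in the example are placeholders.

```bash
#!/usr/bin/env bash
# normalize_env reads .env content on stdin and prints a cleaned version:
# leading whitespace, blank lines, and comment lines are removed, spaces
# around the first "=" are trimmed, surrounding double quotes are stripped,
# and only the first occurrence of each key is kept.
normalize_env() {
  sed -e 's/^[[:space:]]*//' \
      -e '/^$/d' \
      -e '/^#/d' \
      -e 's/[[:space:]]*=[[:space:]]*/=/' \
      -e 's/="\(.*\)"$/=\1/' \
  | awk -F= '!seen[$1]++'
}

# create_app_env_secret applies a cleaned env file as the Secret "app-env".
# Piping --dry-run=client output into kubectl apply makes the step repeatable:
# it creates the Secret on the first run and updates it on later runs.
create_app_env_secret() {
  local env_file="$1" namespace="$2"
  kubectl create secret generic app-env \
    --from-env-file="$env_file" \
    --namespace "$namespace" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example (requires a configured kubectl context):
#   normalize_env < .env > .env.k8s
#   create_app_env_secret .env.k8s my-app-namespace
```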

How to encode the GCP Service Account key

cat your-service-account-key.json | base64 -w 0

Paste the output as the value of the GCP_SA_KEY secret.


7. Pipeline env: Block — How Variables Are Mapped

The env: block at the top of each workflow YAML is a central mapper — it reads values from GitHub Variables (vars.*) and GitHub Secrets (secrets.*) and exposes them as environment variables used throughout the pipeline.

You do not hardcode any values in the YAML file. Set them once in GitHub UI (Section 6 above) and the pipeline picks them up automatically.

How the env: block works

GitHub Variables (vars.*)  ──┐
                              ├──► env: block (mapper) ──► ${{ env.* }} used everywhere in the pipeline
GitHub Secrets (secrets.*) ──┘
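
Because a GitHub Variable that has not been created evaluates to an empty string, a mapped env value can silently end up blank. A fail-fast check such as the following sketch could run as an early step. The name require_vars is hypothetical, and the variable list in the example is the Azure set from this guide.

```bash
#!/usr/bin/env bash
# require_vars fails if any named environment variable is unset or empty.
# require_vars is a hypothetical helper name used only for this example.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # printenv prints nothing for a variable that is unset or empty.
    if [ -z "$(printenv "$name")" ]; then
      echo "ERROR: $name is empty; set it under GitHub Variables or Secrets." >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example for the Azure template:
#   require_vars AZURE_CONTAINER_REGISTRY RESOURCE_GROUP CLUSTER_NAME \
#                NAMESPACE DEPLOY_NAME CONTAINER_NAME SHORT_NAME
```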

Azure (AKS) — env: block

env:
  # Non-sensitive — pulled from GitHub Variables
  AZURE_CONTAINER_REGISTRY: ${{ vars.AZURE_CONTAINER_REGISTRY }}
  RESOURCE_GROUP:            ${{ vars.RESOURCE_GROUP }}
  CLUSTER_NAME:              ${{ vars.CLUSTER_NAME }}
  NAMESPACE:                 ${{ vars.NAMESPACE }}
  DEPLOY_NAME:               ${{ vars.DEPLOY_NAME }}
  CONTAINER_NAME:            ${{ vars.CONTAINER_NAME }}
  SHORT_NAME:                ${{ vars.SHORT_NAME }}

  # Fixed infrastructure values — no change needed
  INFRA_NAMESPACE_1: 'kafkatest'
  INFRA_NAMESPACE_2: 'iafv4'
  INFRA_PATH: 'k8s/infra/azure'

Sensitive values (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, APP_ENV_FILE) are referenced directly from secrets.* inside individual pipeline steps — they do not appear in the env: block.

AWS (EKS) — env: block

env:
  # Non-sensitive — pulled from GitHub Variables
  ECR_REGISTRY:     ${{ vars.ECR_REGISTRY }}
  ECR_REPOSITORY:   ${{ vars.ECR_REPOSITORY }}
  EKS_CLUSTER_NAME: ${{ vars.EKS_CLUSTER_NAME }}
  NAMESPACE:        ${{ vars.NAMESPACE }}
  DEPLOY_NAME:      ${{ vars.DEPLOY_NAME }}
  CONTAINER_NAME:   ${{ vars.CONTAINER_NAME }}
  SHORT_NAME:       ${{ vars.SHORT_NAME }}

  # Fixed infrastructure values — no change needed
  INFRA_NAMESPACE_1: 'kafkatest'
  INFRA_NAMESPACE_2: 'iafv4'
  INFRA_PATH: 'k8s/infra/aws'

Sensitive values (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION, APP_ENV_FILE) are referenced directly from secrets.* inside individual pipeline steps.

GCP (GKE) — env: block

env:
  # Non-sensitive — pulled from GitHub Variables
  GCP_PROJECT_ID:    ${{ vars.GCP_PROJECT_ID }}
  GKE_CLUSTER_NAME:  ${{ vars.GKE_CLUSTER_NAME }}
  GKE_ZONE:          ${{ vars.GKE_ZONE }}
  ARTIFACT_REGISTRY: ${{ vars.ARTIFACT_REGISTRY }}
  NAMESPACE:         ${{ vars.NAMESPACE }}
  DEPLOY_NAME:       ${{ vars.DEPLOY_NAME }}
  CONTAINER_NAME:    ${{ vars.CONTAINER_NAME }}
  SHORT_NAME:        ${{ vars.SHORT_NAME }}

  # Fixed infrastructure values — no change needed
  INFRA_NAMESPACE_1: 'kafkatest'
  INFRA_NAMESPACE_2: 'iafv4'
  INFRA_PATH: 'k8s/infra/gcp'

Sensitive values (GCP_SA_KEY, APP_ENV_FILE_GCP) are referenced directly from secrets.* inside individual pipeline steps.

Fixed infra variables (same across all clouds)

Variable | Value | Purpose
INFRA_NAMESPACE_1 | kafkatest | K8s namespace for Elasticsearch, OTel, Phoenix, Grafana
INFRA_NAMESPACE_2 | iafv4 | K8s namespace for Redis
INFRA_PATH | k8s/infra/azure / k8s/infra/aws / k8s/infra/gcp | Points to cloud-specific infra manifests with pre-configured LB annotations

Change INFRA_NAMESPACE_1 and INFRA_NAMESPACE_2 only if your Kubernetes manifests use different namespace names. INFRA_PATH should not be changed — it is already set correctly per cloud template.


8. Dockerfile Setup

Your repository must have a Dockerfile at the root.

Minimal Dockerfile Example

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

With .env build argument (optional, for build-time config):

FROM python:3.11-slim

WORKDIR /app

ARG ENV_FILE=.env
COPY ${ENV_FILE} .env

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

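Note: The pipeline injects environment variables as a Kubernetes Secret at deploy time, so the .env approach above is optional and only needed when the app reads the file at container startup. Keep in mind that a .env file copied into the image is stored in its layers and is readable by anyone who can pull the image.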


9. Workflow File Setup

  1. In your repository, create the folder .github/workflows/ if it doesn't exist.
  2. Inside it, create a workflow YAML file matching your target cloud:
       • Azure: .github/workflows/deploy-aks.yml
       • AWS: .github/workflows/deploy-eks.yml
       • GCP: .github/workflows/deploy-gke.yml
  3. Paste the relevant YAML from Section 10 into that file.
  4. Update the env: block values and branch name to match your setup.
  5. Commit and push. The pipeline will trigger automatically.

Folder structure:

your-repo/
├── .github/
│   └── workflows/
│       ├── deploy-aks.yml    ← Azure
│       ├── deploy-eks.yml    ← AWS
│       └── deploy-gke.yml    ← GCP
├── k8s/
│   └── infra/
│       └── azure/                ← one folder per cloud: azure/, aws/, gcp/ (see INFRA_PATH)
│           ├── iaf-es.yaml       ← Elasticsearch
│           ├── iaf-otel.yaml     ← OpenTelemetry Collector
│           ├── iaf-arpx.yaml     ← Phoenix (Arize)
│           ├── iaf-grafana.yaml  ← Grafana
│           └── iaf-redis.yaml    ← Redis
├── Dockerfile
├── requirements.txt
└── ...
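
Before the first push, a layout check like this sketch can confirm that the files the pipeline expects are in place. The name check_layout is hypothetical; the manifest file names are the ones listed above, and the infra path is passed in so that it can match the INFRA_PATH of your cloud template.

```bash
#!/usr/bin/env bash
# check_layout verifies that the Dockerfile, the workflow file, and the five
# infra manifests exist, relative to the current directory.
# check_layout is a hypothetical helper name used only for this example.
# Usage: check_layout <workflow-file-name> <infra-path>
check_layout() {
  local workflow="$1" infra_path="$2" missing=0 f
  for f in Dockerfile ".github/workflows/$workflow" \
           "$infra_path/iaf-es.yaml" "$infra_path/iaf-otel.yaml" \
           "$infra_path/iaf-arpx.yaml" "$infra_path/iaf-grafana.yaml" \
           "$infra_path/iaf-redis.yaml"; do
    if [ ! -f "$f" ]; then
      echo "MISSING: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example, run from the repository root:
#   check_layout deploy-aks.yml k8s/infra/azure
```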

10. Full Pipeline YAML — by Hyperscaler

10a. Azure — AKS

name: Build and Deploy to AKS

on:
  push:
    branches:
      - main-copy            # ← Change to your branch
  workflow_dispatch:
    inputs:
      image_version:
        description: 'Manual version input'
        required: false
        default: 'latest'

env:
  # Non-sensitive values are read from GitHub Variables (see Section 6)
  AZURE_CONTAINER_REGISTRY: ${{ vars.AZURE_CONTAINER_REGISTRY }}
  RESOURCE_GROUP:            ${{ vars.RESOURCE_GROUP }}
  CLUSTER_NAME:              ${{ vars.CLUSTER_NAME }}
  NAMESPACE:                 ${{ vars.NAMESPACE }}
  DEPLOY_NAME:               ${{ vars.DEPLOY_NAME }}
  CONTAINER_NAME:            ${{ vars.CONTAINER_NAME }}
  SHORT_NAME:                ${{ vars.SHORT_NAME }}

jobs:

  deployInfra:
    name: Deploy Infrastructure Services
    runs-on: [self-hosted, Linux, X64, my-runner]    # ← Update runner label
    permissions:
      contents: read
    outputs:
      ELASTICSEARCH_IP:  ${{ steps.collect_ips.outputs.ELASTICSEARCH_IP }}
      OTEL_COLLECTOR_IP: ${{ steps.collect_ips.outputs.OTEL_COLLECTOR_IP }}
      PHOENIX_IP:        ${{ steps.collect_ips.outputs.PHOENIX_IP }}
      GRAFANA_IP:        ${{ steps.collect_ips.outputs.GRAFANA_IP }}
      REDIS_IP:          ${{ steps.collect_ips.outputs.REDIS_IP }}
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Azure login
        shell: bash
        env:
          REQUESTS_CA_BUNDLE: /etc/ssl/certs/ca-bundle.crt
          SSL_CERT_FILE: /etc/ssl/certs/ca-bundle.crt
        run: |
          az logout || true
          az account clear || true
          az login --service-principal \
            --username "${{ secrets.AZURE_CLIENT_ID }}" \
            --password "${{ secrets.AZURE_CLIENT_SECRET }}" \
            --tenant "${{ secrets.AZURE_TENANT_ID }}"
          az account set --subscription "${{ secrets.AZURE_SUBSCRIPTION_ID }}"

      - name: Set up kubelogin
        uses: azure/use-kubelogin@v1
        with:
          kubelogin-version: 'v0.0.25'

      - name: Get K8s context
        uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ env.RESOURCE_GROUP }}
          cluster-name: ${{ env.CLUSTER_NAME }}
          admin: 'false'
          use-kubelogin: 'true'

      - name: Ensure Infra Namespaces Exist
        run: |
          kubectl create namespace kafkatest --dry-run=client -o yaml | kubectl apply -f -
          kubectl create namespace iafv4    --dry-run=client -o yaml | kubectl apply -f -

      - name: Apply Elasticsearch
        run: |
          if kubectl get statefulset elasticsearch -n kafkatest > /dev/null 2>&1; then
            echo "INFO: Elasticsearch already exists — skipping."
          else
            kubectl apply -f k8s/infra/azure/iaf-es.yaml
          fi

      - name: Wait for Elasticsearch IP
        id: wait_es_ip
        run: |
          IP=""
          for i in $(seq 1 36); do
            IP=$(kubectl get svc elasticsearch -n kafkatest \
                  -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
            [ -n "$IP" ] && break
            echo "Attempt $i/36 — retrying in 10s..."
            sleep 10
          done
          [ -n "$IP" ] || { echo "ERROR: Could not get Elasticsearch IP."; exit 1; }
          echo "ELASTICSEARCH_IP=$IP" >> "$GITHUB_OUTPUT"

      - name: Apply OpenTelemetry Collector
        run: |
          ES_IP="${{ steps.wait_es_ip.outputs.ELASTICSEARCH_IP }}"
          if kubectl get deployment opentelemetry-collector -n kafkatest > /dev/null 2>&1; then
            echo "INFO: OpenTelemetry Collector already exists — skipping."
          else
            sed "s|ELASTICSEARCH_ENDPOINT|${ES_IP}|g" k8s/infra/azure/iaf-otel.yaml | kubectl apply -f -
          fi

      - name: Apply Phoenix (Arize)
        run: |
          kubectl get deployment phoenix -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/azure/iaf-arpx.yaml

      - name: Apply Grafana
        run: |
          kubectl get deployment grafana -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/azure/iaf-grafana.yaml

      - name: Apply Redis
        run: |
          kubectl get deployment redis -n iafv4 > /dev/null 2>&1 || kubectl apply -f k8s/infra/azure/iaf-redis.yaml

      - name: Wait for All Infra Service IPs
        id: collect_ips
        run: |
          wait_for_ip() {
            local SVC="$1" NS="$2" VAR="$3" IP=""
            for i in $(seq 1 36); do
              IP=$(kubectl get svc "$SVC" -n "$NS" \
                    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
              [ -n "$IP" ] && break
              echo "Attempt $i/36 — retrying in 10s..."
              sleep 10
            done
            echo "$VAR=${IP:-UNKNOWN}" >> "$GITHUB_OUTPUT"
          }
          wait_for_ip "elasticsearch"           "kafkatest" "ELASTICSEARCH_IP"
          wait_for_ip "opentelemetry-collector" "kafkatest" "OTEL_COLLECTOR_IP"
          wait_for_ip "phoenix"                 "kafkatest" "PHOENIX_IP"
          wait_for_ip "grafana"                 "kafkatest" "GRAFANA_IP"
          wait_for_ip "redis"                   "iafv4"     "REDIS_IP"

  buildImage:
    name: Build and Push Docker Image
    needs: deployInfra
    runs-on: [self-hosted, Linux, X64, my-runner]    # ← Update runner label
    permissions:
      contents: read
      id-token: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Azure login
        shell: bash
        env:
          REQUESTS_CA_BUNDLE: /etc/ssl/certs/ca-bundle.crt
          SSL_CERT_FILE: /etc/ssl/certs/ca-bundle.crt
        run: |
          az logout || true
          az account clear || true
          az login --service-principal \
            --username "${{ secrets.AZURE_CLIENT_ID }}" \
            --password "${{ secrets.AZURE_CLIENT_SECRET }}" \
            --tenant "${{ secrets.AZURE_TENANT_ID }}"
          az account set --subscription "${{ secrets.AZURE_SUBSCRIPTION_ID }}"

      - name: Build and Push Docker Image to ACR
        run: |
          ACR_IMG="${{ env.AZURE_CONTAINER_REGISTRY }}/${{ env.SHORT_NAME }}:${{ github.sha }}"
          unset HTTP_PROXY HTTPS_PROXY
          az acr login --name $(echo "${{ env.AZURE_CONTAINER_REGISTRY }}" | cut -d. -f1)
          docker build --shm-size=8g -t "$ACR_IMG" .
          docker push "$ACR_IMG"

  deploy:
    name: Deploy to AKS
    needs: [deployInfra, buildImage]
    runs-on: [self-hosted, Linux, X64, my-runner]    # ← Update runner label
    permissions:
      actions: read
      contents: read
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Azure login
        shell: bash
        env:
          REQUESTS_CA_BUNDLE: /etc/ssl/certs/ca-bundle.crt
          SSL_CERT_FILE: /etc/ssl/certs/ca-bundle.crt
        run: |
          az logout || true
          az account clear || true
          az login --service-principal \
            --username "${{ secrets.AZURE_CLIENT_ID }}" \
            --password "${{ secrets.AZURE_CLIENT_SECRET }}" \
            --tenant "${{ secrets.AZURE_TENANT_ID }}"
          az account set --subscription "${{ secrets.AZURE_SUBSCRIPTION_ID }}"

      - name: Set up kubelogin
        uses: azure/use-kubelogin@v1
        with:
          kubelogin-version: 'v0.0.25'

      - name: Get K8s context
        uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ env.RESOURCE_GROUP }}
          cluster-name: ${{ env.CLUSTER_NAME }}
          admin: 'false'
          use-kubelogin: 'true'

      - name: Ensure App Namespace Exists
        run: |
          kubectl create namespace ${{ env.NAMESPACE }} --dry-run=client -o yaml | kubectl apply -f -

      - name: Create/Update Kubernetes Secret (with Infra IPs injected)
        env:
          ENV_FILE_CONTENT: ${{ secrets.APP_ENV_FILE_AZURE }}
          ELASTICSEARCH_IP:  ${{ needs.deployInfra.outputs.ELASTICSEARCH_IP }}
          OTEL_COLLECTOR_IP: ${{ needs.deployInfra.outputs.OTEL_COLLECTOR_IP }}
          PHOENIX_IP:        ${{ needs.deployInfra.outputs.PHOENIX_IP }}
          GRAFANA_IP:        ${{ needs.deployInfra.outputs.GRAFANA_IP }}
          REDIS_IP:          ${{ needs.deployInfra.outputs.REDIS_IP }}
        run: |
          [ -z "$ENV_FILE_CONTENT" ] && { echo "ERROR: APP_ENV_FILE_AZURE is empty."; exit 1; }
          printf "%s" "$ENV_FILE_CONTENT" > .env.temp
          awk -F= '!seen[$1]++' .env.temp > .env.cleaned
          sed 's/^[[:space:]]*//; s/[[:space:]]*=[[:space:]]*/=/; s/="\(.*\)"$/=\1/; /^$/d; /^#/d' .env.cleaned > .env.k8s
          printf '%s\n' \
            "PHOENIX_COLLECTOR_ENDPOINT=http://${PHOENIX_IP}:6006" \
            "REDIS_HOST=${REDIS_IP}" \
            "REDIS_PORT=6379" \
            "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://${OTEL_COLLECTOR_IP}:4318/v1/traces" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_HTTP=http://${OTEL_COLLECTOR_IP}:4320/v1/logs" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_GRPC=${OTEL_COLLECTOR_IP}:4319" \
            "ELASTICSEARCH_URL=http://${ELASTICSEARCH_IP}:9200" \
            "GRAFANA_URL=http://${GRAFANA_IP}:3000" >> .env.k8s
          tac .env.k8s | awk -F= '!seen[$1]++' | tac > .env.final
          mv .env.final .env.k8s
          kubectl create secret generic app-env \
            --from-env-file=.env.k8s \
            -n ${{ env.NAMESPACE }} --dry-run=client -o yaml | kubectl apply -f -
          rm -f .env.temp .env.cleaned .env.k8s

      - name: Check if Deployment Exists and Deploy
        run: |
          VAR_IMAGE="${{ env.AZURE_CONTAINER_REGISTRY }}/${{ env.SHORT_NAME }}:${{ github.sha }}"
          if kubectl get deployment ${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} > /dev/null 2>&1; then
            echo "=== Deployment exists. Updating image ==="
            kubectl set image deployment/${{ env.DEPLOY_NAME }} \
              ${{ env.CONTAINER_NAME }}="$VAR_IMAGE" \
              -n ${{ env.NAMESPACE }}
          else
            echo "=== First-time deployment ==="
            cat <<EOF | kubectl apply -f -
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: ${{ env.DEPLOY_NAME }}
            template:
              metadata:
                labels:
                  app: ${{ env.DEPLOY_NAME }}
              spec:
                containers:
                  - name: ${{ env.CONTAINER_NAME }}
                    image: ${VAR_IMAGE}
                    ports:
                      - containerPort: 8000
                    envFrom:
                      - secretRef:
                          name: app-env
                    resources:
                      requests:
                        memory: "512Mi"
                        cpu: "250m"
                      limits:
                        memory: "2Gi"
                        cpu: "1000m"
          ---
          apiVersion: v1
          kind: Service
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
            annotations:
              service.beta.kubernetes.io/azure-load-balancer-internal: "true"
          spec:
            selector:
              app: ${{ env.DEPLOY_NAME }}
            ports:
              - protocol: TCP
                port: 8000
                targetPort: 8000
            type: LoadBalancer
          EOF
          fi

      - name: Wait and Check Deployment
        run: |
          kubectl rollout status deployment/${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} --timeout=5m
          kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }}
          kubectl get svc  -n ${{ env.NAMESPACE }} ${{ env.DEPLOY_NAME }}

      - name: Debug on Failure
        if: failure()
        run: |
          kubectl get pods -n ${{ env.NAMESPACE }} -o wide
          kubectl get events -n ${{ env.NAMESPACE }} --sort-by='.lastTimestamp' | tail -n 30
          POD=$(kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }} -o jsonpath="{.items[0].metadata.name}" 2>/dev/null || true)
          [ -n "$POD" ] && kubectl logs "$POD" -n ${{ env.NAMESPACE }} --tail=100 || true

10b. AWS — EKS

name: Build and Deploy to EKS

on:
  push:
    branches:
      - main                  # ← Change to your branch
  workflow_dispatch:
    inputs:
      image_version:
        description: 'Manual version input'
        required: false
        default: 'latest'

env:
  ECR_REGISTRY: 123456789012.dkr.ecr.us-east-1.amazonaws.com   # ← Update (your AWS account ID + region)
  ECR_REPOSITORY: my-ecr-repository                             # ← Update (e.g. my-ecr-repository)
  EKS_CLUSTER_NAME: my-eks-cluster                              # ← Update (e.g. my-eks-cluster)
  NAMESPACE: my-namespace                                        # ← Update (e.g. my-namespace)
  DEPLOY_NAME: my-deployment                                     # ← Update (e.g. my-deployment)
  CONTAINER_NAME: my-container                                   # ← Update (e.g. my-container)
  SHORT_NAME: my-app                                             # ← Update (e.g. my-app)

jobs:

  deployInfra:
    name: Deploy Infrastructure Services
    runs-on: [self-hosted, Linux, X64, my-runner]   # ← Update runner label
    permissions:
      contents: read
    outputs:
      ELASTICSEARCH_IP:  ${{ steps.collect_ips.outputs.ELASTICSEARCH_IP }}
      OTEL_COLLECTOR_IP: ${{ steps.collect_ips.outputs.OTEL_COLLECTOR_IP }}
      PHOENIX_IP:        ${{ steps.collect_ips.outputs.PHOENIX_IP }}
      GRAFANA_IP:        ${{ steps.collect_ips.outputs.GRAFANA_IP }}
      REDIS_IP:          ${{ steps.collect_ips.outputs.REDIS_IP }}
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Update kubeconfig for EKS
        run: |
          sudo aws eks update-kubeconfig --name ${{ env.EKS_CLUSTER_NAME }} --region ${{ secrets.AWS_REGION }}

      - name: Ensure Infra Namespaces Exist
        run: |
          kubectl create namespace kafkatest --dry-run=client -o yaml | kubectl apply -f -
          kubectl create namespace iafv4    --dry-run=client -o yaml | kubectl apply -f -

      - name: Apply Elasticsearch
        run: |
          if kubectl get statefulset elasticsearch -n kafkatest > /dev/null 2>&1; then
            echo "INFO: Elasticsearch already exists — skipping."
          else
            kubectl apply -f k8s/infra/aws/iaf-es.yaml
          fi

      - name: Wait for Elasticsearch Hostname
        id: wait_es_ip
        run: |
          EP=""
          for i in $(seq 1 36); do
            EP=$(kubectl get svc elasticsearch -n kafkatest \
                  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || true)
            [ -n "$EP" ] && break
            echo "Attempt $i/36 — retrying in 10s..."
            sleep 10
          done
          [ -n "$EP" ] || { echo "ERROR: Could not get Elasticsearch hostname."; exit 1; }
          echo "ELASTICSEARCH_IP=$EP" >> "$GITHUB_OUTPUT"

      - name: Apply OpenTelemetry Collector
        run: |
          ES_IP="${{ steps.wait_es_ip.outputs.ELASTICSEARCH_IP }}"
          if kubectl get deployment opentelemetry-collector -n kafkatest > /dev/null 2>&1; then
            echo "INFO: OpenTelemetry Collector already exists — skipping."
          else
            sed "s|ELASTICSEARCH_ENDPOINT|${ES_IP}|g" k8s/infra/aws/iaf-otel.yaml | kubectl apply -f -
          fi

      - name: Apply Phoenix (Arize)
        run: |
          kubectl get deployment phoenix -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/aws/iaf-arpx.yaml

      - name: Apply Grafana
        run: |
          kubectl get deployment grafana -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/aws/iaf-grafana.yaml

      - name: Apply Redis
        run: |
          kubectl get deployment redis -n iafv4 > /dev/null 2>&1 || kubectl apply -f k8s/infra/aws/iaf-redis.yaml

      - name: Wait for All Infra Service Hostnames
        id: collect_ips
        run: |
          wait_for_ep() {
            local SVC="$1" NS="$2" VAR="$3" EP=""
            for i in $(seq 1 36); do
              EP=$(kubectl get svc "$SVC" -n "$NS" \
                    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || true)
              [ -n "$EP" ] && break
              echo "Attempt $i/36 — retrying in 10s..."
              sleep 10
            done
            echo "$VAR=${EP:-UNKNOWN}" >> "$GITHUB_OUTPUT"
          }
          wait_for_ep "elasticsearch"           "kafkatest" "ELASTICSEARCH_IP"
          wait_for_ep "opentelemetry-collector" "kafkatest" "OTEL_COLLECTOR_IP"
          wait_for_ep "phoenix"                 "kafkatest" "PHOENIX_IP"
          wait_for_ep "grafana"                 "kafkatest" "GRAFANA_IP"
          wait_for_ep "redis"                   "iafv4"     "REDIS_IP"

  buildImage:
    name: Build and Push Docker Image
    needs: deployInfra
    runs-on: [self-hosted, Linux, X64, my-runner]   # ← Update runner label
    permissions:
      contents: read
      id-token: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
          aws-region: ${{ secrets.AWS_REGION }}

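      - name: Write .env file from secret
        # Pass the secret through env so quotes, $ or backticks in its content are not interpreted by the shell
        env:
          ENV_FILE_CONTENT: ${{ secrets.IAF_BACKEND_ENV }}
        run: printf '%s\n' "$ENV_FILE_CONTENT" > .env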

      - name: Build Docker image
        run: |
          sudo docker build \
            --build-arg ENV_FILE=.env \
            -t ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }} .

      - name: Login to AWS ECR
        run: |
          sudo aws ecr get-login-password --region "${{ secrets.AWS_REGION }}" | \
            sudo docker login --username AWS --password-stdin "${{ env.ECR_REGISTRY }}"

      - name: Push Docker image to ECR
        run: |
          sudo docker push ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}

  deploy:
    name: Deploy to EKS
    needs: [deployInfra, buildImage]
    runs-on: [self-hosted, Linux, X64, my-runner]   # ← Update runner label
    permissions:
      actions: read
      contents: read
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Update kubeconfig for EKS
        run: |
          sudo aws eks update-kubeconfig --name ${{ env.EKS_CLUSTER_NAME }} --region ${{ secrets.AWS_REGION }}

      - name: Ensure Namespace Exists
        run: |
          kubectl get namespace ${{ env.NAMESPACE }} > /dev/null 2>&1 || \
            kubectl create namespace ${{ env.NAMESPACE }}

      - name: Create/Update Kubernetes Secret (with Infra Endpoints injected)
        env:
          ENV_FILE_CONTENT: ${{ secrets.IAF_BACKEND_ENV }}
          ELASTICSEARCH_IP:  ${{ needs.deployInfra.outputs.ELASTICSEARCH_IP }}
          OTEL_COLLECTOR_IP: ${{ needs.deployInfra.outputs.OTEL_COLLECTOR_IP }}
          PHOENIX_IP:        ${{ needs.deployInfra.outputs.PHOENIX_IP }}
          GRAFANA_IP:        ${{ needs.deployInfra.outputs.GRAFANA_IP }}
          REDIS_IP:          ${{ needs.deployInfra.outputs.REDIS_IP }}
        run: |
          [ -z "$ENV_FILE_CONTENT" ] && { echo "ERROR: IAF_BACKEND_ENV is empty."; exit 1; }
          printf "%s" "$ENV_FILE_CONTENT" > .env.temp
          awk -F= '!seen[$1]++' .env.temp > .env.cleaned
          sed 's/^[[:space:]]*//; s/[[:space:]]*=[[:space:]]*/=/; s/="\(.*\)"$/=\1/; /^$/d; /^#/d' .env.cleaned > .env.k8s
          printf '%s\n' \
            "PHOENIX_COLLECTOR_ENDPOINT=http://${PHOENIX_IP}:6006" \
            "REDIS_HOST=${REDIS_IP}" \
            "REDIS_PORT=6379" \
            "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://${OTEL_COLLECTOR_IP}:4318/v1/traces" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_HTTP=http://${OTEL_COLLECTOR_IP}:4320/v1/logs" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_GRPC=${OTEL_COLLECTOR_IP}:4319" \
            "ELASTICSEARCH_URL=http://${ELASTICSEARCH_IP}:9200" \
            "GRAFANA_URL=http://${GRAFANA_IP}:3000" >> .env.k8s
          tac .env.k8s | awk -F= '!seen[$1]++' | tac > .env.final
          mv .env.final .env.k8s
          kubectl create secret generic app-env \
            --from-env-file=.env.k8s \
            -n ${{ env.NAMESPACE }} --dry-run=client -o yaml | kubectl apply -f -
          rm -f .env.temp .env.cleaned .env.k8s

      - name: Check if Deployment Exists and Deploy
        run: |
          VAR_IMAGE="${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}"
          export VAR_IMAGE
          if kubectl get deployment ${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} > /dev/null 2>&1; then
            echo "=== Deployment exists. Updating image ==="
            kubectl set image deployment/${{ env.DEPLOY_NAME }} \
              ${{ env.CONTAINER_NAME }}="$VAR_IMAGE" \
              -n ${{ env.NAMESPACE }}
          else
            echo "=== First-time deployment ==="
            envsubst <<'EOF' | kubectl apply -f -
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: ${{ env.DEPLOY_NAME }}
            template:
              metadata:
                labels:
                  app: ${{ env.DEPLOY_NAME }}
              spec:
                containers:
                  - name: ${{ env.CONTAINER_NAME }}
                    image: ${VAR_IMAGE}
                    ports:
                      - containerPort: 8000
                    envFrom:
                      - secretRef:
                          name: app-env
                    resources:
                      requests:
                        memory: "512Mi"
                        cpu: "250m"
                      limits:
                        memory: "2Gi"
                        cpu: "1000m"
          ---
          apiVersion: v1
          kind: Service
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-internal: "true"
              service.beta.kubernetes.io/load-balancer-source-ranges: "<ALLOWED_CIDR_RANGES>"
          spec:
            selector:
              app: ${{ env.DEPLOY_NAME }}
            ports:
              - protocol: TCP
                port: 8000
                targetPort: 8000
            type: LoadBalancer
          EOF
          fi

      - name: Wait and Check Deployment
        run: |
          kubectl rollout status deployment/${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} --timeout=5m
          kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }}
          kubectl get svc  -n ${{ env.NAMESPACE }} ${{ env.DEPLOY_NAME }}

      - name: Debug on Failure
        if: failure()
        run: |
          kubectl get pods -n ${{ env.NAMESPACE }} -o wide
          kubectl get events -n ${{ env.NAMESPACE }} --sort-by='.lastTimestamp' | tail -n 30
          POD=$(kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }} -o jsonpath="{.items[0].metadata.name}" 2>/dev/null || true)
          [ -n "$POD" ] && kubectl logs "$POD" -n ${{ env.NAMESPACE }} --tail=100 || true

10c. GCP — GKE

name: Build and Deploy to GKE

on:
  push:
    branches:
      - main                  # ← Change to your branch
  workflow_dispatch:
    inputs:
      image_version:
        description: 'Manual version input'
        required: false
        default: 'latest'

env:
  GAR_LOCATION: us-east1                           # ← Update (e.g. us-east1, europe-west1)
  GAR_REPOSITORY: my-gar-repository                # ← Update (e.g. my-gar-repository)
  GKE_CLUSTER: my-gke-cluster                      # ← Update (e.g. my-gke-cluster)
  GKE_ZONE: us-east1-b                             # ← Update (e.g. us-east1-b)
  NAMESPACE: my-namespace                          # ← Update (e.g. my-namespace)
  DEPLOY_NAME: my-deployment                       # ← Update (e.g. my-deployment)
  CONTAINER_NAME: my-container                     # ← Update (e.g. my-container)
  SHORT_NAME: my-app                               # ← Update (e.g. my-app)

jobs:

  deployInfra:
    name: Deploy Infrastructure Services
    runs-on: [self-hosted, Linux, X64, my-runner]  # ← Update runner label
    permissions:
      contents: read
    outputs:
      ELASTICSEARCH_IP:  ${{ steps.collect_ips.outputs.ELASTICSEARCH_IP }}
      OTEL_COLLECTOR_IP: ${{ steps.collect_ips.outputs.OTEL_COLLECTOR_IP }}
      PHOENIX_IP:        ${{ steps.collect_ips.outputs.PHOENIX_IP }}
      GRAFANA_IP:        ${{ steps.collect_ips.outputs.GRAFANA_IP }}
      REDIS_IP:          ${{ steps.collect_ips.outputs.REDIS_IP }}
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}
          project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Ensure Infra Namespaces Exist
        run: |
          kubectl create namespace kafkatest --dry-run=client -o yaml | kubectl apply -f -
          kubectl create namespace iafv4    --dry-run=client -o yaml | kubectl apply -f -

      - name: Apply Elasticsearch
        run: |
          if kubectl get statefulset elasticsearch -n kafkatest > /dev/null 2>&1; then
            echo "INFO: Elasticsearch already exists — skipping."
          else
            kubectl apply -f k8s/infra/gcp/iaf-es.yaml
          fi

      - name: Wait for Elasticsearch IP
        id: wait_es_ip
        run: |
          IP=""
          for i in $(seq 1 36); do
            IP=$(kubectl get svc elasticsearch -n kafkatest \
                  -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
            [ -n "$IP" ] && break
            echo "Attempt $i/36 — retrying in 10s..."
            sleep 10
          done
          [ -n "$IP" ] || { echo "ERROR: Could not get Elasticsearch IP."; exit 1; }
          echo "ELASTICSEARCH_IP=$IP" >> "$GITHUB_OUTPUT"

      - name: Apply OpenTelemetry Collector
        run: |
          ES_IP="${{ steps.wait_es_ip.outputs.ELASTICSEARCH_IP }}"
          if kubectl get deployment opentelemetry-collector -n kafkatest > /dev/null 2>&1; then
            echo "INFO: OpenTelemetry Collector already exists — skipping."
          else
            sed "s|ELASTICSEARCH_ENDPOINT|${ES_IP}|g" k8s/infra/gcp/iaf-otel.yaml | kubectl apply -f -
          fi

      - name: Apply Phoenix (Arize)
        run: |
          kubectl get deployment phoenix -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/gcp/iaf-arpx.yaml

      - name: Apply Grafana
        run: |
          kubectl get deployment grafana -n kafkatest > /dev/null 2>&1 || kubectl apply -f k8s/infra/gcp/iaf-grafana.yaml

      - name: Apply Redis
        run: |
          kubectl get deployment redis -n iafv4 > /dev/null 2>&1 || kubectl apply -f k8s/infra/gcp/iaf-redis.yaml

      - name: Wait for All Infra Service IPs
        id: collect_ips
        run: |
          wait_for_ip() {
            local SVC="$1" NS="$2" VAR="$3" IP=""
            for i in $(seq 1 36); do
              IP=$(kubectl get svc "$SVC" -n "$NS" \
                    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
              [ -n "$IP" ] && break
              echo "Attempt $i/36 — retrying in 10s..."
              sleep 10
            done
            echo "$VAR=${IP:-UNKNOWN}" >> "$GITHUB_OUTPUT"
          }
          wait_for_ip "elasticsearch"           "kafkatest" "ELASTICSEARCH_IP"
          wait_for_ip "opentelemetry-collector" "kafkatest" "OTEL_COLLECTOR_IP"
          wait_for_ip "phoenix"                 "kafkatest" "PHOENIX_IP"
          wait_for_ip "grafana"                 "kafkatest" "GRAFANA_IP"
          wait_for_ip "redis"                   "iafv4"     "REDIS_IP"

  buildImage:
    name: Build and Push Docker Image
    needs: deployInfra
    runs-on: [self-hosted, Linux, X64, my-runner]  # ← Update runner label
    permissions:
      contents: read
      id-token: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Configure Docker for Artifact Registry
        run: |
          gcloud auth configure-docker ${{ env.GAR_LOCATION }}-docker.pkg.dev --quiet

      - name: Build Docker image
        run: |
          IMAGE="${{ env.GAR_LOCATION }}-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/${{ env.GAR_REPOSITORY }}/${{ env.SHORT_NAME }}:${{ github.sha }}"
          docker build -t "$IMAGE" .
          echo "IMAGE=$IMAGE" >> "$GITHUB_ENV"

      - name: Push Docker image to Artifact Registry
        run: docker push "$IMAGE"

  deploy:
    name: Deploy to GKE
    needs: [deployInfra, buildImage]
    runs-on: [self-hosted, Linux, X64, my-runner]  # ← Update runner label
    permissions:
      actions: read
      contents: read
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}
          project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Ensure Namespace Exists
        run: |
          kubectl get namespace ${{ env.NAMESPACE }} > /dev/null 2>&1 || \
            kubectl create namespace ${{ env.NAMESPACE }}

      - name: Create/Update Kubernetes Secret (with Infra IPs injected)
        env:
          ENV_FILE_CONTENT: ${{ secrets.APP_ENV_FILE_GCP }}
          ELASTICSEARCH_IP:  ${{ needs.deployInfra.outputs.ELASTICSEARCH_IP }}
          OTEL_COLLECTOR_IP: ${{ needs.deployInfra.outputs.OTEL_COLLECTOR_IP }}
          PHOENIX_IP:        ${{ needs.deployInfra.outputs.PHOENIX_IP }}
          GRAFANA_IP:        ${{ needs.deployInfra.outputs.GRAFANA_IP }}
          REDIS_IP:          ${{ needs.deployInfra.outputs.REDIS_IP }}
        run: |
          [ -z "$ENV_FILE_CONTENT" ] && { echo "ERROR: APP_ENV_FILE_GCP is empty."; exit 1; }
          printf "%s" "$ENV_FILE_CONTENT" > .env.temp
          awk -F= '!seen[$1]++' .env.temp > .env.cleaned
          sed 's/^[[:space:]]*//; s/[[:space:]]*=[[:space:]]*/=/; s/="\(.*\)"$/=\1/; /^$/d; /^#/d' .env.cleaned > .env.k8s
          printf '%s\n' \
            "PHOENIX_COLLECTOR_ENDPOINT=http://${PHOENIX_IP}:6006" \
            "REDIS_HOST=${REDIS_IP}" \
            "REDIS_PORT=6379" \
            "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://${OTEL_COLLECTOR_IP}:4318/v1/traces" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_HTTP=http://${OTEL_COLLECTOR_IP}:4320/v1/logs" \
            "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_GRPC=${OTEL_COLLECTOR_IP}:4319" \
            "ELASTICSEARCH_URL=http://${ELASTICSEARCH_IP}:9200" \
            "GRAFANA_URL=http://${GRAFANA_IP}:3000" >> .env.k8s
          tac .env.k8s | awk -F= '!seen[$1]++' | tac > .env.final
          mv .env.final .env.k8s
          kubectl create secret generic app-env \
            --from-env-file=.env.k8s \
            -n ${{ env.NAMESPACE }} --dry-run=client -o yaml | kubectl apply -f -
          rm -f .env.temp .env.cleaned .env.k8s

      - name: Check if Deployment Exists and Deploy
        run: |
          IMAGE="${{ env.GAR_LOCATION }}-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/${{ env.GAR_REPOSITORY }}/${{ env.SHORT_NAME }}:${{ github.sha }}"
          export IMAGE
          if kubectl get deployment ${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} > /dev/null 2>&1; then
            echo "=== Deployment exists. Updating image ==="
            kubectl set image deployment/${{ env.DEPLOY_NAME }} \
              ${{ env.CONTAINER_NAME }}="$IMAGE" \
              -n ${{ env.NAMESPACE }}
          else
            echo "=== First-time deployment ==="
            envsubst <<'EOF' | kubectl apply -f -
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: ${{ env.DEPLOY_NAME }}
            template:
              metadata:
                labels:
                  app: ${{ env.DEPLOY_NAME }}
              spec:
                containers:
                  - name: ${{ env.CONTAINER_NAME }}
                    image: ${IMAGE}
                    ports:
                      - containerPort: 8000
                    envFrom:
                      - secretRef:
                          name: app-env
                    resources:
                      requests:
                        memory: "512Mi"
                        cpu: "250m"
                      limits:
                        memory: "2Gi"
                        cpu: "1000m"
          ---
          apiVersion: v1
          kind: Service
          metadata:
            name: ${{ env.DEPLOY_NAME }}
            namespace: ${{ env.NAMESPACE }}
            annotations:
              cloud.google.com/load-balancer-type: "Internal"
          spec:
            selector:
              app: ${{ env.DEPLOY_NAME }}
            ports:
              - protocol: TCP
                port: 8000
                targetPort: 8000
            type: LoadBalancer
          EOF
          fi

      - name: Wait and Check Deployment
        run: |
          kubectl rollout status deployment/${{ env.DEPLOY_NAME }} -n ${{ env.NAMESPACE }} --timeout=5m
          kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }}
          kubectl get svc  -n ${{ env.NAMESPACE }} ${{ env.DEPLOY_NAME }}

      - name: Debug on Failure
        if: failure()
        run: |
          kubectl get pods -n ${{ env.NAMESPACE }} -o wide
          kubectl get events -n ${{ env.NAMESPACE }} --sort-by='.lastTimestamp' | tail -n 30
          POD=$(kubectl get pods -n ${{ env.NAMESPACE }} -l app=${{ env.DEPLOY_NAME }} -o jsonpath="{.items[0].metadata.name}" 2>/dev/null || true)
          [ -n "$POD" ] && kubectl logs "$POD" -n ${{ env.NAMESPACE }} --tail=100 || true

11. How Each Job Works

Job 1 — deployInfra: Deploy Infrastructure Services (defined in the EKS and GKE pipelines in section 10 as well as AKS; the deploy job depends on its outputs)

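| Step | What Happens |
| --- | --- |
| Checkout code | Source code is pulled from GitHub |
| Cloud login | Authenticates to the cloud provider |
| Get K8s context | kubectl is configured to talk to the cluster |
| Ensure infra namespaces | Creates the kafkatest and iafv4 namespaces if they don't exist |
| Apply Elasticsearch | Deploys Elasticsearch (skipped if the StatefulSet already exists) |
| Wait for Elasticsearch IP | Polls every 10 seconds, up to 36 attempts (about 6 minutes), until the LoadBalancer address is assigned (an IP, or a hostname on EKS) |
| Apply OpenTelemetry Collector | Deploys the collector with the Elasticsearch address substituted into its config (skipped if already deployed) |
| Apply Phoenix (Arize) | Deploys the Phoenix observability platform (skipped if already deployed) |
| Apply Grafana | Deploys the Grafana dashboards (skipped if already deployed) |
| Apply Redis | Deploys the Redis cache (skipped if already deployed) |
| Collect all IPs | Waits for the LoadBalancer addresses of all infra services and publishes them as job outputs |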
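
The polling behaviour described in the table can be sketched as one reusable shell function. This is only an illustration under stated assumptions: `retry_until_output` and its argument order are hypothetical names, not part of the pipeline, and the pipeline itself inlines the loop in each step.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the pipeline): run a command up to MAX times,
# pausing INTERVAL seconds between attempts, and print the first non-empty
# output. Returns 1 if every attempt produced empty output.
retry_until_output() {
  local max="$1" interval="$2" out="" i
  shift 2
  for i in $(seq 1 "$max"); do
    # Errors from the command are ignored; only its output matters here.
    out=$("$@" 2>/dev/null || true)
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
  done
  return 1
}

# Example call, assuming kubectl already points at the cluster:
#   ES_IP=$(retry_until_output 36 10 kubectl get svc elasticsearch -n kafkatest \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || exit 1
```

With 36 attempts and a 10 second interval this gives the same six-minute budget as the pipeline. On EKS the jsonpath would read `hostname` instead of `ip`, as in the EKS workflow above.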

Job 2 — buildImage: Build and Push Docker Image

| Step | What Happens |
| --- | --- |
| Checkout code | Source code is pulled from GitHub |
| Cloud login / credentials | Authenticates to the cloud provider |
| Build Docker image | Image is built and tagged with github.sha |
| Login to registry | Docker authenticates to ACR / ECR / GAR |
| Push image | Image is pushed to the cloud container registry |
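
As a rough sketch of how the image references in this job are composed, the helpers below rebuild the three naming schemes used in the workflows. The function names and the sample values in the comments (`myregistry.azurecr.io`, `my-project`, `abc123`) are hypothetical; only the patterns come from the pipeline YAML.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the names are hypothetical, not part of the pipeline.

# ACR: "az acr login --name" takes the registry name, which is the first label
# of the login server (the AKS workflow derives it with cut -d. -f1).
acr_name() {
  printf '%s\n' "${1%%.*}"
}

# ACR and ECR image reference: <registry>/<repository>:<tag>
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Artifact Registry image reference:
# <location>-docker.pkg.dev/<project>/<repository>/<image>:<tag>
gar_image_ref() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example calls with placeholder values:
#   acr_name myregistry.azurecr.io
#   image_ref 123456789012.dkr.ecr.us-east-1.amazonaws.com my-ecr-repository abc123
#   gar_image_ref us-east1 my-project my-gar-repository my-app abc123
```

In the workflows the tag is always the full commit SHA from github.sha, so each pushed image maps back to exactly one commit.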

Job 3 — deploy: Deploy Application

| Step | What Happens |
| --- | --- |
| Checkout code | Source code is pulled again |
| Cloud login / credentials | Re-authenticates to the cloud provider |
| Get K8s context | kubectl is configured for the cluster |
| Ensure app namespace | Creates the namespace if it doesn't exist |
| Create/Update K8s Secret | The .env contents, plus the infra service endpoints from deployInfra, become a K8s Secret named app-env |
| Deploy | Creates the Deployment and Service on the first run; updates the container image on later runs |
| Wait for rollout | Waits up to 5 minutes for the Deployment to become ready |
| Verify application | Prints pod and Service status; if the job fails, prints recent events and the last 100 log lines |
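
The env-file cleanup in the Create/Update K8s Secret step can be tried locally before a secret is uploaded. The sketch below wraps the same awk, sed, and tac commands from the workflow in a function; the name `normalize_env` and the stdin/argument interface are assumptions made for this example, and it relies on GNU `tac` being available, as it is on the Linux runners used here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the cleanup commands from the deploy job.
# Reads a raw .env file on stdin. Any KEY=VALUE arguments are appended after
# the cleaned file, mirroring the endpoints the pipeline injects.
normalize_env() {
  {
    # Pass 1: keep the first occurrence of each key in the raw file, then trim
    # leading whitespace, tighten "KEY = value", strip surrounding double
    # quotes, and drop blank lines and comment lines.
    awk -F= '!seen[$1]++' |
      sed 's/^[[:space:]]*//; s/[[:space:]]*=[[:space:]]*/=/; s/="\(.*\)"$/=\1/; /^$/d; /^#/d'
    # Append injected entries, one per line.
    if [ "$#" -gt 0 ]; then
      printf '%s\n' "$@"
    fi
  } | tac | awk -F= '!seen[$1]++' | tac
  # Pass 2 (tac | awk | tac) keeps the LAST occurrence of each key, so the
  # appended entries override values that came from the file.
}

# Example:
#   normalize_env 'REDIS_HOST=10.0.0.9' < .env > .env.k8s
```

The two passes explain a behaviour that is easy to miss: among duplicates inside your own file the first one wins, but an endpoint injected by the pipeline always replaces a value with the same key from the file.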

12. Kubernetes Resources Created

On first deployment, two Kubernetes resources are created:

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOY_NAME>
  namespace: <NAMESPACE>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <DEPLOY_NAME>
  template:
    metadata:
      labels:
        app: <DEPLOY_NAME>    # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: <CONTAINER_NAME>
          image: <REGISTRY>/<REPO>:<git-sha>
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: app-env    # Populated from your env file secret
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"

Service (LoadBalancer)

apiVersion: v1
kind: Service
metadata:
  name: <DEPLOY_NAME>
  namespace: <NAMESPACE>
  annotations:
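    # Azure:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # AWS:
    # service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    # service.beta.kubernetes.io/load-balancer-source-ranges: "10.0.0.0/8"
    # GCP:
    # cloud.google.com/load-balancer-type: "Internal"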
spec:
  type: LoadBalancer
  selector:
    app: <DEPLOY_NAME>
  ports:
    - port: 8000
      targetPort: 8000

Note: The Azure annotation makes the LoadBalancer internal, so it receives a private IP only. The AWS source-ranges annotation limits which client CIDR ranges can reach the LoadBalancer. Remove or change these annotations if you need public internet access.
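
If one manifest template is shared across clouds, the annotation can be chosen by a small lookup. The sketch below is illustrative: `internal_lb_annotation` and the cloud keywords `azure`, `aws`, and `gcp` are hypothetical, while the annotation strings are the ones used in the three workflows in section 10.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the internal-load-balancer annotation line for a
# given cloud. The annotation keys are copied from the workflows in section 10.
internal_lb_annotation() {
  case "$1" in
    azure) echo 'service.beta.kubernetes.io/azure-load-balancer-internal: "true"' ;;
    aws)   echo 'service.beta.kubernetes.io/aws-load-balancer-internal: "true"' ;;
    gcp)   echo 'cloud.google.com/load-balancer-type: "Internal"' ;;
    *)
      echo "unknown cloud: $1" >&2
      return 1
      ;;
  esac
}

# Example:
#   internal_lb_annotation gcp
```

Failing on an unknown keyword, rather than printing nothing, avoids silently creating a public LoadBalancer when the cloud name is mistyped.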


13. First vs Repeated Deployments

| Situation | What the Pipeline Does |
| --- | --- |
| First time: Deployment not found in cluster | Creates Deployment + Service from scratch |
| Subsequent runs: Deployment already exists | Only updates the container image using kubectl set image |

This is detected automatically:

kubectl get deployment <DEPLOY_NAME> -n <NAMESPACE>
# → found     : update image only
# → not found : create full Deployment + Service

Similarly for infrastructure services:

kubectl get statefulset elasticsearch -n kafkatest
# → found     : skip (already running)
# → not found : apply the YAML manifest
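
Both checks rely on the exit status of `kubectl get`. The sketch below shows one way to express that branching; the function names `deploy_or_update` and `ensure_statefulset` and the manifest path arguments are hypothetical, and the real workflow may structure this differently.

```shell
#!/usr/bin/env bash
# Sketch of the create-or-update branching. Function names are hypothetical.
set -eu

# Update only the image if the Deployment exists; otherwise apply the full manifest.
deploy_or_update() {
  local namespace="$1" deploy_name="$2" container="$3" image="$4" manifest="$5"
  if kubectl get deployment "$deploy_name" -n "$namespace" >/dev/null 2>&1; then
    kubectl set image "deployment/$deploy_name" "$container=$image" -n "$namespace"
  else
    kubectl apply -f "$manifest"
  fi
}

# Apply an infrastructure manifest only when its StatefulSet is absent.
ensure_statefulset() {
  local namespace="$1" name="$2" manifest="$3"
  if kubectl get statefulset "$name" -n "$namespace" >/dev/null 2>&1; then
    echo "$name already running in $namespace; skipping"
  else
    kubectl apply -f "$manifest"
  fi
}
```

Note that any failure of `kubectl get`, including an authentication or RBAC error, takes the "not found" branch in this sketch, so a broken kubeconfig surfaces as a failed `kubectl apply` rather than as a lookup error.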

14. Infrastructure Services Reference

The deployInfra job manages these shared infrastructure services:

| Service | Namespace | Port(s) | Kubernetes Manifest | Purpose |
|---|---|---|---|---|
| Elasticsearch | kafkatest | 9200 | k8s/infra/iaf-es.yaml | Log storage and search backend |
| OpenTelemetry Collector | kafkatest | 4318 (HTTP), 4319 (gRPC), 4320 (logs) | k8s/infra/iaf-otel.yaml | Traces and log collection |
| Phoenix (Arize) | kafkatest | 6006 | k8s/infra/iaf-arpx.yaml | LLM observability platform |
| Grafana | kafkatest | 3000 | k8s/infra/iaf-grafana.yaml | Monitoring dashboards |
| Redis | iafv4 | 6379 | k8s/infra/iaf-redis.yaml | Caching and session storage |

The endpoints of these services are injected into the application pod as environment variables through the app-env Kubernetes Secret. The pipeline sets the following keys automatically:

PHOENIX_COLLECTOR_ENDPOINT=http://<PHOENIX_IP>:6006
REDIS_HOST=<REDIS_IP>
REDIS_PORT=6379
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<OTEL_IP>:4318/v1/traces
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_HTTP=http://<OTEL_IP>:4320/v1/logs
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_GRPC=<OTEL_IP>:4319
ELASTICSEARCH_URL=http://<ES_IP>:9200
GRAFANA_URL=http://<GRAFANA_IP>:3000
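
One way to produce these keys is to render them from the collected service IPs and append them to the env file before the Secret is created. The sketch below assumes the IPs are already known; `render_infra_env` and its argument order are hypothetical and not taken from the workflow.

```shell
#!/usr/bin/env bash
# Sketch: render the infrastructure keys in .env format.
# render_infra_env and its argument order are hypothetical.
set -eu

render_infra_env() {
  local phoenix_ip="$1" redis_ip="$2" otel_ip="$3" es_ip="$4" grafana_ip="$5"
  cat <<EOF
PHOENIX_COLLECTOR_ENDPOINT=http://${phoenix_ip}:6006
REDIS_HOST=${redis_ip}
REDIS_PORT=6379
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://${otel_ip}:4318/v1/traces
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_HTTP=http://${otel_ip}:4320/v1/logs
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT_GRPC=${otel_ip}:4319
ELASTICSEARCH_URL=http://${es_ip}:9200
GRAFANA_URL=http://${grafana_ip}:3000
EOF
}
```

A call such as `render_infra_env "$PHOENIX_IP" "$REDIS_IP" "$OTEL_IP" "$ES_IP" "$GRAFANA_IP" >> .env` would then extend the env file that feeds the app-env Secret.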

15. Troubleshooting

Common Errors — All Hyperscalers

| Error / Problem | Cause | Fix |
|---|---|---|
| app-env secret is empty | Env file secret not set in GitHub | Re-add the secret with full .env content |
| Pods CrashLoopBackOff | App error or missing env variable | Run `kubectl logs -n <NAMESPACE> -l app=<DEPLOY_NAME>` |
| ImagePullBackOff | Image not found or registry auth failed | Check image tag and registry login step |
| Namespace not found | Kubeconfig or RBAC issue | Verify kubeconfig and role bindings |
| envsubst: not found | gettext not installed | `sudo apt-get install -y gettext` |
| LoadBalancer stuck Pending | Cloud LB controller not installed | Install the appropriate cloud load balancer controller |

Azure (AKS) Specific

| Error / Problem | Cause | Fix |
|---|---|---|
| Azure login failed | Service principal credentials wrong or expired | Re-add AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID secrets |
| Cannot connect to AKS | Wrong cluster name or resource group | Verify CLUSTER_NAME and RESOURCE_GROUP env vars |
| ACR push access denied | Service principal missing AcrPush role | Assign AcrPush role to SP on the ACR resource |
| kubelogin: not found | kubelogin not installed on runner | Run `az aks install-cli` on the runner |
| Elasticsearch IP UNKNOWN | Service did not get a LoadBalancer IP in 6 min | Check Azure LB quota, or AKS node pool health |
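
The UNKNOWN value in the last row comes from a timed wait for the Service's external IP. A polling loop of that kind might look like the sketch below. The function name `wait_for_lb_ip`, the Service name used in the usage note, and the 36 x 10 s schedule (6 minutes in total) are assumptions chosen to match the timeout described above, not values read from the workflow.

```shell
#!/usr/bin/env bash
# Sketch: poll a Service until it reports a LoadBalancer IP.
# wait_for_lb_ip and the default 36 x 10 s schedule are assumptions.
set -eu

wait_for_lb_ip() {
  local namespace="$1" service="$2" attempts="${3:-36}" delay="${4:-10}"
  local ip="" i=0
  while [ "$i" -lt "$attempts" ]; do
    # An empty result means no IP has been assigned yet.
    ip="$(kubectl get service "$service" -n "$namespace" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)"
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "UNKNOWN"
  return 1
}
```

For example, `ES_IP="$(wait_for_lb_ip kafkatest elasticsearch || true)"` leaves `ES_IP` set to `UNKNOWN` on timeout. Load balancers on EKS typically report a hostname under `.status.loadBalancer.ingress[0].hostname` instead of an IP, so the jsonpath would need adjusting there.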

AWS (EKS) Specific

| Error / Problem | Cause | Fix |
|---|---|---|
| AWS credentials error | Secrets missing or session token expired | Re-add secrets in GitHub → Settings → Secrets |
| Cannot connect to EKS | Wrong cluster name or runner lacks IAM permissions | Verify EKS_CLUSTER_NAME and IAM role on runner |
| ECR push access denied | IAM user missing ECR push permission | Attach AmazonEC2ContainerRegistryFullAccess policy |
| nslookup: not found | dnsutils not installed | `sudo apt-get install -y dnsutils` |
| LoadBalancer stuck Pending | AWS Load Balancer Controller not installed | Install AWS LB Controller |

GCP (GKE) Specific

| Error / Problem | Cause | Fix |
|---|---|---|
| GCP auth failed | Service account key invalid or expired | Re-generate and re-add GCP_SA_KEY secret |
| Cannot connect to GKE | Wrong cluster name or zone | Verify GKE_CLUSTER and GKE_ZONE env vars |
| Artifact Registry push denied | SA missing artifactregistry.writer role | Assign roles/artifactregistry.writer to the SA |
| gke-gcloud-auth-plugin not found | Plugin not installed | `gcloud components install gke-gcloud-auth-plugin` |
| LoadBalancer stuck Pending | GCP quota or firewall rules blocking | Check GCP quotas and VPC firewall rules |

16. Quick Checklist

Use this checklist when setting up for a new repository.

Azure (AKS)

INFRASTRUCTURE
[ ] Azure Container Registry (ACR) created
[ ] AKS cluster running and reachable
[ ] Service Principal created with AcrPush + AKS deploy permissions

RUNNER SETUP (VM)
[ ] Linux VM provisioned (Ubuntu 20.04/22.04, 2 vCPU, 4GB RAM, 30GB disk)
[ ] docker installed on VM
[ ] az CLI installed on VM
[ ] kubelogin installed on VM (az aks install-cli)
[ ] kubectl installed on VM
[ ] Runner registered in GitHub → Settings → Actions → Runners
[ ] Runner labels set to: self-hosted, Linux, X64, <your-label>
[ ] Runner running as a system service (sudo ./svc.sh install && sudo ./svc.sh start)
[ ] Runner shows as Idle (green) in GitHub → Settings → Actions → Runners

GITHUB SECRETS
[ ] AZURE_CLIENT_ID added
[ ] AZURE_CLIENT_SECRET added
[ ] AZURE_TENANT_ID added
[ ] AZURE_SUBSCRIPTION_ID added
[ ] APP_ENV_FILE_AZURE added (full .env file content)

REPOSITORY SETUP
[ ] k8s/infra/ manifests present (iaf-es.yaml, iaf-otel.yaml, iaf-arpx.yaml, iaf-grafana.yaml, iaf-redis.yaml)
[ ] Dockerfile at root of repo
[ ] .github/workflows/deploy-aks.yml created
[ ] env: block in YAML updated (ACR, resource group, cluster, namespace)
[ ] Branch name updated in the workflow trigger
[ ] Code committed and pushed to trigger the pipeline

AWS (EKS)

INFRASTRUCTURE
[ ] ECR repository created in AWS
[ ] EKS cluster running and reachable
[ ] IAM user/role has ECR push + EKS deploy permissions
[ ] AWS Load Balancer Controller installed on EKS

RUNNER SETUP (VM)
[ ] Linux VM provisioned (Ubuntu 20.04/22.04, 2 vCPU, 4GB RAM, 30GB disk)
[ ] docker installed on VM
[ ] aws CLI installed on VM
[ ] kubectl installed on VM
[ ] envsubst installed  (sudo apt-get install -y gettext)
[ ] nslookup installed  (sudo apt-get install -y dnsutils)
[ ] Runner registered in GitHub → Settings → Actions → Runners
[ ] Runner labels set to: self-hosted, Linux, X64, <your-label>
[ ] Runner running as a system service
[ ] Runner shows as Idle (green)

GITHUB SECRETS
[ ] AWS_ACCESS_KEY_ID added
[ ] AWS_SECRET_ACCESS_KEY added
[ ] AWS_SESSION_TOKEN added (if using temporary credentials)
[ ] AWS_REGION added
[ ] IAF_BACKEND_ENV added (full .env file content)

REPOSITORY SETUP
[ ] Dockerfile at root of repo
[ ] .github/workflows/deploy-eks.yml created
[ ] env: block in YAML updated (ECR registry, EKS cluster, namespace)
[ ] Branch name updated in the workflow trigger
[ ] Code committed and pushed to trigger the pipeline

GCP (GKE)

INFRASTRUCTURE
[ ] GCP project with GKE and Artifact Registry APIs enabled
[ ] Artifact Registry repository created
[ ] GKE cluster running and reachable
[ ] Service Account created with container.developer + artifactregistry.writer roles

RUNNER SETUP (VM)
[ ] Linux VM provisioned (Ubuntu 20.04/22.04, 2 vCPU, 4GB RAM, 30GB disk)
[ ] docker installed on VM
[ ] gcloud CLI installed on VM
[ ] gke-gcloud-auth-plugin installed (gcloud components install gke-gcloud-auth-plugin)
[ ] kubectl installed on VM
[ ] envsubst installed  (sudo apt-get install -y gettext)
[ ] Runner registered in GitHub → Settings → Actions → Runners
[ ] Runner labels set to: self-hosted, Linux, X64, <your-label>
[ ] Runner running as a system service
[ ] Runner shows as Idle (green)

GITHUB SECRETS
[ ] GCP_SA_KEY added (base64-encoded service account JSON)
[ ] GCP_PROJECT_ID added
[ ] APP_ENV_FILE_GCP added (full .env file content)

REPOSITORY SETUP
[ ] Dockerfile at root of repo
[ ] .github/workflows/deploy-gke.yml created
[ ] env: block in YAML updated (GAR location, repository, GKE cluster, zone, namespace)
[ ] Branch name updated in the workflow trigger
[ ] Code committed and pushed to trigger the pipeline
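
The RUNNER SETUP items in the checklists above can be spot-checked on the VM with a small preflight script. The sketch below only tests whether each tool is on PATH; `missing_tools` is a hypothetical helper, and the tool lists in the comments are taken from the checklists.

```shell
#!/usr/bin/env bash
# Sketch: report required runner tools that are not on PATH.
# missing_tools is a hypothetical helper name.
set -eu

# Print each missing tool on its own line; return non-zero if any is missing.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Tool lists per checklist:
#   AKS: missing_tools docker az kubelogin kubectl
#   EKS: missing_tools docker aws kubectl envsubst nslookup
#   GKE: missing_tools docker gcloud gke-gcloud-auth-plugin kubectl envsubst
```

This confirms only that the binaries exist; it does not verify versions, cloud credentials, or that the runner service is registered and idle.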