Eedi/ml-mlops-templates
ML Ops Templates

Always reference a tagged release (or commit SHA) when invoking these reusable workflows so the scripts and workflows stay in lock-step with the version you audited.

Versioning & Release Workflow

  • Work from main; land changes via PRs so CI runs on the merge commit.
  • Once the changes you want to release are on main, push the branch first (git push origin main).
  • Create a lightweight tag using semver-style names (vX.Y.Z). Example: git tag v1.0.0.
  • Push the tag separately (git push origin v1.0.0) so consumers can pin workflows to the exact version.
  • Update release notes in GitHub (or a CHANGELOG) when the tag publishes, and notify downstream repos to bump their uses: ...@vX.Y.Z and templates_repo_ref pins.

Why lightweight tags? GitHub Actions has a known limitation where nested reusable workflows with relative paths (e.g., terraform_azure_ml_environment.yml calling read-yaml.yml) fail to resolve when referenced via annotated tags. Lightweight tags work correctly.
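The tagging steps above can be sketched as follows. The repo created here is a throwaway sandbox just to keep the example self-contained, and `v1.0.0` stands in for the real release version; in an actual release you would run the tag/push steps against main.

```shell
# Self-contained sketch of the lightweight-tag release flow.
repo=$(mktemp -d) && cd "$repo"
git init -q .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "release"

git tag v1.0.0                 # lightweight: no -a/-m flags, so no annotation object is created
# git push origin main         # in a real repo: push the branch first
# git push origin v1.0.0       # then push the tag so consumers can pin @v1.0.0

git cat-file -t v1.0.0         # prints "commit" - a lightweight tag points straight at the commit
```

An annotated tag (`git tag -a`) would make `git cat-file -t` print `tag` instead of `commit` — that extra indirection is exactly what trips up nested reusable workflow resolution.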

Building an image

Use build.yml in a GitHub workflow in the ML project, for example:

build-dev:
    needs: pre-commit
    if: github.ref == 'refs/heads/prime/deploy'
    uses: Eedi/ml-mlops-templates/.github/workflows/build.yml@vX.Y.Z
    with:
        environment: anet-dev
        environment_config_file: config-infra-anet-dev.yml
        ml_env_name: ml-azua-litserve-env # must match the environment name in the deployment config
        ml_env_description: "Environment for ml-azua serving"
        target_layer: litserve
        tags: "'team=data-science' 'repo=ml-azua' 'model=dynamic-vae'"
        maximise_disk_space: false
    secrets: inherit
    permissions:
        id-token: write
        contents: read

Arguments:

  • environment: Name of the GitHub environment.
  • environment_config_file: Name of the project's environment config file.
  • ml_env_name: Name of the Azure ML environment resource. Must match the environment name in the deployment config YAML.
  • ml_env_description: Description for the Azure ML environment resource.
  • tags: Tags for the Azure ML environment resource.
  • target_layer: Docker layer to build. Inference layers should be lightweight; dev Python dependencies should be segregated using Docker layering and Poetry dependency groups.
  • maximise_disk_space: Allows you to build larger images by removing pre-installed resources from the GitHub runner, at the cost of a slightly longer run time. Inference layers shouldn't need this.
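As a hypothetical illustration of the layering that target_layer selects between — the stage names and Poetry groups below are invented for the sketch, not the repo's actual Dockerfile:

```dockerfile
# Hypothetical multi-stage Dockerfile; stage names correspond to target_layer values.
FROM python:3.11-slim AS base
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN pip install poetry

# Lightweight inference layer: runtime dependencies only.
FROM base AS litserve
RUN poetry install --only main

# Heavier dev/training layer: adds the dev Poetry group.
FROM base AS dev
RUN poetry install --with dev
```

Building with target_layer: litserve then stops at the inference stage and never pulls in the dev dependencies.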

Deployment to an Azure ML Realtime Endpoint

Use deploy_realtime.yml in a GitHub workflow in the ML project, for example:

deploy-nbq-dev:
    needs: build-dev
    uses: Eedi/ml-mlops-templates/.github/workflows/deploy_realtime.yml@vX.Y.Z
    with:
        environment: anet-dev
        environment_config_file: config-infra-anet-dev.yml
        deploy: false
        run_load_test: false
        load_test_config: ./mlops/azureml/configs/project_prime/load_test_nbq.yml
        load_test_name: nbq-allcandidates_alltargets_10x # must change if test plan changes
        endpoint_name: anet-ep-dev
        deployment_name: anet-eedi
        endpoint_config: ./mlops/azureml/configs/project_prime/endpoint_dev.yml
        deployment_config: ./mlops/azureml/configs/project_prime/deployment_dev.yml
    secrets: inherit
    permissions:
        id-token: write
        contents: read

Required Arguments:

  • environment: Name of the GitHub environment.
  • environment_config_file: Name of the project's environment config file.
  • deploy: Whether to update the endpoint and deploy (including any shadow deployment), or just to update endpoint resources, e.g. diagnostic settings, load test, alert rules.
  • run_load_test: true if a load test run is required.
  • endpoint_name: Name of the endpoint resource.
  • endpoint_config: Path to the config file of the endpoint resource.
  • deployment_name: Name of the primary deployment.
  • deployment_config: Path to the config file of the primary deployment.

Optional Arguments:

  • load_test_config: Config file path for the load test.
  • load_test_name: Must be changed if the test plan (e.g. the locustfile) changes.
  • shadow_deployment_name: deployment_name of the shadow deployment.
  • shadow_deployment_config: deployment_config path of the shadow deployment.
  • shadow_deploment_mirror_percentage: % of traffic to mirror to the shadow deployment.
  • shadow_deployment_traffic_percentage: % of live traffic to send to the shadow deployment. Can't be set when creating/updating a shadow deployment. The traffic % to the primary deployment will be set to 100 - shadow_deployment_traffic_percentage.

Controlling deployment behaviour:

  • Deploying a primary deployment:
    • Unset any shadow deployment vars
  • Updating diagnostic settings or alerts without deploying
    • Set deploy to false
  • Running a load test:
    • Set run_load_test to true
  • Shadow deployments:
    • Creating or updating a shadow deployment
      • Define shadow_deployment_config and shadow_deploment_mirror_percentage
    • Routing live traffic to a shadow deployment
      • Define shadow_deployment_traffic_percentage
    • Updating a shadow deployment and routing live traffic to it in one go
      • Don't. Test first.
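The two shadow-deployment steps above can be sketched as two successive workflow runs. The names and paths below are illustrative placeholders, not real project files, and only the shadow-related inputs are shown changing between runs:

```yaml
# Run 1 - create the shadow deployment and mirror some traffic to it.
with:
    environment: anet-prod
    environment_config_file: config-infra-anet-prod.yml
    deploy: true
    endpoint_name: anet-ep-prod
    deployment_name: anet-eedi                      # existing primary deployment
    endpoint_config: ./configs/endpoint_prod.yml
    deployment_config: ./configs/deployment_prod.yml
    shadow_deployment_name: anet-eedi-v2
    shadow_deployment_config: ./configs/deployment_v2.yml
    shadow_deploment_mirror_percentage: 10

# Run 2 - after testing the shadow deployment, route live traffic to it.
# Drop the mirror percentage and set only the traffic split:
#     shadow_deployment_name: anet-eedi-v2
#     shadow_deployment_traffic_percentage: 50      # primary gets the remaining 50
```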

Shadow Deployment Run Book

  • Create new deployment config. Don't give the deployment the name of xxx_shadow or similar - it will become the primary deployment when 100% of the live traffic is routed to it!
  • Deploy and load test on dev as usual
  • Then with an existing deployment already on a production endpoint...
  • Run workflow to do shadow deployment and mirror some traffic to it
  • Test the shadow deployment. Check its stats from the mirrored data.
  • Run workflow again to update traffic %s
  • Once the former shadow deployment is at 100% live traffic, the old deployment can be deleted. Currently a manual process.
  • Cleanup:
    • Update the primary deployment name to be the same as the former shadow deployment name
    • Remove the redundant config.
    • Remove the shadow deployment params from the workflow

In short:

  • Run once to create shadow deployment
  • Test
  • Run again to route live traffic to it

For updating traffic to 100% wouldn't it be better to simply change the primary deployment?

No, it causes an outage.

Does the primary deployment update if a shadow deployment is defined?

No.

How do load testing, logs, alerts work with shadow deployments?

  • Load testing is run against the endpoint. It will use whatever traffic routing is currently defined.
  • Alert rules are defined at the endpoint level.
  • Diagnostic settings (for logging) are defined at the endpoint level.

To make updates to endpoint level resources, no shadow deployment variables should be defined.

How are container images managed for shadow deployments?

Assuming the shadow deployment config points at .azurecr.io/envname:latest, it will simply use the latest environment image.
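For orientation, a deployment config of that shape might look like the following minimal Azure ML managed online deployment sketch. The names, image path, and instance sizing are invented placeholders; check the project's real deployment config and the Azure ML online deployment schema for the full set of fields.

```yaml
# Hypothetical deployment config; an image tagged :latest is resolved at deploy
# time, so each deployment picks up the most recent environment build.
name: anet-eedi-v2
endpoint_name: anet-ep-prod
environment:
  image: myregistry.azurecr.io/ml-azua-litserve-env:latest
instance_type: Standard_DS3_v2
instance_count: 1
```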

Terraform infrastructure

The reusable terraform_azure_ml_environment.yml workflow handles planning and applying Terraform changes:

jobs:
  terraform-dev:
    uses: Eedi/ml-mlops-templates/.github/workflows/terraform_azure_ml_environment.yml@vX.Y.Z
    with:
      environment: anet-dev
      environment_config_file: config-infra-anet-dev.yml
      apply: true   # switch to false to run plan-only
    secrets: inherit
    permissions:
      id-token: write
      contents: read

Inputs:

  • environment – GitHub environment that should receive the deployment.
  • environment_config_file – Path to the environment YAML consumed by read-yaml.yml.
  • terraform_directory – Path (relative to the workspace) where the Terraform code lives; defaults to ml-mlops-templates/terraform/ml-environment.
  • apply – Set to true to run terraform apply after a successful plan.
  • import_script – Optional helper (defaults to ../../scripts/terraform/import_unmanaged_resources.sh relative to the Terraform directory) run before terraform apply. Override with a path relative to the Terraform working directory if your project needs extra imports.
  • repo_name – Derived automatically from github.event.repository.name and injected via TF_VAR_repo_name for tagging.

The workflow expects the calling repository to provide the same Azure credentials used by the training and deployment pipelines (ARM_CLIENT_ID, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID, ARM_BACKEND_*, etc.).

Training pipelines

train_pipeline.yml submits a materialised Azure ML pipeline definition:

jobs:
  train-prime:
    uses: Eedi/ml-mlops-templates/.github/workflows/train_pipeline.yml@vX.Y.Z
    with:
      environment: anet-dev
      environment_config_file: config-infra-anet-dev.yml
      pipeline_config_path: configs/materialised/pipelines/train_pipeline_eedi_dev.yml
      run_build: true
      ml_env_name: ml-azua-train-env
      ml_env_description: "Environment for ml-azua training"
      tags: "'team=data-science' 'repo=ml-azua'"
      maximise_disk_space: true

Inputs:

  • environment, environment_config_file – Same contract as build.yml / terraform_azure_ml_environment.yml.
  • pipeline_config_path – Path inside the calling repository to the rendered pipeline YAML.
  • run_build – Rebuild the training environment before submission. When true, ml_env_name, ml_env_description, target_layer, tags, and maximise_disk_space are forwarded to build.yml.

The workflow reuses build.yml when requested, then calls scripts/training/train_pipeline.sh to execute az ml job create.
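For orientation, the rendered file that pipeline_config_path points at is a standard Azure ML pipeline job definition. A minimal hypothetical example follows; the compute name, command, and paths are invented for the sketch:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: train-prime
compute: azureml:cpu-cluster
jobs:
  train:
    type: command
    command: python train.py --epochs 10
    code: ./src
    environment: azureml:ml-azua-train-env@latest
```

Submitting it by hand would amount to az ml job create --file <pipeline_config_path>, which is what the workflow's script does on your behalf.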

About

Azure MLOps (v2) solution accelerators: enterprise-ready templates to deploy your machine learning models on the Azure platform.
