Eedi/ml-mlops-templates
ML Ops Templates

Always reference a tagged release (or commit SHA) when invoking these reusable workflows so the scripts and workflows stay in lock-step with the version you audited.

Versioning & Release Workflow

  • Work from main; land changes via PRs so CI runs on the merge commit.
  • Once the changes you want to release are on main, push the branch first (git push origin main).
  • Create a lightweight tag using semver-style names (vX.Y.Z). Example: git tag v1.0.0.
  • Push the tag separately (git push origin v1.0.0) so consumers can pin workflows to the exact version.
  • Update release notes in GitHub (or a CHANGELOG) when the tag publishes, and notify downstream repos to bump their uses: ...@vX.Y.Z and templates_repo_ref pins.

Why lightweight tags? GitHub Actions has a known limitation where nested reusable workflows with relative paths (e.g., terraform_azure_ml_environment.yml calling read-yaml.yml) fail to resolve when referenced via annotated tags. Lightweight tags work correctly.
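The tagging steps above can be sketched as follows. The repo created here is a throwaway sandbox just to keep the example self-contained, and `v1.0.0` stands in for the real release version; in an actual release you would run the tag/push steps against main.

```shell
# Self-contained sketch of the lightweight-tag release flow.
repo=$(mktemp -d) && cd "$repo"
git init -q .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "release"

git tag v1.0.0                 # lightweight: no -a/-m flags, so no annotation object is created
# git push origin main         # in a real repo: push the branch first
# git push origin v1.0.0       # then push the tag so consumers can pin @v1.0.0

git cat-file -t v1.0.0         # prints "commit" - a lightweight tag points straight at the commit
```

An annotated tag (`git tag -a`) would make `git cat-file -t` print `tag` instead of `commit` — that extra indirection is exactly what trips up nested reusable workflow resolution.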

Building an image

Use build.yml in a GitHub workflow in the ML project, for example:

build-dev:
    needs: pre-commit
    if: github.ref == 'refs/heads/prime/deploy'
    uses: Eedi/ml-mlops-templates/.github/workflows/build.yml@vX.Y.Z
    with:
        environment: anet-dev
        environment_config_file: config-infra-anet-dev.yml
        ml_env_name: ml-azua-litserve-env # must match the environment name in the deployment config
        ml_env_description: "Environment for ml-azua serving"
        target_layer: litserve
        tags: "'team=data-science' 'repo=ml-azua' 'model=dynamic-vae'"
        maximise_disk_space: false
    secrets: inherit
    permissions:
        id-token: write
        contents: read

Arguments:

  • environment: Name of the GitHub environment.
  • environment_config_file: Name of the project's environment config file.
  • ml_env_name: Name of the Azure ML environment resource. Must match the environment name in the deployment config YAML.
  • ml_env_description: Description for the Azure ML environment resource.
  • tags: Tags for the Azure ML environment resource.
  • target_layer: Docker layer to build. Inference layers should be lightweight; dev Python dependencies should be segregated using Docker layering and Poetry dependency groups.
  • maximise_disk_space: Allows you to build larger images by removing pre-installed resources from the GitHub runner, at the cost of a slightly longer run time. Inference layers shouldn't need this.
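As a hypothetical illustration of the layering that target_layer selects between — the stage names and Poetry groups below are invented for the sketch, not the repo's actual Dockerfile:

```dockerfile
# Hypothetical multi-stage Dockerfile; stage names correspond to target_layer values.
FROM python:3.11-slim AS base
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN pip install poetry

# Lightweight inference layer: runtime dependencies only.
FROM base AS litserve
RUN poetry install --only main

# Heavier dev/training layer: adds the dev Poetry group.
FROM base AS dev
RUN poetry install --with dev
```

Building with target_layer: litserve then stops at the inference stage and never pulls in the dev dependencies.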

Deployment to an Azure ML Realtime Endpoint

Use deploy_realtime.yml in a GitHub workflow in the ML project, for example:

deploy-nbq-dev:
    needs: build-dev
    uses: Eedi/ml-mlops-templates/.github/workflows/deploy_realtime.yml@vX.Y.Z
    with:
        environment: anet-dev
        environment_config_file: config-infra-anet-dev.yml
        deploy: false
        run_load_test: false
        load_test_config: ./mlops/azureml/configs/project_prime/load_test_nbq.yml
        load_test_name: nbq-allcandidates_alltargets_10x # must change if test plan changes
        endpoint_name: anet-ep-dev
        deployment_name: anet-eedi
        endpoint_config: ./mlops/azureml/configs/project_prime/endpoint_dev.yml
        deployment_config: ./mlops/azureml/configs/project_prime/deployment_dev.yml
    secrets: inherit
    permissions:
        id-token: write
        contents: read

Required Arguments:

  • environment: Name of the GitHub environment.
  • environment_config_file: Name of the project's environment config file.
  • deploy: Whether to update the endpoint and deploy (including any shadow deployment), or just to update endpoint resources, e.g. diagnostic settings, load test, alert rules.
  • run_load_test: true if a load test run is required.
  • endpoint_name: Name of the endpoint resource.
  • endpoint_config: Path to the config file of the endpoint resource.
  • deployment_name: Name of the primary deployment.
  • deployment_config: Path to the config file of the primary deployment.

Optional Arguments:

  • load_test_config: Config file path for the load test.
  • load_test_name: Must be changed if the test plan (e.g. the locustfile) changes.
  • shadow_deployment_name: deployment_name of the shadow deployment.
  • shadow_deployment_config: deployment_config path of the shadow deployment.
  • shadow_deploment_mirror_percentage: % of traffic to mirror to the shadow deployment.
  • shadow_deployment_traffic_percentage: % of live traffic to send to the shadow deployment. Can't be set when creating/updating a shadow deployment. The traffic % to the primary deployment will be set to 100 - shadow_deployment_traffic_percentage.

Controlling deployment behaviour:

  • Deploying a primary deployment:
    • Unset any shadow deployment vars
  • Updating diagnostic settings or alerts without deploying
    • Set deploy to false
  • Running a load test:
    • Set run_load_test to true
  • Shadow deployments:
    • Creating or updating a shadow deployment
      • Define shadow_deployment_config and shadow_deploment_mirror_percentage
    • Routing live traffic to a shadow deployment
      • Define shadow_deployment_traffic_percentage
    • Updating a shadow deployment and routing live traffic to it in one go
      • Don't. Test first.
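The two shadow-deployment steps above can be sketched as two successive workflow runs. The names and paths below are illustrative placeholders, not real project files, and only the shadow-related inputs are shown changing between runs:

```yaml
# Run 1 - create the shadow deployment and mirror some traffic to it.
with:
    environment: anet-prod
    environment_config_file: config-infra-anet-prod.yml
    deploy: true
    endpoint_name: anet-ep-prod
    deployment_name: anet-eedi                      # existing primary deployment
    endpoint_config: ./configs/endpoint_prod.yml
    deployment_config: ./configs/deployment_prod.yml
    shadow_deployment_name: anet-eedi-v2
    shadow_deployment_config: ./configs/deployment_v2.yml
    shadow_deploment_mirror_percentage: 10

# Run 2 - after testing the shadow deployment, route live traffic to it.
# Drop the mirror percentage and set only the traffic split:
#     shadow_deployment_name: anet-eedi-v2
#     shadow_deployment_traffic_percentage: 50      # primary gets the remaining 50
```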

Shadow Deployment Run Book

  • Create new deployment config. Don't give the deployment the name of xxx_shadow or similar - it will become the primary deployment when 100% of the live traffic is routed to it!
  • Deploy and load test on dev as usual
  • Then with an existing deployment already on a production endpoint...
  • Run workflow to do shadow deployment and mirror some traffic to it
  • Test the shadow deployment. Check its stats from the mirrored data.
  • Run workflow again to update traffic %s
  • Once the former shadow deployment is at 100% live traffic, the old deployment can be deleted. Currently a manual process.
  • Cleanup:
    • Update the primary deployment name to be the same as the former shadow deployment name
    • Remove the redundant config.
    • Remove the shadow deployment params from the workflow

In short:

  • Run once to create shadow deployment
  • Test
  • Run again to route live traffic to it

For updating traffic to 100% wouldn't it be better to simply change the primary deployment?

No, it causes an outage.

Does the primary deployment update if a shadow deployment is defined?

No.

How do load testing, logs, alerts work with shadow deployments?

  • Load testing is run against the endpoint. It will use whatever traffic routing is currently defined.
  • Alert rules are defined at the endpoint level.
  • Diagnostic settings (for logging) are defined at the endpoint level.

To make updates to endpoint level resources, no shadow deployment variables should be defined.

How are container images managed for shadow deployments?

Assuming the shadow deployment config points at .azurecr.io/envname:latest, it will simply use the latest environment image.
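For orientation, a deployment config of that shape might look like the following minimal Azure ML managed online deployment sketch. The names, image path, and instance sizing are invented placeholders; check the project's real deployment config and the Azure ML online deployment schema for the full set of fields.

```yaml
# Hypothetical deployment config; an image tagged :latest is resolved at deploy
# time, so each deployment picks up the most recent environment build.
name: anet-eedi-v2
endpoint_name: anet-ep-prod
environment:
  image: myregistry.azurecr.io/ml-azua-litserve-env:latest
instance_type: Standard_DS3_v2
instance_count: 1
```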

Terraform infrastructure

The reusable terraform_azure_ml_environment.yml workflow handles planning and applying Terraform changes:

jobs:
  terraform-dev:
    uses: Eedi/ml-mlops-templates/.github/workflows/terraform_azure_ml_environment.yml@vX.Y.Z
    with:
      environment: anet-dev
      environment_config_file: config-infra-anet-dev.yml
      apply: true   # switch to false to run plan-only
    secrets: inherit
    permissions:
      id-token: write
      contents: read

Inputs:

  • environment – GitHub environment that should receive the deployment.
  • environment_config_file – Path to the environment YAML consumed by read-yaml.yml.
  • terraform_directory – Path (relative to the workspace) where the Terraform code lives; defaults to ml-mlops-templates/terraform/ml-environment.
  • apply – Set to true to run terraform apply after a successful plan.
  • import_script – Optional helper (defaults to ../../scripts/terraform/import_unmanaged_resources.sh relative to the Terraform directory) run before terraform apply. Override with a path relative to the Terraform working directory if your project needs extra imports.
  • repo_name – Derived automatically from github.event.repository.name and injected via TF_VAR_repo_name for tagging.

The workflow expects the calling repository to provide the same Azure credentials used by the training and deployment pipelines (ARM_CLIENT_ID, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID, ARM_BACKEND_*, etc.).

Training pipelines

train_pipeline.yml submits a materialised Azure ML pipeline definition:

jobs:
  train-prime:
    uses: Eedi/ml-mlops-templates/.github/workflows/train_pipeline.yml@vX.Y.Z
    with:
      environment: anet-dev
      environment_config_file: config-infra-anet-dev.yml
      pipeline_config_path: configs/materialised/pipelines/train_pipeline_eedi_dev.yml
      run_build: true
      ml_env_name: ml-azua-train-env
      ml_env_description: "Environment for ml-azua training"
      tags: "'team=data-science' 'repo=ml-azua'"
      maximise_disk_space: true

Inputs:

  • environment, environment_config_file – Same contract as build.yml / terraform_azure_ml_environment.yml.
  • pipeline_config_path – Path inside the calling repository to the rendered pipeline YAML.
  • run_build – Rebuild the training environment before submission. When true, ml_env_name, ml_env_description, target_layer, tags, and maximise_disk_space are forwarded to build.yml.

The workflow reuses build.yml when requested, then calls scripts/training/train_pipeline.sh to execute az ml job create.
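For orientation, the rendered file that pipeline_config_path points at is a standard Azure ML pipeline job definition. A minimal hypothetical example follows; the compute name, command, and paths are invented for the sketch:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: train-prime
compute: azureml:cpu-cluster
jobs:
  train:
    type: command
    command: python train.py --epochs 10
    code: ./src
    environment: azureml:ml-azua-train-env@latest
```

Submitting it by hand would amount to az ml job create --file <pipeline_config_path>, which is what the workflow's script does on your behalf.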

About

Azure MLOps (v2) solution accelerators: enterprise-ready templates to deploy your machine learning models on the Azure platform.
