with MLFlow and TFX from Google


Overview

MLOps makes machine learning workloads robust and reproducible. For example, you’ll be able to monitor, retrain, and redeploy a model whenever needed while always keeping a model in production.

NOTE: Content here reflects my personal opinions and is not intended to represent any employer (past or present). "PROTIP:" highlights information I haven't seen elsewhere on the internet because it is hard-won, little-known but significant, based on my personal research and experience.

Generic MLOps Lifecycle

Machine Learning Operations (MLOps) makes the machine learning lifecycle scalable using DevOps principles and tools.

NOTE: ML pipeline workflows are, as of this writing, almost always one-way DAGs.

This reference flow from the CD Foundation is shown in the DeepLearning.ai course:

[Diagram: CD Foundation MLOps reference flow]

https://github.com/cdfoundation/cdf-landscape

“Inner loop”:
1). Train model

  • Plan model requirements and performance metrics
  • Create model (and scoring scripts)
  • Verify code and model quality
  • Configure: Standardize infra configuration with Infrastructure as Code (IaC)

“Outer loop”:
2). Package model
3). Validate model
4). Deploy & Release model (to production)
5). Monitor model (in productive use)
6). Retrain model

Workflow Topics:

Model Deployment

  1. New product
  2. Automate/assist with manual task
  3. Replace previous system - gradual ramp-up, roll-back

Canary deployment first to a small fraction of traffic.

PROTIP: Shadow mode runs the ML system in parallel with the human; its output is not used for any decisions during this phase. A small routing sketch follows.
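A minimal sketch of canary routing (names are illustrative, not from any particular serving framework): send a small fraction of traffic to the new model while the current model keeps serving the rest.

import random

# Minimal sketch of canary routing: a small fraction of requests goes to the
# new (canary) model; everything else stays on the current production model.
CANARY_FRACTION = 0.05  # start with ~5% of requests

def route_request(features, current_model, canary_model):
    if random.random() < CANARY_FRACTION:
        return {"model": "canary", "prediction": canary_model.predict(features)}
    return {"model": "current", "prediction": current_model.predict(features)}

# In shadow mode, the new model would instead run on every request, with its
# output only logged for comparison and never returned to the caller.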

AI assistance -> Partial automation -> Full automation

Logging & Monitoring

  • Software metrics: memory, compute, latency, throughput, server load
  • Input metrics: average input length, volume, number of missing values, average image brightness
  • Output metrics: number of times null is returned, user redoes the search, user switches from speech to typing (a sketch of computing such metrics follows)
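A minimal sketch of computing a few such metrics from a log of recent inference requests (the column names are hypothetical), assuming pandas and NumPy:

import numpy as np
import pandas as pd

# Hypothetical log of recent inference requests.
log = pd.DataFrame({
    "query_text": ["play jazz", None, "weather tomorrow"],
    "latency_ms": [42.0, 55.0, 61.0],
    "returned_null": [False, True, False],
})

metrics = {
    "avg_input_length": log["query_text"].dropna().str.len().mean(),  # input metric
    "missing_inputs": int(log["query_text"].isna().sum()),            # input metric
    "p95_latency_ms": float(np.percentile(log["latency_ms"], 95)),    # software metric
    "null_response_rate": float(log["returned_null"].mean()),         # output metric
}
print(metrics)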

Data Versioning

Data versioning involves managing large datasets (100TB+) in cloud storage systems such as S3.
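One approach, sketched below with hypothetical bucket and key names, is to rely on S3's built-in object versioning so that each upload of the same dataset key creates a new retrievable version:

import boto3

# Minimal sketch: enable versioning on the bucket, upload a dataset, and list
# the versions that have accumulated for that key. Names are hypothetical.
s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="my-ml-datasets",
    VersioningConfiguration={"Status": "Enabled"})

s3.upload_file("train.parquet", "my-ml-datasets", "datasets/train.parquet")

versions = s3.list_object_versions(Bucket="my-ml-datasets",
                                   Prefix="datasets/train.parquet")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"])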

  • Realtime or batch?
  • Cloud vs. Edge/Browser
    • Less network bandwidth needed
    • Lower latency
    • Can function even if network connection is down
  • Compute resources (CPU/GPU/memory)
  • Latency, Throughput (QPS)
  • Security and privacy

Monitor for concept and data drift

Practices for a first deployment differ from those for later deployments.

Traceability

OpenTelemetry?
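If OpenTelemetry is used for traceability, a minimal sketch (assuming the opentelemetry-api and opentelemetry-sdk Python packages) of emitting a trace span around one inference call:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for demonstration; a real deployment would
# export to a collector instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("ml.inference")

with tracer.start_as_current_span("predict") as span:
    span.set_attribute("model.version", "2024-01-15")  # hypothetical attribute
    # ... call the model here ...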


Cloud-specific workflows:


Security Mitigations

One of the main reasons to provide a Machine Learning pipeline for Data Scientists to run is that security processes are automatically included without as much manual toil.

Mitre (a non-profit research lab funded by the US government) defined its Mitre ATT&CK framework to present – for each stage in a typical "kill chain" – the TTPs (Tactics, Techniques, and Procedures) adversaries use to attack computer systems. Use it to analyze the kill chain adversaries could use to get in, do damage, and cover their tracks, so that you can prevent it in the future.

https://atlas.mitre.org adds columns for AI & ML (Machine Learning):

[Diagram: the ATLAS matrix, which extends the ATT&CK matrix with AI & ML columns]

The list below presents the ATLAS Mitigations without the need to click around:

Each mitigation is listed as name: description, with any mechanism ATLAS notes in parentheses.

  • Limit Release of Public Information: Limit the public release of technical information about the machine learning stack used in an organization's products or services. Technical knowledge of how machine learning is used can be leveraged by adversaries to perform targeting and tailor attacks to the target system. Additionally, consider limiting the release of organizational information - including physical locations, researcher names, and department structures - from which technical details such as machine learning techniques, model architectures, or datasets may be inferred.
  • Limit Model Artifact Release: Limit public release of technical project details including data, algorithms, model architectures, and model checkpoints that are used in production, or that are representative of those used in production.
  • Passive ML Output Obfuscation: Decreasing the fidelity of model outputs provided to the end user can reduce an adversary's ability to extract information about the model and optimize attacks for the model.
  • Model Hardening: Use techniques to make machine learning models robust to adversarial inputs, such as adversarial training or network distillation.
  • Restrict Number of ML Model Queries: Limit the total number and rate of queries a user can perform. (Mechanisms: API Gateway; see the rate-limiter sketch after this list.)
  • Control Access to ML Models and Data at Rest: Establish access controls on internal model registries and limit internal access to production models. Limit access to training data only to approved users. (Mechanisms: IAM & PAM)
  • Use Ensemble Methods: Use an ensemble of models for inference to increase robustness to adversarial inputs. Some attacks may effectively evade one model or model family but be ineffective against others.
  • Sanitize Training Data: Detect and remove or remediate poisoned training data. Training data should be sanitized prior to model training and recurrently for an active learning model. Implement a filter to limit ingested training data. Establish a content policy that would remove unwanted content such as certain explicit or offensive language from being used.
  • Validate ML Model: Validate that machine learning models perform as intended by testing for backdoor triggers or adversarial bias. Monitor the model for concept drift and training data drift, which may indicate data tampering and poisoning.
  • Use Multi-Modal Sensors: Incorporate multiple sensors to integrate varying perspectives and modalities to avoid a single point of failure susceptible to physical attacks. (Mechanisms: Data Cleansing)
  • Input Restoration: Preprocess all inference data to nullify or reverse potential adversarial perturbations.
  • Restrict Library Loading: Prevent abuse of library loading mechanisms in the operating system and software to load untrusted code by configuring appropriate library loading mechanisms and investigating potentially vulnerable software. File formats such as pickle files that are commonly used to store machine learning models can contain exploits that allow for loading of malicious libraries.
  • Encrypt Sensitive Information: Encrypt sensitive data such as ML models to protect against adversaries attempting to access sensitive data.
  • Code Signing: Enforce binary and application integrity with digital signature verification to prevent untrusted code from executing. Adversaries can embed malicious code in ML software or models. Enforcement of code signing can prevent the compromise of the machine learning supply chain and prevent execution of malicious code.
  • Verify ML Artifacts: Verify the cryptographic checksum of all machine learning artifacts to verify that the file was not modified by an attacker.
  • Adversarial Input Detection: Detect and block adversarial inputs or atypical queries that deviate from known benign behavior, exhibit behavior patterns observed in previous attacks, or come from potentially malicious IPs. Incorporate adversarial detection algorithms into the ML system prior to the ML model.
  • Vulnerability Scanning: Vulnerability scanning is used to find potentially exploitable software vulnerabilities so they can be remediated. File formats such as pickle files that are commonly used to store machine learning models can contain exploits that allow for arbitrary code execution. Both model artifacts and downstream products produced by models should be scanned for known vulnerabilities.
  • Model Distribution Methods: Deploying ML models to edge devices can increase the attack surface of the system. Consider serving models in the cloud to reduce the level of access the adversary has to the model. Also consider computing features in the cloud to prevent gray-box attacks, where an adversary has access to the model preprocessing methods.
  • User Training: Educate ML model developers on secure coding practices and ML vulnerabilities.
  • Control Access to ML Models and Data in Production: Require users to verify their identities before accessing a production model. Require authentication for API endpoints and monitor production model queries to ensure compliance with usage policies and to prevent model misuse.
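"Restrict Number of ML Model Queries" is commonly enforced at an API gateway; as a minimal sketch under assumed names (nothing here comes from ATLAS itself), a per-user token-bucket limiter in front of the model endpoint could look like:

import time

# Minimal sketch of per-user query rate limiting via a token bucket.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per user or API key

def handle_query(user_id, features, model):
    bucket = buckets.setdefault(user_id, TokenBucket(rate_per_sec=1.0, burst=10))
    if not bucket.allow():
        raise RuntimeError("429: query rate limit exceeded")
    return model.predict(features)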

TFX for ML

VIDEO: TFX (TensorFlow Extended) is an open-source end-to-end platform for deploying production ML pipelines. It was created and is used at Google and other Alphabet companies, as well as at Twitter, Airbnb, and PayPal.

Modules (each a different GitHub repo):

  • TF ML Metadata
  • TF Data Validation
  • TF Transform
  • TF Model Analysis
  • TF Serving
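As a minimal sketch (paths and names are hypothetical, assuming TFX 1.x installed as the tfx package), a pipeline that ingests CSVs, computes statistics, and infers a schema, with lineage recorded in ML Metadata:

from tfx import v1 as tfx

# Minimal sketch of a TFX pipeline: ExampleGen -> StatisticsGen -> SchemaGen.
def create_pipeline(data_root="./data", pipeline_root="./tfx_root",
                    metadata_path="./metadata.db"):
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    return tfx.dsl.Pipeline(
        pipeline_name="demo_pipeline",
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen],
        metadata_connection_config=(
            tfx.orchestration.metadata.sqlite_metadata_connection_config(metadata_path)),
    )

# Run locally; other orchestrators (Airflow, Kubeflow Pipelines) use the same pipeline definition.
tfx.orchestration.LocalDagRunner().run(create_pipeline())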

Static datasets are used in prototyping and ML research. Dynamic datasets are used in production.

TensorFlow Lite deploys to mobile devices.

Hidden technical debt in ML systems



Microsoft Azure DevOps

On Azure, MLOps makes use of Azure DevOps, which includes Boards, Repos, and Pipelines. After Microsoft bought GitHub, Azure DevOps also integrates with GitHub repos and GitHub Actions CI/CD.

  • https://azure.microsoft.com/services/devops/?portal=true

Setup by Administrator: an administrator sets up the DevOps environment and manages the tools.

  1. Connect Azure Machine Learning with either Azure DevOps or GitHub.
  2. When an Azure DevOps project is created, you can connect to an existing Azure Machine Learning workspace:
  3. Within a project, go to Project Settings.
  4. Select service connections and create a new one.
  5. Choose Azure Resource Manager.
  6. Choose to authenticate with an automatic Service Principal.
  7. Set the scope level to Machine Learning Workspace and connect to an existing Azure Machine Learning workspace you have access to.
  8. Grant access permission to all pipelines.
  9. Give your service connection a name. You’ll use the name whenever you need to authenticate Azure DevOps to manage the Azure Machine Learning workspace.

  10. Sign into GitHub with an Org Admin account.
  11. Create a GitHub repository.

    Store credentials in GitHub:

  12. Go to your repository’s Settings.
  13. Navigate to the Secrets page.
  14. Select Actions.
  15. Add a new repository secret.
  16. Enter AZURE_CREDENTIALS as the name.
  17. Paste in the output JSON with the credentials and add the secret.

Use Azure DevOps to manage the Azure Machine Learning workspace.

  • https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-azure-devops

PROTIP: Use RBAC roles to limit access to the workspace. Use a system to create and manage the workspace and its users so that there is automation with safeguards such as encryption and proper log analytics.

Create an Azure Machine Learning workspace

Python:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

# Authenticate, scoped to the subscription and resource group that will hold the workspace
ml_client = MLClient(credential=DefaultAzureCredential(),
                     subscription_id="00000000-0000-0000-0000-000000000000",
                     resource_group_name="myresourcegroup")

# Define the workspace, then create it (begin_create returns a poller; .result() waits)
ws_basic = Workspace(name="myworkspace", location="eastus")
ws = ml_client.workspaces.begin_create(ws_basic).result()

Developers as end users

A developer contributes to the project by collaborating on development, connecting to the tools but with restricted access to the configuration of the DevOps environment.


MLFlow

MLflow.org is an open-source platform for the machine learning lifecycle (source at https://github.com/mlflow/mlflow). Among other frameworks, it supports Google's TensorFlow machine learning toolchain.

https://www.databricks.com/product/managed-mlflow
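A minimal sketch of MLflow experiment tracking; the experiment, parameter, metric, and artifact names below are illustrative, not from any particular project:

import mlflow

# Log one run: parameters, a metric, and an artifact file.
mlflow.set_experiment("demo-experiment")
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.42)
    mlflow.log_artifact("model_card.md")  # hypothetical local file to attach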

From Alfredo Deza and Noah Gift of Pragmatic AI Solutions:

time shuf -n 1000 myfile.tsv > myfile.1k.tsv   # randomly sample 1,000 lines into a smaller file
   

From Yaron Haviv and Noah Gift


AutoML

AWS SageMaker


Video courses

https://www.deeplearning.ai/program/machine-learning-engineering-for-production-mlops/ 4 courses on Coursera

https://www.coursera.org/programs/mckinsey-learning-program-uedvm

  1. https://www.coursera.org/learn/introduction-to-machine-learning-in-production/home/welcome
  2. Machine Learning Data Lifecycle in Production
  3. Machine Learning Modeling Pipelines in Production
  4. Deploying Machine Learning Models in Production
  • Robert Crowe, Instructor, TensorFlow Developer Engineer, Google
  • Laurence Moroney, Instructor, Lead AI Advocate, Google
  • Cristian Bartolomé Arámburu, Curriculum Developer, Founding Engineer, Pulsar

Rashid Ali - https://www.linkedin.com/pulse/configuring-azure-devops-selenium-ui-tests-rashid-ali/

https://towardsdatascience.com/machine-learning-in-production-why-you-should-care-about-data-and-concept-drift-d96d0bc907fb Concept and Data Drift

https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/ Monitoring ML Models

A Chat with Andrew on MLOps: From Model-centric to Data-centric (slides) on the DeepLearning.ai YouTube channel

Papers

Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013

Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2020). Challenges in deploying machine learning: A survey of case studies. http://arxiv.org/abs/2011.09926

Sculley, D., Holt, G., Golovin, D., Davydov, E., & Phillips, T. (n.d.). Hidden technical debt in machine learning systems. Retrieved April 28, 2021, from https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

LAB https://github.com/https-deeplearning-ai/MLEP-public/tree/main/course1/week1-ungraded-lab

Conditions for a speech-recognition example, comparing model accuracy vs. human-level performance (HLP), the gap between them, and each condition's share of the data:

  • Clear Speech - 94% vs. 95% HLP - 1% gap - 60% of data
  • Car Noise - 89% vs. 93% HLP - 4% gap - 4% of data
  • People Noise - 87% vs. 89% HLP - 2% gap - 30% of data
  • Low Bandwidth - 70% vs. 70% HLP - no gap

https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/AjC2P/performance-auditing

  • Performance on subsets of data (ethnicity, gender) - see the slice-metrics sketch below
  • How common certain errors are (false positives, false negatives)
  • Performance on rare cases
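A minimal sketch of auditing performance on data slices, assuming pandas and scikit-learn with hypothetical column names:

import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical evaluation results with a slicing column.
results = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "y_true": [1, 0, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 0],
})

# Compute accuracy per slice to surface subgroups where the model underperforms.
for group, slice_df in results.groupby("gender"):
    acc = accuracy_score(slice_df["y_true"], slice_df["y_pred"])
    print(f"{group}: accuracy={acc:.2f} over {len(slice_df)} examples")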

https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/B9eMQ/experiment-tracking Experiment tracking

  • Algorithm/code versioning
  • Dataset used
  • Hyperparameters
  • Results, with summary metrics/analysis
  • Info needed to replicate results
  • Resources used

References

https://blog.ml.cmu.edu/2020/08/31/3-baselines/ Establishing a baseline

https://techcommunity.microsoft.com/t5/azure-ai/responsible-machine-learning-with-error-analysis/ba-p/2141774 Error analysis

https://neptune.ai/blog/ml-experiment-tracking Experiment tracking

Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … Anderljung, M. (n.d.). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. Retrieved May 7, 2021, from http://arxiv.org/abs/2004.07213v2

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. Retrieved from http://arxiv.org/abs/1912.02292

Data provenance refers to where the data comes from; data lineage refers to the sequence of processing steps applied to it.

Direct Labeling (aka Process Feedback): labels come from monitoring predictions rather than from a human "rater". It requires matching prediction results with their corresponding original inference requests (a sketch follows).
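A minimal sketch of that matching step, assuming hypothetical prediction and feedback logs keyed by request_id:

import pandas as pd

# Predictions logged at inference time, joined with outcomes observed later;
# the observed outcome becomes the label for retraining.
predictions = pd.DataFrame({
    "request_id": [101, 102, 103],
    "predicted_click_prob": [0.8, 0.2, 0.5],
})
outcomes = pd.DataFrame({
    "request_id": [101, 102, 103],
    "clicked": [1, 0, 1],
})
labeled = predictions.merge(outcomes, on="request_id")
print(labeled)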

What complicates production machine learning?

  • Model retraining driven by declining model performance.
  • Labeling through weak supervision

TFDV (TensorFlow Data Validation) DEFINITIONS: https://www.coursera.org/learn/machine-learning-data-lifecycle-in-production/lecture/OihrT/detecting-data-issues Drift refers to changes in data over time, such as in data collected once a day (e.g., seasonality).

Concept (covariate) drift

Skew refers to the difference between two static versions, or different sources, such as the training set and the serving set.

  • Schema skew such as text to numeric
  • Distribution skew of features from different data
  • Feature skew

Detect skew by comparing baseline training and serving data
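A minimal sketch of that comparison with TFDV (file paths and the feature name are hypothetical):

import tensorflow_data_validation as tfdv

# Compute statistics for training and serving data, infer a schema from training,
# and flag a feature whose distribution differs too much between the two.
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")

schema = tfdv.infer_schema(statistics=train_stats)
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats,
)
tfdv.display_anomalies(anomalies)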


References:

  • https://cd.foundation/blog/2020/02/11/announcing-the-cd-foundation-mlops-sig/ MLops

  • https://medium.com/@karpathy/software-2-0-a64152b37c35 Data 1st class citizen

  • https://pair.withgoogle.com/chapter/data-collection/ Runners app

  • https://developers.google.com/machine-learning/guides/rules-of-ml Rules of ML

  • https://ai.googleblog.com/2018/09/introducing-inclusive-images-competition.html Bias in datasets

  • https://www.elastic.co/logstash Logstash

  • https://www.fluentd.org/ Fluentd

  • https://cloud.google.com/logging/ Google Cloud Logging

  • AWS ElasticSearch

  • https://azure.microsoft.com/en-us/services/monitor/ Azure Monitor

  • https://blog.tensorflow.org/2018/09/introducing-tensorflow-data-validation.html TFDV

  • https://en.wikipedia.org/wiki/Chebyshev_distance Chebyshev distance