with MLFlow and TFX from Google
Overview
MLOps makes machine learning workloads robust and reproducible. For example, you’ll be able to monitor, retrain, and redeploy a model whenever needed while always keeping a model in production.
NOTE: Content here reflects my personal opinions and is not intended to represent any employer (past or present). "PROTIP:" highlights hard-won, little-known but significant information I haven't seen elsewhere on the internet, based on my personal research and experience.
Generic MLOps Lifecycle
Machine Learning Operations (MLOps) makes the machine learning lifecycle scalable using DevOps principles and tools.
NOTE: ML pipeline workflows are, as of this writing, almost always one-way DAGs.
This reference flow from the CD Foundation is shown in the DeepLearning.ai course:
https://github.com/cdfoundation/cdf-landscape
“Inner loop”:
1). Train model
- Plan model requirements and performance metrics
- Create model (and scoring scripts)
- Verify code and model quality
- Configure: Standardize infra configuration with Infrastructure as Code (IaC)
“Outer loop”:
2). Package model
3). Validate model
4). Deploy & Release model (to production)
5). Monitor model (in productive use)
6). Retrain model
Workflow Topics:
Model Deployment
- New product
- Automate/assist with manual task
- Replace previous system - gradual ramp-up, roll-back
Canary deployment first to a small fraction of traffic.
PROTIP: Shadow mode runs the new model in parallel with a human (or the existing system); the ML output is not used for any decisions during this phase.
AI assistance -> Partial automation -> Full automation
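As a rough illustration of canary and shadow deployment, here is a minimal Python sketch; the `CANARY_FRACTION` value, `SHADOW_MODE` flag, and model objects are hypothetical and not tied to any particular serving framework:

```python
import logging
import random

CANARY_FRACTION = 0.05   # hypothetical: serve 5% of live traffic from the new model
SHADOW_MODE = True       # shadow mode: new model runs in parallel; its output is never served

def predict(request, current_model, new_model):
    """Serve the current model; optionally shadow or canary the new model."""
    if SHADOW_MODE:
        # Log the new model's prediction for offline comparison only.
        logging.info("shadow prediction: %s", new_model.predict(request))
        return current_model.predict(request)
    if random.random() < CANARY_FRACTION:
        return new_model.predict(request)   # canary: small fraction of traffic
    return current_model.predict(request)
```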
Logging & Monitoring
- Software metrics: memory, compute, latency, throughput, server load
- Input metrics: average input length, input volume, number of missing values, average image brightness
- Output metrics: how often the model returns null, how often the user redoes the search or switches to typing
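A minimal sketch of tracking a few of the input/output metrics above, assuming a text-in/text-out service; the class and field names are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    """Rolling counters for a few of the input/output metrics listed above."""
    requests: int = 0
    null_responses: int = 0
    total_input_length: int = 0

    def record(self, input_text: str, output) -> None:
        self.requests += 1
        self.total_input_length += len(input_text)
        if output is None:
            self.null_responses += 1

    def summary(self) -> dict:
        n = max(self.requests, 1)
        return {
            "avg_input_length": self.total_input_length / n,
            "null_response_rate": self.null_responses / n,
        }
```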
Data Versioning
Managing large datasets (100TB+) in cloud storage systems such as S3.
- Realtime or batch?
- Cloud vs. Edge/Browser
- Less network bandwidth needed
- Lower latency
- Can function even if network connection is down
- Compute resources (CPU/GPU/memory)
- Latency, Throughput (QPS)
- Security and privacy
Monitor for concept and data drift
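One generic, framework-agnostic way to flag data drift in a single numeric feature is a two-sample statistical test between the training distribution and recent production data. A hedged sketch using SciPy's `ks_2samp`; the arrays and the 0.01 threshold are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder data: one feature's values from training vs. recent serving traffic
train_values = np.random.normal(0.0, 1.0, size=5_000)
serving_values = np.random.normal(0.3, 1.0, size=5_000)

stat, p_value = ks_2samp(train_values, serving_values)
if p_value < 0.01:  # alert threshold is an assumption; tune per feature
    print(f"Possible data drift: KS statistic={stat:.3f}, p={p_value:.3g}")
```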
Practices for a first deployment differ from those for maintaining and updating a deployment later.
Traceability
OpenTelemetry?
Cloud-specific workflows:
- TFX (TensorFlow eXtended)
- Microsoft Azure DevOps
- MLflow (from Databricks)
- AutoML for AWS SageMaker
Security Mitigations
One of the main reasons to provide a Machine Learning pipeline for Data Scientists to run is that security processes are included automatically, with less manual toil.
MITRE (a non-profit research organization funded by the US government) created MITRE ATT&CK to present, for each stage in a typical "kill chain", the TTPs (Tactics, Techniques, and Procedures) adversaries use to attack computer systems. Use it to analyze the kill chain adversaries could use to get in, do damage, and cover their tracks, so you can prevent that in the future.
https://atlas.mitre.org adds columns for AI & ML (Machine Learning):
- VIDEO: Intro
- Dr. Christina Liaghati explains the MLSecOps needed to harden against the TTPs
- AI Threats & Vulnerabilities
- SANS AI Security Trends by Kirk Trychel
The table below presents the ATLAS Mitigations without the need to click around:
Mitigation Name | Description | Mechanisms |
---|---|---|
Limit Release of Public Information | Limit the public release of technical information about the machine learning stack used in an organization's products or services. Technical knowledge of how machine learning is used can be leveraged by adversaries to perform targeting and tailor attacks to the target system. Additionally, consider limiting the release of organizational information - including physical locations, researcher names, and department structures - from which technical details such as machine learning techniques, model architectures, or datasets may be inferred. | - |
Limit Model Artifact Release | Limit public release of technical project details including data, algorithms, model architectures, and model checkpoints that are used in production, or that are representative of those used in production. | - |
Passive ML Output Obfuscation | Decreasing the fidelity of model outputs provided to the end user can reduce an adversary's ability to extract information about the model and optimize attacks for the model. | - |
Model Hardening | Use techniques to make machine learning models robust to adversarial inputs such as adversarial training or network distillation. | - |
Restrict Number of ML Model Queries | Limit the total number and rate of queries a user can perform. | API Gateway |
Control Access to ML Models and Data at Rest | Establish access controls on internal model registries and limit internal access to production models. Limit access to training data only to approved users. | IAM & PAM |
Use Ensemble Methods | Use an ensemble of models for inference to increase robustness to adversarial inputs. Some attacks may effectively evade one model or model family but be ineffective against others. | - |
Sanitize Training Data | Detect and remove or remediate poisoned training data. Training data should be sanitized prior to model training and recurrently for an active learning model. Implement a filter to limit ingested training data. Establish a content policy that would remove unwanted content such as certain explicit or offensive language from being used. | - |
Validate ML Model | Validate that machine learning models perform as intended by testing for backdoor triggers or adversarial bias. Monitor model for concept drift and training data drift, which may indicate data tampering and poisoning. | - |
Use Multi-Modal Sensors | Incorporate multiple sensors to integrate varying perspectives and modalities to avoid a single point of failure susceptible to physical attacks. | Data Cleansing |
Input Restoration | Preprocess all inference data to nullify or reverse potential adversarial perturbations. | - |
Restrict Library Loading | Prevent abuse of library loading mechanisms in the operating system and software to load untrusted code by configuring appropriate library loading mechanisms and investigating potential vulnerable software. File formats such as pickle files that are commonly used to store machine learning models can contain exploits that allow for loading of malicious libraries. | - |
Encrypt Sensitive Information | Encrypt sensitive data such as ML models to protect against adversaries attempting to access sensitive data. | - |
Code Signing | Enforce binary and application integrity with digital signature verification to prevent untrusted code from executing. Adversaries can embed malicious code in ML software or models. Enforcement of code signing can prevent the compromise of the machine learning supply chain and prevent execution of malicious code. | - |
Verify ML Artifacts | Verify the cryptographic checksum of all machine learning artifacts to verify that the file was not modified by an attacker. | - |
Adversarial Input Detection | Detect and block adversarial inputs or atypical queries that deviate from known benign behavior, exhibit behavior patterns observed in previous attacks or that come from potentially malicious IPs. Incorporate adversarial detection algorithms into the ML system prior to the ML model. | - |
Vulnerability Scanning | Vulnerability scanning is used to find potentially exploitable software vulnerabilities to remediate them. File formats such as pickle files that are commonly used to store machine learning models can contain exploits that allow for arbitrary code execution. Both model artifacts and downstream products produced by models should be scanned for known vulnerabilities. | - |
Model Distribution Methods | Deploying ML models to edge devices can increase the attack surface of the system. Consider serving models in the cloud to reduce the level of access the adversary has to the model. Also consider computing features in the cloud to prevent gray-box attacks, where an adversary has access to the model preprocessing methods. | - |
User Training | Educate ML model developers on secure coding practices and ML vulnerabilities. | - |
Control Access to ML Models and Data in Production | Require users to verify their identities before accessing a production model. Require authentication for API endpoints and monitor production model queries to ensure compliance with usage policies and to prevent model misuse. | - |
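As an example of the "Verify ML Artifacts" mitigation above, here is a minimal sketch that checks a model file against a known-good SHA-256 digest; the file name and expected digest are placeholders:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "<known-good digest recorded when the artifact was published>"
if sha256_of("model.pkl") != EXPECTED:
    raise RuntimeError("Model artifact failed checksum verification")
```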
TFX for ML
VIDEO: TFX (TensorFlow eXtended) is an open-source end-to-end platform for deploying production ML pipelines. It’s created and used at Google and other Alphabet companies as well as Twitter, AirBnB, PayPal.
Modules (each a different GitHub repo):
- TF ML Metadata
- TF Data Validation
- TF Transform
- TF Model Analysis
- TF Serving
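To make those modules concrete, here is a minimal local TFX pipeline sketch based on the standard `tfx.v1` API; the data path, trainer module file, and step counts are placeholder values:

```python
from tfx import v1 as tfx

# Placeholder inputs: a CSV directory and a trainer module file you provide
example_gen = tfx.components.CsvExampleGen(input_base="data/")
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])
trainer = tfx.components.Trainer(
    module_file="trainer.py",
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=5),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="demo_pipeline",
    pipeline_root="pipeline_root/",
    components=[example_gen, statistics_gen, schema_gen, trainer],
    metadata_connection_config=tfx.orchestration.metadata
        .sqlite_metadata_connection_config("metadata.db"),
)
tfx.orchestration.LocalDagRunner().run(pipeline)
```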
Static datasets are used in prototyping and ML research. Dynamic datasets are used in production.
TensorFlow Lite deploys to mobile devices.
Hidden technical debt in ML systems
Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013
Microsoft Azure DevOps
On Azure, MLOps makes use of Azure DevOps, which includes Boards, Repos, Pipelines. After Microsoft bought GitHub, DevOps now also includes GitHub repos and Actions CI/CD.
- https://azure.microsoft.com/services/devops/?portal=true
Setup by Administrator: the administrator sets up the DevOps environment and manages the tools.
- Connect Azure Machine Learning with either Azure DevOps or GitHub.
- When an Azure DevOps project is created, you can connect to an existing Azure Machine Learning workspace:
- Within a project, go to Project Settings.
- Select service connections and create a new one.
- Choose Azure Resource Manager.
- Choose to authenticate with an automatic Service Principal.
- Set the scope level to Machine Learning Workspace and connect to an existing Azure Machine Learning workspace you have access to.
- Grant access permission to all pipelines.
- Give your service connection a name. You'll use the name whenever you need to authenticate Azure DevOps to manage the Azure Machine Learning workspace.
- Sign into GitHub with an Org Admin account.
- Create a GitHub repository.
Store credentials in GitHub:
- Go to your repository’s Settings.
- Navigate to the Secrets page.
- Select Actions.
- Add a new repository secret.
- Enter AZURE_CREDENTIALS as the name.
- Paste in the output JSON with the credentials and add the secret.
Use Azure DevOps to manage the Azure Machine Learning workspace.
- https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-azure-devops
PROTIP: Use RBAC roles to limit access to the workspace. Use a system to create and manage the workspace and its users so that there is automation with safeguards such as encryption and proper log analytics.
Create an Azure Machine Learning workspace
Python:
```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace

# Authenticate and scope the client to the target subscription and resource group
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group_name="myresourcegroup",
)

workspace_name = "myworkspace"
ws_basic = Workspace(name=workspace_name, location="eastus")
ml_client.workspaces.begin_create(ws_basic)
```
Developers as end users
Developers contribute to the project by collaborating on development. They connect to the tools but have restricted access to the configuration of the DevOps environment.
MLFlow
MLflow.org is an open-source platform for the machine learning lifecycle, developed at Databricks: https://github.com/mlflow/mlflow. Among other frameworks, it supports Google's TensorFlow Machine Learning toolchain.
https://www.databricks.com/product/managed-mlflow
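A minimal MLflow tracking sketch; the experiment name, parameter, and metric values are placeholders:

```python
import mlflow

mlflow.set_experiment("demo-experiment")  # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 0.23)
    # mlflow.log_artifact("model.pkl")  # optionally attach artifacts such as a saved model
```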
From Alfredo Deza and Noah Gift of Pragmatic AI Solutions:
- OReilly Course: MLOps Platforms From Zero: Databricks, MLFlow/MLRun/SKLearn
- MLOps Masterclass: Theory to DevOps to Cloud-native to AutoML
- Book: Practical MLOps Video of Foundations (Bash, make, AWS Cloud Shell, Cloud9, collaborate)
For example, to time sampling 1,000 random lines from a large TSV file into a smaller file:
time shuf -n 1000 myfile.tsv > myfile.1k.tsv
From Yaron Haviv and Noah Gift
- Implementing MLOps in the Enterprise September 2023
AutoML
AWS SageMaker
Video courses
https://www.deeplearning.ai/program/machine-learning-engineering-for-production-mlops/ 4 courses on Coursera
https://www.coursera.org/programs/mckinsey-learning-program-uedvm
- https://www.coursera.org/learn/introduction-to-machine-learning-in-production/home/welcome
- Machine Learning Data Lifecycle in Production
- Machine Learning Modeling Pipelines in Production
- Deploying Machine Learning Models in Production
- Robert Crowe, Instructor, TensorFlow Developer Engineer, Google
- Laurence Moroney, Instructor, Lead AI Advocate, Google
- Cristian Bartolomé Arámburu, Curriculum Developer, Founding Engineer, Pulsar
Rashid Ali - https://www.linkedin.com/pulse/configuring-azure-devops-selenium-ui-tests-rashid-ali/
https://towardsdatascience.com/machine-learning-in-production-why-you-should-care-about-data-and-concept-drift-d96d0bc907fb Concept and Data Drift
https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/ Monitoring ML Models
A Chat with Andrew on MLOps: From Model-centric to Data-centric (slides) on the DeepLearning.ai YouTube channel
Papers
Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013
Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2020). Challenges in deploying machine learning: A survey of case studies. http://arxiv.org/abs/2011.09926
Sculley, D., Holt, G., Golovin, D., Davydov, E., & Phillips, T. (2015). Hidden technical debt in machine learning systems. Retrieved April 28, 2021, from https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
LAB https://github.com/https-deeplearning-ai/MLEP-public/tree/main/course1/week1-ungraded-lab
Conditions: Speech recognition accuracy vs. human-level performance (HLP)
Condition | Accuracy | HLP | Gap to HLP | % of data |
---|---|---|---|---|
Clear Speech | 94% | 95% | 1% | 60% |
Car Noise | 89% | 93% | 4% | 4% |
People Noise | 87% | 89% | 2% | 30% |
Low Bandwidth | 70% | 70% | 0% (no difference) | – |
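One hedged way to read that table: multiplying the gap to HLP by the share of data gives a rough upper bound on how much overall accuracy could improve by focusing on each condition (the Low Bandwidth share of data is not given above, so it is omitted):

```python
# Rough prioritization: potential overall gain ≈ (HLP − accuracy) × share of data
conditions = {
    "Clear Speech": (0.94, 0.95, 0.60),
    "Car Noise":    (0.89, 0.93, 0.04),
    "People Noise": (0.87, 0.89, 0.30),
}
for name, (accuracy, hlp, share) in conditions.items():
    print(f"{name}: potential gain ≈ {(hlp - accuracy) * share:.2%}")
# Clear Speech ≈ 0.60%, Car Noise ≈ 0.16%, People Noise ≈ 0.60%
```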
https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/AjC2P/performance-auditing
- Performance on subsets of data (e.g., ethnicity, gender)
- How common certain errors are (false positives, false negatives)
- Performance on rare cases
https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/B9eMQ/experiment-tracking Experiment tracking
- Algorithm/code versioning
- Dataset used
- Hyperparameters
- Results, with summary metrics/analysis
- Info needed to replicate results
- Resources used
References
https://blog.ml.cmu.edu/2020/08/31/3-baselines/ Establishing a baseline
https://techcommunity.microsoft.com/t5/azure-ai/responsible-machine-learning-with-error-analysis/ba-p/2141774 Error analysis
https://neptune.ai/blog/ml-experiment-tracking Experiment tracking
Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … Anderljung, M. (n.d.). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. Retrieved May 7, 2021, from http://arxiv.org/abs/2004.07213v2
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. Retrieved from http://arxiv.org/abs/1912.02292
Data provenance refers to where the data comes from; data lineage refers to the sequence of processing steps applied to it.
Direct Labeling (aka Process Feedback): labels come from monitoring predictions, not from a human "rater". This requires matching prediction results with their corresponding original inference requests.
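A minimal sketch of that matching step, assuming predictions and later-observed outcomes are both logged with a shared request_id; the DataFrames are placeholder data:

```python
import pandas as pd

# Placeholder logs: predictions keyed by request_id, and outcomes observed later
predictions = pd.DataFrame({"request_id": [1, 2, 3], "prediction": [0.9, 0.2, 0.7]})
outcomes = pd.DataFrame({"request_id": [1, 3], "label": [1, 0]})

# Direct labeling / process feedback: join each prediction back to its original request
labeled = predictions.merge(outcomes, on="request_id", how="inner")
print(labeled)
```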
What complicates production machine learning?
- Model retraining driven by declining model performance
- Labeling through Weak Supervision
TFDV (TensorFlow Data Validation) DEFINITIONS: https://www.coursera.org/learn/machine-learning-data-lifecycle-in-production/lecture/OihrT/detecting-data-issues Drift refers to changes in data over time, such as in data collected once a day (e.g., seasonality).
Concept drift: a change in the relationship between inputs and labels (as opposed to a shift in the input distribution alone).
Skew refers to the difference between two static versions or different sources, such as the training set and the serving set.
- Schema skew such as text to numeric
- Distribution skew of features from different data
- Feature skew
Detect skew by comparing baseline training and serving data
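A hedged TFDV sketch of that comparison, following the pattern in the TFDV documentation; the file paths, feature name, and threshold are placeholders:

```python
import tensorflow_data_validation as tfdv

# Placeholder CSV files for the training (baseline) and serving datasets
train_stats = tfdv.generate_statistics_from_csv("train.csv")
serving_stats = tfdv.generate_statistics_from_csv("serving.csv")
schema = tfdv.infer_schema(train_stats)

# Flag a feature whose training vs. serving distributions diverge beyond a threshold
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=serving_stats)
tfdv.display_anomalies(anomalies)
```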
References:
- https://cd.foundation/blog/2020/02/11/announcing-the-cd-foundation-mlops-sig/ MLOps
- https://medium.com/@karpathy/software-2-0-a64152b37c35 Data as a 1st class citizen
- https://pair.withgoogle.com/chapter/data-collection/ Runners app
- https://developers.google.com/machine-learning/guides/rules-of-ml Rules of ML
- https://ai.googleblog.com/2018/09/introducing-inclusive-images-competition.html Bias in datasets
- https://www.elastic.co/logstash Logstash
- https://www.fluentd.org/ Fluentd
- https://cloud.google.com/logging/ Google Cloud Logging
- AWS ElasticSearch
- https://azure.microsoft.com/en-us/services/monitor/ Azure Monitor
- https://blog.tensorflow.org/2018/09/introducing-tensorflow-data-validation.html TFDV
- https://en.wikipedia.org/wiki/Chebyshev_distance Chebyshev distance