Know SQL to pass DP-900, DP-100, DP-300
- Microsoft Data Products Marketing
- DP-900 exam prep
- DP-300 Azure Data Engineer Associate
- SQL Server (IaaS in VM)
- Azure SQL (PaaS)
- Azure Data Platform
- Types of data
- ADF for ETL
- Create SQL database using Portal GUI
- Data Flows
- Data Lifecycle
- ACID Properties in Transactional data
- Azure Data Factory on Portal GUI
- SQL Server Management Studio (SSMS)
- Azure Data Studio
- Azure Synapse NoSQL
- Non-Relational CosmosDB
- Databricks DP-090
- HDInsight “big data”
- More about Azure
This is the hands-on step-by-step tutorial I would give to a developer or administrator getting up and running managing data Azure cloud.
NOTE: Content here are my personal opinions, and not intended to represent any employer (past or present). “PROTIP:” here highlight information I haven’t seen elsewhere on the internet because it is hard-won, little-know but significant facts based on my personal research and experience.
Microsoft Data Products Marketing
This one set of notes were used to study for Microsoft’s three data-related Azure certification exams:
- DP-900: Microsoft Certified: Azure Data Fundamentals
- DP-300: Administering Relational Databases on Microsoft Azure
DP-900 exam prep
Earn the DP-900 “Microsoft Certified: Azure Data Fundamentals” certification by passing the one $99 exam. Describe …
LEARN: Core data concepts (15-20%)
LEARN: how to work with relational data on Azure (25-30%)
LEARN: how to work with non-relational data on Azure (25-30%)
LEARN: an analytics workload on Azure (25-30%)
The Skillpipe associated with the one-day Microsoft live course DP-900T00 roughly covers the above topics using github.com/MicrosoftLearning/DP-900T00A-Azure-Data-Fundamentals/tree/master/Instructions which redirects to a free “Microsoft Learn Sandbox” Directory in Azure:
<a target=”_blank” href=”“https://docs.microsoft.com/en-us/learn/modules/explore-provision-deploy-relational-database-offerings-azure/7-exercise-provision-relational-azure-data-services”>01-Provision-Azure-relational-database-services.md (4 hours)</a> Databases (Community Edition):
- Azure SQL Database
- Azure Database for PostgreSQL
- Azure Database for MySQL
- Getting Started with Azure Data Workloads by Henry Been (henrybeen.nl)
At CloudAcademy: https://cloudacademy.com/learning-paths/dp-900-exam-preparation-microsoft-azure-data-fundamentals-2256/ has Knowledge checks.
Sample DP-900 Exams:
Earn the DP-100 “Microsoft Certified: Azure Data Fundamentals” certification by passing the one $99 exam. Describe …
DP-300 Azure Data Engineer Associate
To be a Microsoft Certified: Azure Database Administrator Associate, pass the single DP-300. The textual LEARN tutorial on that page has a structure similar to Microsoft’s 4-day DP-300T00 class taught live by Paola Escoba. I highly recommend you to walk through their (free) lab instructions in English at STAR: github.com/MicrosoftLearning/DP-300T00-Administering-Relational-Databases-on-Azure (and other languages) by Graeme Malcolm. The class Modules and labs aka.ms/dp300labs.
Lab 1 Using the Azure Portal and SQL Server Management Studio</a> - explore the Azure Portal and use it to create an Azure VM with SQL Server 2019 installed. Connect to the virtual machine through RDP (Remote Desktop Protocol) and restore a database using SSMS (SQL Server Management Studio).
DP-300_02_lab - Lab 2 – Deploying PaaS databases - configure and subsequently implement security in the Azure Portal and within the AdventureWorks database. configure basic resources needed to deploy an Azure SQL Database with a Virtual Network Endpoint. Connectivity to the SQL Database will be validated using Azure Data Studio from the lab VM. Finally, an Azure Database for PostgreSQL will be created.
Plan and implement data platform resources (LEARN)
Implement a secure environment for a database service (INTRO, LEARN, Lab 3)
Monitor and optimize operational resources (LEARN, Lab 4) - scope out deliverables for a digital transformation project within AdventureWorks. Examining the Azure portal as well as other tools, determine how to utilize native tools to identify and resolve performance related issues. Identify fragmentation within the database as well as learn steps to resolve the issue appropriately.
Optimize query performance (LEARN, Lab 5) - evaluate a database design for problems with normalization, data type selection and index design. Run queries with suboptimal performance, examine the query plans, and attempt to make improvements within the AdventureWorks2017 database.
Perform automation of tasks (LEARN, Lab 6) - take the information gained in the lessons to configure and subsequently implement automate processes within AdventureWorks.
Plan and implement a High Availability and Disaster Recovery (HADR) environment (LEARN, Lab 7) - execute two main tasks: make Azure SQL Database geo-redundant, and backup to and restore from a URL which uses Azure.
Perform administration by using T-SQL
SQL Server (IaaS in VM)
Traditionally, SQL Server 2019 software run within a single Azure VM (IaaS) instance. This is still the approach for large (64TB) SQL databases.
SQL agent jobs back up directly to a URL linked to Azure blob storage. Azure provides the option to use geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS) to ensure that backup files are stored safely across the geographic landscape.
Additionally, as part of the Azure SQL VM service provider, you can have your backups automatically managed by the platform.
SQL Server provides access to the underlying OS, but that also means you need to keep that OS updated. Additionally, the SQL Server IaaS Agent Extension reduces your administrative overhead:
- SQL Server automated backup
- SQL Server automated patching
- Azure Key Vault integration
When used in conjunction with Azure managed storage, a single Azure Virtual Machine provides three nines (99.9%) of high availability. That a downtime of no more than 8.77 hours each year.
In addition to Availability Groups for Virtual Machines for disaster recovery, SQL Server has two major options for high availability:
- Always On availability groups and
- Failover Cluster Instances.
- https://github.com/microsoft/sqlworkshops-sql2019workshop to learn about the latest innovations available in SQL Server 2019
Always On availability groups (AG)
Always On availability groups are implemented between two to nine SQL Server instances running on Azure virtual machines or across Azure or an on-premises data center.
Database transactions are committed to the primary replica, and then the transactions are sent to all secondary replicas.
Transactions are sent in either synchronously or asynchronously availability mode, based on physical distance between servers.
- If the workload requires the lowest possible latency or the secondary replicas are geographically spread apart, asynchronous availability mode is recommended.
- If the replicas are within the same Azure region and the applications can withstand some level of latency, synchronous commit mode should be considered. Synchronous mode ensures that each transaction is committed to one or more secondaries before allowing the application to continue.
Always On availability groups provide both high availability and disaster recovery, because a single availability group can support both synchronous and asynchronous availability modes. The unit of failover for an availability group is a group of databases, and not the entire instance.
SQL Server Failover Cluster instances
If you need to protect the entire instance, you could use a SQL Server Failover Cluster Instance (FCI), which provides high availability for an entire instance, in a single region. A FCI doesn’t provide disaster recovery without being combined with another feature like availability groups or log shipping. FCIs also require shared storage that can be provided on Azure by using shared file storage or using Storage Spaces Direct on Windows Server.
For Azure workloads, availability groups are the preferred solution for newer deployments, because the shared storage require of FCIs increases the complexity of deployments. However, for migrations from on-premises solutions, an FCI may be required for application support.
Azure SQL (PaaS)
VIDEO: Azure SQL for beginners
“Microsoft SQL Services” (for the cloud) was announced with Windows Azure in 2008.
In 2010, the “Azure SQL” PaaS was announced as a “cloud database offering that Microsoft provides as part of the Azure cloud computing platform. Unlike other editions of SQL Server, you do not need to provision hardware for, install or patch Azure SQL; Microsoft maintains the platform for you. You also do not need to architect a database installation for scalability, high availability, or disaster recovery as these features are provided automatically by the service.”
PaaS SQL is versionless.
In 2014 announced elastic database pools, vCore choices, business-critical deployments, hyperscale, and serverless architectures.
Tiers for performance and availability using vCore pricing model:
- Single Azure SQL Database
- SQL Database / SQL Managed Instance
- SQL Database / SQL Managed Instance pools
|Service Tier||Max. Size||Latency||Avail. SLA|
|General Purpose||4TB (8TB for Managed Instance)||5-10 ms||99.99%|
|Business Critical||4TB||1-2 ms (SSD)||99.995% in 4-node ZRS cluster|
With Azure SQL Database, the SQL Managed Instance handles up to 8TB databases.
Hyperscale scales up and down quickly.
“Azure Database for MySQL, PostgreSQL” also supports MariaDB.
Data Migration Assistant can recognize when SQL Server Stretch Database migrates on-prem. cold table rows to Azure (to avoid buying more on-prem. storage). On-prem. backups can then bypass cold table rows (and run quicker).
Elastic Pool doesn’t work in Hyperscale.
Azure SQL Pricing
The DTU (Database Transaction Unit) model isn’t available in Azure SQL Managed Instance.
DOCS: Instead of DTU, which has a bundled measure for pricing compute, storage, and IO resources, the vCore-based pricing model has independent charges for compute, storage, and I/O.
The vCore model also allows use of Azure Hybrid Benefit for SQL Server and/or reserved capacity (pay in advance) to save money. Neither of these options is available in the DTU model.
With the Serverless Compute Tier, if there is no activity, it pauses the database and halts compute charges.
SQL Database achieves HA with “Always ON Availability Groups” tech from SQL Server, which makes Failover automatic (but takes 30 seconds).
Up to 4 replicas can become the primary, as long as secondaries have the same user authentication config. and firewall rules as the primary.
Backups by transaction log occur every 5-10 minutes. Backups are saved for 7 days by default (Basic plan), up to 35 days under Standard/Premium. Long-term Retention can be up to 10 years. Lowest RPO is one hour of data loss for RTO of up to 12 hours for geo-replication.
- https://www.oreilly.com/library/view/pro-sql-server/9781484241288/ introduces SQL Server on Linux. In the process, it walks through topics that are fundamental to SQL Server.
Azure Data Platform
- Azure Data Factory (ADF)
- Stream Analytics
- A Data Lake holds raw data after ingestion. Gen2 big data analytics with Hadoop compatible access built on Azure Blob storage with a superset of POSIX permissions
- A Data Lake House (Databricks) makes use of Spark data warehouse
Types of data
OLTP = Data is stored one transaction at a time.
OLAP = data periodically loaded, aggregataed, stored in a cube.
A Data Warehouse
Polybase is file-based, retrieve data from Excel, etc.
SSIS is also heterogenous
File format types:
- Avro (row-based)
- Parquet (columnal-based)
ORC (Optimized Row Columnar) stores Hive data efficiently
- Binary (pdf)
- Delimited text (CSV)
- Excel (XML)
ADF for ETL
Azure Data Factory (ADF) is Heterogenous - it has over 100 different connectors to various other systems.
Linked service to Data Lake Store, Azure Databricks.
Create SQL database using Portal GUI
- In the portal, get the SQL databases blade after pressing G+\ or clicking the Home (3 line icon) at the top-left of the Portal.
”+ Create” to “Create SQL database”. The menu:
Basics Networking Security Additional settings Tags Review+Create
- Resource group:
- Database name: up to 128 characters, unique on same server.
Want to use SQL elastic pool? Leave default: “No”.
Elastic pools have multiple Azure SQL Database instances share the same resources (memory, storage, processing). Elastic pools provide a simple and cost effective solution for managing the performance of multiple databases within a fixed budget. An elastic pool provides compute (eDTUs) and storage resources that are shared between all the databases it contains.
Databases within a pool only use the resources they need, when they need them, within configurable limits. The price of a pool is based only on the amount of resources configured and is independent of the number of databases it contains.
PROTIP: That means databases are charged for allocations, not usage.
- Compute + storage
For answers needed today and tomorrow …
Batch jobs - REMEMBER:
ETL = Extract, Transform, Load into SQL star databases with usage “schema on write” for faster read
ELT = Extract, Load, Transform = data saved as-is into NoSQL (document) databases with usage “schema on read” for greater scale and exploration
Hybrid - data ingested on-prem, transformed in the cloud
Stream Processing (real-time)
- Collection of data
- Preparation of collected data
- Ingestion of data into storage
- Processing or transformation of data into a usable form
- Analysis of transformed data
ACID Properties in Transactional data
Atomicity - each transaction is treated as a ingle unit, which is successful completely or failed completely. Back-off intermediate changes.
Consistency - transactions can only take the data in the database from one valid state to another.
Isolation - concurrent execution of transactions leave the database in the same state.
Durability - once a transaction has been committed, it remains committed.
Azure Data Factory on Portal GUI
In the portal, click “+ Create a resource”, then in “Search services and Marketplace” type enough of “Data Factory” to select it from the drop-down list.
Click “Create” after confirming that it’s from “Microsoft”.
Integrate data silos with Azure Data Factory, a service built for all data integration needs and skill levels. Easily construct ETL and ELT processes code-free within the intuitive visual environment, or write your own code. Visually integrate data sources using more than 90+ natively built and maintenance-free connectors at no added cost. Focus on your data - the serverless integration service does the rest.
- No code or maintenance required to build hybrid ETL and ELT pipelines within the Data Factory visual environment
- Cost-efficient and fully managed serverless cloud data integration tool that scales on demand
- Azure security measures to connect to on-premises, cloud-based, and software-as-a-service apps with peace of mind
- SSIS integration runtime to easily rehost on-premises SSIS packages in the cloud using familiar SSIS tools
- Resource group:
- Database name: up to 128 characters, unique on same server.
ADF automates data movement and transformation (ETL).
ADF can spin up and down HDInsights clusters.
Process in Factory Resources:
- Combine datasets (sources)
- Data flows: Select columns
- Write output to target datasets (using Power Query?)
See Pluralsight: “Building your First Data Pipeline in Azure Data Factory” by Emillio Melo
See my PowerBI notes
See Pluralsight: “Building your First Power BI Report”
SQL Server Management Studio (SSMS)
SSMS is integrated to visualize and work with Azure SQL, including SQL Server in virtual machines, SQL managed instances, and SQL databases. When necessary, SSMS shows only options that work for a specific Azure service.
Azure Data Studio
Azure Data Studio is an open-source, cross-platform client GUI tool for querying and working with various Azure data sources, including SQL Server and Azure SQL. Its “notebooks” allows mixing runnable code cells and formatted text in one place.
Azure Synapse NoSQL
VIDEO: How to configure Data Factory to ingest data for Azure Synapse Analytics.
Azure Synapse Analytics was rebranded from “Azure SQL Data Warehouse”.
Integrates with Apache Spark. (Spark jobs can also be run in Azure Databricks and Azure HDInsight)
Synapse has a “Massively Parallel” engine of partitioned instances (sharding)
PROTIP: Apache’s Parquet file format generally performs better than CSV because it provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It is called a “columnar” storage format similar to other columnar-storage file formats available in Hadoop (RCFile and ORC). So it is compatible with most data processing frameworks in the Hadoop environment. Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. References:
There is a class and certification specific to Databricks: https://docs.microsoft.com/en-us/learn/certifications/courses/dp-090t00
Earn the “Microsoft Certified: Azure Data Fundamentals” certification by passing the one $99 exam: DP-090 exam:
HDInsight “big data”
Batch-only (runs cannot be stopped).
But can scale up and down.
- Kafka - stream processing message broker
- Hadoop - Distributed file system
- Spark - data cluster computing
- Apache Storm - real-time analytics Stream computation
Azure Data Community lists blogs, websites, videos, podcasts, and meetups.
More about Azure
This is one of a series about Azure cloud:
- Azure cloud introduction
- Azure Cloud Onramp (Subscriptions, Portal GUI, CLI)
- RDP client to access servers
- Bash Windows using Microsoft’s WSL (Windows Subsystem for Linux)
- Microsoft PowerShell ecosystem
- Azure Cloud Powershell
- PowerShell DSC (Desired State Configuration)
- PowerShell Modules
- Azure Networking
- Azure Storage
- Azure Compute
- Azure cloud DevOps
- Dockerize apps
- Kubernetes container engine
- Hashicorp Vault and Consul for keeping secrets
- Hashicorp Terraform
- Microsoft AI in Azure cloud
- Azure Monitoring
- Azure KSQL (Kusto Query Language) for Azure Monitor, etc.
- Dynatrace cloud monitoring
- Cloud Performance testing/engineering
- Cloud JMeter