
Overview

This is the hands-on, step-by-step tutorial I would give to a developer or administrator getting up and running managing data in the Azure cloud.

DP-900 exam prep

Earn the “Microsoft Certified: Azure Data Fundamentals” certification by passing the single $99 DP-900 exam, which measures how well you can describe each of these domains:

  • LEARN: Core data concepts (15-20%)

  • LEARN: How to work with relational data on Azure (25-30%)

  • LEARN: How to work with non-relational data on Azure (25-30%)

  • LEARN: Analytics workloads on Azure (25-30%)

The Skillpipe courseware for the one-day Microsoft live course DP-900T00 roughly covers the above topics using github.com/MicrosoftLearning/DP-900T00A-Azure-Data-Fundamentals/tree/master/Instructions, which redirects to a free “Microsoft Learn Sandbox” directory in Azure:

01-Provision-Azure-relational-database-services.md (4 hours) at https://docs.microsoft.com/en-us/learn/modules/explore-provision-deploy-relational-database-offerings-azure/7-exercise-provision-relational-azure-data-services covers these databases (Community Edition):

  • Azure SQL Database
  • Azure Database for PostgreSQL
  • Azure Database for MySQL

02-Use-SQL-to-query-Azure-SQL-Database.md

03-Provision-non-relational-Azure-data-services.md

04-Upload-download-and-query-data-in-a-non-relational-data-store.md

At Pluralsight

  • Getting Started with Azure Data Workloads by Henry Been (henrybeen.nl)

At CloudAcademy, the learning path at https://cloudacademy.com/learning-paths/dp-900-exam-preparation-microsoft-azure-data-fundamentals-2256/ includes Knowledge checks.

Sample Exams


DP-100 exam prep


Microsoft Data Products Marketing

https://azure.microsoft.com/en-us/resources/videos/dev-stories-troy-hunt-video/

https://economicgraph.linkedin.com/

Types of data

OLTP (Online Transaction Processing) = data is stored one transaction at a time.

OLAP (Online Analytical Processing) = data is periodically loaded, aggregated, and stored in a cube, supporting:

  • Summary
  • Trend

A Data Lake holds raw data after ingestion.

A Data Warehouse holds structured, processed data that is ready for analysis.

PolyBase is file-based: it uses T-SQL to query data in external files (such as those in Azure Blob Storage or a Data Lake) as if they were relational tables.

SSIS (SQL Server Integration Services) is also heterogeneous, with connectors to many different source and destination systems.

ADF

Azure Data Factory (ADF) is heterogeneous: it has over 100 different connectors to various other systems.

A Linked Service defines ADF’s connection to external resources such as Azure Data Lake Store and Azure Databricks.


Create SQL database using Portal GUI

  1. In the portal, open the SQL databases blade by pressing G+\ or by clicking the Home (3-line) icon at the top-left of the Portal.
  2. Click “+ Create” to “Create SQL database”. The menu:

    Basics | Networking | Security | Additional settings | Tags | Review + Create

  3. Resource group:
  4. Database name: up to 128 characters, unique on same server.
  5. Server:
  6. Want to use SQL elastic pool? Leave default: “No”.

    With elastic pools, multiple Azure SQL Database instances share the same resources (memory, storage, processing). Elastic pools provide a simple and cost-effective solution for managing the performance of multiple databases within a fixed budget. An elastic pool provides compute (eDTUs) and storage resources that are shared between all the databases it contains.

    Databases within a pool only use the resources they need, when they need them, within configurable limits. The price of a pool is based only on the amount of resources configured and is independent of the number of databases it contains.

    PROTIP: That means databases are charged for allocations, not usage.

  7. Compute + storage
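The same database can be provisioned from code instead of clicking through the portal. Below is a minimal sketch using the azure-mgmt-sql Python SDK, assuming the resource group and logical server already exist; all names and the region are placeholders, not values from this article:

```python
# Minimal sketch: provision an Azure SQL database from Python.
# pip install azure-identity azure-mgmt-sql
# ASSUMPTIONS: the resource group and logical server already exist;
# all names and the region below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database

client = SqlManagementClient(DefaultAzureCredential(),
                             "<your-subscription-id>")

# begin_create_or_update returns a poller; .result() waits for completion.
poller = client.databases.begin_create_or_update(
    resource_group_name="my-resource-group",
    server_name="my-sql-server",
    database_name="my-database",        # up to 128 chars, unique per server
    parameters=Database(location="eastus"),  # same region as the server
)
print(poller.result().status)
```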

Data Flows

For answers needed today and tomorrow …

Batch jobs - REMEMBER:

  • ETL = Extract, Transform, Load into SQL star-schema databases, using “schema on write” for faster reads

  • ELT = Extract, Load, Transform = data saved as-is into NoSQL (document) databases, using “schema on read” for greater scale and exploration (see the sketch after this list)

  • Hybrid = data ingested on-premises, transformed in the cloud
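To make the schema-on-write vs. schema-on-read distinction concrete, here is a toy sketch in plain Python; the records and field names are made up purely for illustration:

```python
# Toy contrast of ETL (schema on write) vs. ELT (schema on read).
# All names and records here are invented for illustration.
import json

raw_events = ['{"user": "amy", "amount": "42.50", "ts": "2021-06-01"}',
              '{"user": "bob", "amount": "7", "ts": "2021-06-02"}']

# ETL / schema on write: transform (parse, type-cast) BEFORE loading,
# so every stored row already conforms to the target schema.
etl_table = []
for line in raw_events:
    rec = json.loads(line)
    etl_table.append({"user": rec["user"], "amount": float(rec["amount"])})

# ELT / schema on read: load the raw text as-is; apply a schema only
# when a query runs, which tolerates messy or evolving data.
elt_store = list(raw_events)                      # stored untouched
total = sum(float(json.loads(l)["amount"]) for l in elt_store)
print(etl_table, total)
```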

Stream Processing (real-time)

Data Lifecycle

Pipelines:

  1. Collection of data
  2. Preparation of collected data
  3. Ingestion of data into storage
  4. Processing or transformation of data into a usable form
  5. Analysis of transformed data

ACID Properties in Transactional data

Atomicity - each transaction is treated as a single unit, which succeeds completely or fails completely. Intermediate changes are rolled back on failure (see the sketch after this list).

Consistency - transactions can only take the data in the database from one valid state to another.

Isolation - concurrent execution of transactions leaves the database in the same state that sequential execution would have produced.

Durability - once a transaction has been committed, it remains committed.
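Atomicity in practice: this minimal sketch uses the pyodbc driver against a SQL database to commit two statements as a single unit, rolling both back if either fails. The connection string, table, and column names are placeholders:

```python
# Minimal sketch: an atomic two-statement transaction via pyodbc.
# pip install pyodbc  (requires the ODBC Driver for SQL Server)
# The connection string and table/column names are placeholders.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=my-server.database.windows.net;"
                      "DATABASE=my-database;UID=me;PWD=secret")
conn.autocommit = False          # group statements into one transaction
cursor = conn.cursor()
try:
    cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()                # both statements succeed together...
except pyodbc.Error:
    conn.rollback()              # ...or neither is applied (atomicity)
    raise
finally:
    conn.close()
```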

Azure Data Factory on Portal GUI

  1. In the portal, click “+ Create a resource”, then in “Search services and Marketplace” type enough of “Data Factory” to select it from the drop-down list.


  2. Click “Create” after confirming that it’s from “Microsoft”.

    Integrate data silos with Azure Data Factory, a service built for all data integration needs and skill levels. Easily construct ETL and ELT processes code-free within the intuitive visual environment, or write your own code. Visually integrate data sources using more than 90 natively built and maintenance-free connectors at no added cost. Focus on your data - the serverless integration service does the rest.

    • No code or maintenance required to build hybrid ETL and ELT pipelines within the Data Factory visual environment
    • Cost-efficient and fully managed serverless cloud data integration tool that scales on demand
    • Azure security measures to connect to on-premises, cloud-based, and software-as-a-service apps with peace of mind
    • SSIS integration runtime to easily rehost on-premises SSIS packages in the cloud using familiar SSIS tools

  3. Resource group:
  4. Name: the Data Factory name must be globally unique.
  5. Region:

ADF automates data movement and transformation (ETL).

ADF can spin up and down HDInsights clusters.

Process in Factory Resources:

  1. Pipeline
  2. Combine datasets (sources)
  3. Data flows: Select columns
  4. Write output to target datasets (using Power Query?)
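The factory itself can also be provisioned from code. Here is a minimal sketch using the azure-mgmt-datafactory Python SDK, assuming an existing resource group; the names and region are placeholders:

```python
# Minimal sketch: create a Data Factory instance from Python.
# pip install azure-identity azure-mgmt-datafactory
# ASSUMPTIONS: the resource group exists; names/region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

client = DataFactoryManagementClient(DefaultAzureCredential(),
                                     "<your-subscription-id>")

factory = client.factories.create_or_update(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",     # must be globally unique
    factory=Factory(location="eastus"),
)
print(factory.provisioning_state)
```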

See Pluralsight: “Building your First Data Pipeline in Azure Data Factory” by Emilio Melo

PowerBI


See Pluralsight: “Building your First Power BI Report”

SQL

With Azure SQL Database, Microsoft takes care of updates to the operating system and MS SQL software. The SQL Managed Instance tier handles databases up to 8TB.

Tier              | Max. Size                      | Latency                 | Avail. SLA
General Purpose   | 4TB (8TB for Managed Instance) | 5-10 ms                 | 99.99%
Hyperscale        | 100TB                          | instant backups; scales |
Business Critical | 4TB                            | 1-2 ms (SSD)            | 99.995% in 4-node ZRS cluster

Hyperscale scales up and down quickly.

Azure Database for MySQL, PostgreSQL

Microsoft also supports MySQL, MariaDB, and PostgreSQL.

Data Migration Assistant can recognize when SQL Server Stretch Database migrates on-prem. cold table rows to Azure (to avoid buying more on-prem. storage). On-prem. backups can then bypass cold table rows (and run quicker).

Elastic Pool doesn’t work in Hyperscale.

DOCS: Instead of DTU, which has a bundled measure for pricing compute, storage, and IO resources, the vCore-based pricing model has independent charges for compute, storage, and I/O.

With the Serverless Compute Tier, if there is no activity, it pauses the database and halts compute charges.

SQL Database achieves HA with “Always On Availability Groups” tech from SQL Server, which makes failover automatic (but takes 30 seconds).

Up to 4 replicas can become the primary, as long as secondaries have the same user authentication config. and firewall rules as the primary.

Backups by transaction log occur every 5-10 minutes. Backups are saved for 7 days by default (Basic plan), up to 35 days under Standard/Premium. Long-term Retention can be up to 10 years. For geo-replicated restores, the lowest RPO is one hour of data loss, with an RTO of up to 12 hours.

NoSQL Azure Synapse

VIDEO: How to configure Data Factory to ingest data for Azure Synapse Analytics.

Azure Synapse Analytics was rebranded from “Azure SQL Data Warehouse”.

Integrates with Apache Spark. (Spark jobs can also be run in Azure Databricks and Azure HDInsight)
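A Spark job looks much the same whichever of those services runs it. Here is a minimal PySpark sketch of an aggregation job; the input path and column names are invented for illustration:

```python
# Minimal PySpark sketch of an aggregation job; the input path and
# column names are placeholders, not from this article.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-rollup").getOrCreate()

df = spark.read.parquet("abfss://data@myaccount.dfs.core.windows.net/sales/")
rollup = (df.groupBy("region")
            .agg(F.sum("amount").alias("total_amount")))
rollup.show()
spark.stop()
```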

Non-Relational CosmosDB

PROTIP: Apache Parquet generally performs better than CSV because it provides efficient data compression and encoding schemes, with enhanced performance for handling complex data in bulk. Parquet is a free and open-source column-oriented (“columnar”) data storage format from the Apache Hadoop ecosystem, similar to the other columnar-storage file formats available in Hadoop (RCFile and ORC), so it is compatible with most data processing frameworks in the Hadoop environment. A size-comparison sketch appears after the reference below. References:

  • https://www.upsolver.com/blog/apache-parquet-why-use
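To see the size difference yourself, this sketch writes the same synthetic DataFrame as CSV and as Parquet (pandas with the pyarrow engine installed) and compares the resulting file sizes:

```python
# Minimal sketch comparing CSV vs. Parquet output size for the same data.
# pip install pandas pyarrow  -- the DataFrame below is synthetic.
import os
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "north", "south"] * 25_000,
    "amount": range(100_000),
})

df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet", index=False)   # columnar + compressed

for path in ("sales.csv", "sales.parquet"):
    print(path, os.path.getsize(path), "bytes")
```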

Databricks

There is a class specific to Databricks: https://docs.microsoft.com/en-us/learn/certifications/courses/dp-090t00 (DP-090: “Implementing a Machine Learning Solution with Microsoft Azure Databricks”). Note that DP-090 is a training course, not a separate certification exam.

HDInsight “big data”

Batch-only: once started, runs cannot be stopped.

But can scale up and down.

  • Kafka - stream processing message broker
  • Hadoop - Distributed file system
  • Spark - data cluster computing
  • Apache Storm - real-time stream analytics and computation

Azure Data Platform

Services include:

  • Azure Data Factory (ADF)
  • Stream Analytics
  • Data Lake

Social

Azure Data Community lists blogs, websites, videos, podcasts, and meetups.

https://www.twitch.tv/425show

More about Azure

This is one of a series about Azure cloud: