Wilson Mar

Create analytics visualization dashboards pulling from data lakes and Delta Lakes (SaaS on Azure and AWS), coding Apache Spark SQL, Python notebooks, low-code AutoML, and MLflow

Overview

This article aims to avoid “salesy” generalizations and present a deep yet succinct hands-on tutorial so you get proficient quickly.

NOTE: Content here is my personal opinion, and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known but significant facts based on my personal research and experience.

Databricks, the company

Databricks is headquartered in San Francisco with offices around the world.

Databricks calls itself the “Data + AI” company. Databricks is on a mission to “simplify and democratize data and AI, helping data teams solve the world’s toughest problems.” Databricks boasts hundreds of global partners, including Microsoft, Amazon, Tableau, Informatica, Capgemini, and Booz Allen Hamilton.

With origins in academia and the open-source community, the company was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow. Built on a modern Lakehouse architecture in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.

The company was founded by the original creators of Apache Spark to give data scientists and data engineers a cloud-based, vendor-managed platform for running analytics easily and at scale. Databricks provides a managed Spark cloud service along with a platform for managing the full data analytics lifecycle: Python Jupyter notebooks for interactive data exploration, dashboards for sharing visualizations, and jobs for scheduling and automating workflows. Libraries for machine learning are added to the platform.

Databricks coined the term “Lakehouse” for an architecture that combines a data warehouse and a data lake to offer an open and unified cloud platform for data and AI.

Competitors

Databricks competes by offering integrated, cloud-based, multi-region Delta lakehouses. The approaches compare as follows:

Feature | Traditional | DataLake | DeltaLake
Persona emphasis | Data Engineer | Data Analyst | Citizen
Handle Relational data structure | Yes | Yes
Handle Semi-structured data | No | Yes
Handle Unstructured (pdf, audio, photo, video, etc.) | No | Yes
Processing | ETL | ELT
Streaming | No | Yes

Architecture

databricks-arch-3840x2400.png

Serverless SQL Compute provides elastic scaling and automatic backups, and is kept patched and upgraded.

Streaming: databricks-streaming-1850x847.png

Delta Lake Z-Ordering by eventType co-locates related rows so that queries can skip data blocks they do not need, which reduces I/O and improves performance:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
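
To see why clustering by eventType lets a query skip data, here is a minimal Python sketch of data skipping. It is a toy model under stated assumptions, not Delta Lake's implementation: the file names and min/max values are hypothetical, whereas Delta Lake records per-file column statistics in its transaction log.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Min/max of eventType for one data file (hypothetical values)."""
    path: str
    min_event: str
    max_event: str


def files_to_scan(files, event_type):
    """Keep only files whose [min, max] range could contain event_type."""
    return [f.path for f in files if f.min_event <= event_type <= f.max_event]


# Unclustered: every file spans a wide range, so no file can be skipped.
unclustered = [
    FileStats("part-0", "click", "view"),
    FileStats("part-1", "click", "view"),
    FileStats("part-2", "login", "view"),
]

# Clustered by eventType: each file covers a narrow range.
clustered = [
    FileStats("part-0", "click", "click"),
    FileStats("part-1", "login", "purchase"),
    FileStats("part-2", "view", "view"),
]

print(files_to_scan(unclustered, "purchase"))  # all three files must be read
print(files_to_scan(clustered, "purchase"))    # only part-1 must be read
```

The narrower each file's value range, the more files a filter on eventType can rule out from the statistics alone.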

Samples

Databricks claims as customers more than five thousand organizations worldwide — including Shell, Comcast, CVS Health, HSBC, T-Mobile and Regeneron.

New York City Taxi data 2009-2020+

  • https://www.kaggle.com/c/nyc-taxi-trip-duration
  • https://www.kaggle.com/datasets/microize/newyork-yellow-taxi-trip-data-2020-2019
  • https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets
  • https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/demo-data-nyctaxi-in-sql?view=sql-server-ver16
  • https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

SELECT * FROM samples.nyctaxi.trips LIMIT 1000;

Apache Spark

CLI

Install the Databricks CLI on your local machine to run commands against your Databricks workspace.

  • brew tap databricks/tap; brew install databricks; which databricks; databricks --help
  • databricks configure --token ???
  • databricks clusters list # IDs
  • If the CLI has not been configured yet, commands fail with: Error: cannot load Databricks config file: no configuration file found at /Users/wilsonmar/.databrickscfg; please create one first
  • databricks clusters get 1122-123456-abc123 | jq -r .name
  • databricks clusters start 1122-123456-abc123
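
The "no configuration file found" error above comes from a missing ~/.databrickscfg. As a minimal sketch, assuming the file uses the INI layout with a [DEFAULT] profile holding host and token keys, this Python function checks a profile before you run CLI commands. The function name load_profile is my own, not part of any Databricks tool.

```python
import configparser
from pathlib import Path


def load_profile(cfg_path, profile="DEFAULT"):
    """Return host and token for a profile, or raise if something is missing."""
    path = Path(cfg_path)
    if not path.is_file():
        raise FileNotFoundError(f"no configuration file found at {path}")

    # interpolation=None so a '%' inside a token is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)

    # configparser exposes [DEFAULT] through defaults(), not has_section()
    if profile == "DEFAULT":
        values = dict(parser.defaults())
    elif parser.has_section(profile):
        values = dict(parser[profile])
    else:
        raise KeyError(f"profile [{profile}] not found in {path}")

    missing = [key for key in ("host", "token") if key not in values]
    if missing:
        raise ValueError(f"profile [{profile}] is missing: {', '.join(missing)}")
    return {"host": values["host"], "token": values["token"]}
```

Calling load_profile(Path.home() / ".databrickscfg") before scripting CLI calls gives a clearer failure than discovering the problem partway through a run.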

GUI

databricks-menu-910x1788.png

Community Edition

  1. Begin with the Community Edition, a public environment for up to 3 users:

    https://databricks.com/try-platform

    You can run basic notebooks without collaboration on its single 15GB cluster with no worker nodes.

    Start your 14-day free trial after you’re familiar with the product:

    • User management, SSO, security
    • Data management
    • Integrations with Scikit-learn, TensorFlow, Keras, Spark, DataLake, MLflow
    • Integrations with BI tools Tableau, Qlik, Looker, Power BI
    • REST APIs
    • Collaboration
    • Scheduler
    • Dashboards
    • Configuring & scaling clusters

  2. Confirm email.

Access

Roles - Persona-based use cases

IAM SSO

There is one metastore of metadata per deployment. Within it there can be several Unity Catalogs for fine-grained access control to data in Delta Lake tables, including user management.

databricks-metastore-3058x2156.png

Unity Catalog also supports impact analysis of data access patterns and lineage.

Delta Sharing shares data with external organizations without copying it, with centralized admin control, including to clean rooms, over a REST protocol.

Data plane: Clusters

Control plane: Web app, Configurations, Notebooks, repos, DBSQL, Cluster manager

Apply Object Security

IntelliJ for Databricks

Go to https://www.jetbrains.com/idea/download/#section=windows and download the Community Edition for Windows. Install it.

  • go mod init sample
  • go mod edit -require github.com/databricks/databricks-sdk-go@latest
  • touch main.go
  • (paste in code function from ???)
  • go mod tidy
  • go mod vendor
  • go run main.go

Install Databricks plugin from https://plugins.jetbrains.com/plugin/11020-databricks

RStudio

  • Get token from Databricks
  • DATABRICKS_TOKEN=personal-access-token
  • DATABRICKS_HOST=workspace url
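
The same two environment variables can be read by any client. As a minimal sketch, assuming a personal access token is sent as a Bearer token and that the clusters list endpoint lives at /api/2.0/clusters/list, this Python code builds (but does not send) a REST request. The function name build_clusters_request is hypothetical.

```python
import os
import urllib.request


def build_clusters_request(env=os.environ):
    """Build a GET request for the clusters list from DATABRICKS_* variables."""
    host = env["DATABRICKS_HOST"].rstrip("/")  # workspace URL, including https://
    token = env["DATABRICKS_TOKEN"]            # personal access token
    url = f"{host}/api/2.0/clusters/list"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
```

Passing the result to urllib.request.urlopen() would send it. A KeyError here means one of the two variables is not set.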

Compute

Clusters

Policies

Runtimes:

  • Data Engineering & ML: Standard, ML

The Lakehouse Photon execution engine and Photon Writer handle Parquet-formatted files 2x faster than Spark instructions, as measured by TPC benchmarks across DBR (Databricks Runtime) versions.

  • Analytics & BI: SQL Analytics, SQL Analytics (Delta Lake), SQL Analytics (Delta Engine)

Cluster Create, List, Start, Stop, Restart, Terminate

Unity Catalog vs. Metastores: Best Practices

  • Catalogs: Least Privilege
  • Connections: Limit visibility, access, Tag connections
  • Business Units: Dedicated sandbox for each unit. Centralize shared data. Discoverability of glossaries/hierarchy. Federate queries

Notebooks

GUI

Repos

Languages/Libraries: Pandas, Scikit-learn, TensorFlow, Keras, Spark, DataLake, MLflow

Keyboard shortcuts: Shift+Option+Space for auto-complete

Execute, Share

Dashboards

databricks-menu-new.png

Networking

Storage

Catalog

Data Pipelines

Data Engineering Workflows: Job Runs, Data Ingestion, Delta Live Tables

Monitoring

Push Logging data

Concurrency Limits

Metrics: Server load distribution

CPU utilization

Cluster Management

Driver logs

Jobs

VIDEO: Jobs are defined within the Workflows menu by specifying Task Name, Type, Source, Path, Cluster, Dependent libraries, Parameters (GB), Notifications, Retries:

databricks-jobs-2674x1888.png
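
The fields set in the Workflows GUI can also be expressed as a job definition in JSON. This Python sketch builds one under stated assumptions: the field names (task_key, notebook_task, existing_cluster_id, depends_on, max_retries, email_notifications, max_concurrent_runs) follow my understanding of the Jobs API 2.1 and should be checked against the API reference, and the notebook paths, job name, and email address are hypothetical.

```python
import json


def notebook_task(task_key, notebook_path, cluster_id,
                  depends_on=(), parameters=None, max_retries=0):
    """Build one task entry; depends_on lists upstream task keys."""
    return {
        "task_key": task_key,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": dict(parameters or {}),
        },
        "existing_cluster_id": cluster_id,
        "depends_on": [{"task_key": key} for key in depends_on],
        "max_retries": max_retries,
    }


def job_settings(name, tasks, notify_on_failure=(), max_concurrent_runs=1):
    """Assemble the job-level settings around a list of tasks."""
    keys = [task["task_key"] for task in tasks]
    if len(keys) != len(set(keys)):
        raise ValueError("task_key values must be unique within a job")
    return {
        "name": name,
        "tasks": list(tasks),
        "email_notifications": {"on_failure": list(notify_on_failure)},
        "max_concurrent_runs": max_concurrent_runs,
    }


# Hypothetical two-task job on the cluster ID used in the CLI examples.
ingest = notebook_task("ingest", "/Shared/hypothetical/ingest",
                       "1122-123456-abc123", max_retries=2)
report = notebook_task("report", "/Shared/hypothetical/report",
                       "1122-123456-abc123", depends_on=["ingest"])
settings = job_settings("nightly-demo", [ingest, report],
                        notify_on_failure=["ops@example.com"])
print(json.dumps(settings, indent=2))
```

Keeping the definition in code lets it be reviewed and versioned alongside the notebooks it runs.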

Job & All-Purpose Cluster ???

Timeout, Concurrency, Schedule (Trigger Type = Continuous)

Config Auto Loader

Query Pipeline Events

Task Dependencies: Ingestion, etc ???
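
Task dependencies form a directed acyclic graph: a task runs only after every task it depends on has finished. As a minimal sketch of that ordering (not how Databricks schedules tasks internally), Python's standard graphlib can compute a valid run order. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return tasks in an order that respects dependencies.

    dependencies maps each task to the set of tasks it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        raise ValueError(f"circular task dependency: {err.args[1]}") from err


# Hypothetical pipeline: ingestion first, then transform, then report.
pipeline = {
    "ingest": set(),
    "transform": {"ingest"},
    "report": {"transform"},
}
print(run_order(pipeline))
```

Checking for cycles before submitting a job catches a misconfigured dependency early.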

View Job History within Data Engineering Job Runs.

Handle Failures, Retries

Queries

databricks-menu-sql.png

SQL Editor, Queries, Dashboards, Alerts, Query History, SQL Warehouses

Query Engine

Libraries

Transform Spark SQL

Catalog Explorer

Security

ACID

Transaction Logs for Time Travel restore
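
Time travel works because the transaction log is an ordered list of commits, each recording which data files were added or removed. Replaying the log up to a given version reconstructs the table as of that version. This Python sketch is a toy model with hypothetical file names, and real Delta Lake actions carry more fields than a bare path.

```python
def table_files_at(log, version):
    """Replay commits 0..version and return the set of live data files."""
    if not 0 <= version < len(log):
        raise IndexError(f"version {version} is not in the log")
    live = set()
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


# Hypothetical log: each inner list is one atomic commit.
log = [
    [{"add": "part-0.parquet"}],                                # v0: initial write
    [{"add": "part-1.parquet"}],                                # v1: append
    [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}],  # v2: rewrite
]

print(sorted(table_files_at(log, 1)))  # table as of version 1
print(sorted(table_files_at(log, 2)))  # latest version
```

Restoring to version 1 amounts to making that earlier file set current again, which is possible only while the removed files still exist in storage.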

Data Governance

Table metadata

Vacuum Garbage Collect
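
Vacuum deletes data files that the current table version no longer references and that are older than a retention threshold (7 days by default in Delta Lake). This Python sketch shows the selection rule only. The file names and timestamps are hypothetical, and the real VACUUM command works from the transaction log and storage listings.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files, live, now, retention=timedelta(days=7)):
    """Return unreferenced files whose timestamp is older than the retention.

    files maps path -> last-modified time; live is the set of paths the
    current table version still references.
    """
    cutoff = now - retention
    return sorted(
        path for path, modified in files.items()
        if path not in live and modified < cutoff
    )


now = datetime(2024, 1, 31)
files = {
    "part-0.parquet": datetime(2024, 1, 1),   # unreferenced and old
    "part-1.parquet": datetime(2024, 1, 2),   # still referenced
    "part-2.parquet": datetime(2024, 1, 29),  # unreferenced, within retention
}
live = {"part-1.parquet"}

print(vacuum_candidates(files, live, now))
```

Once a file is vacuumed, versions that depended on it can no longer be read through time travel, which is why the retention window matters.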

Data quality: Detect Drift

Machine Learning

databricks-ml-1540x729.png

  • BOOK: Practical Machine Learning on Databricks - 244 pages by Debu Sinha November 2023

  • VIDEO: Assimilate Databricks ML Certification By Alfredo Deza and Noah Gift September 2022 0h 58m

Experiments, Features, Models, Serving

Certifications

databricks-menu-ml.png

  • CLASS: Databricks Data Engineer Associate Certification Prep in 2 Weeks

  • 2h 21m VIDEO: Databricks Certified Data Engineer Associate By Alfredo Deza and Noah Gift Publisher: Pragmatic AI Solutions December 2023

Resources

  • BOOK: Distributed Data Systems with Azure Databricks By Alan Bernardo Palacio 4 stars May 2021 414 pages

  • BOOK: Azure Databricks Cookbook By Phani Raj and Vinod Jaiswal Publisher: Packt Publishing September 2021 452 pages

  • BOOK: Optimizing Databricks Workloads By Anirudh Kala, Anshul Bhatnagar and Sarthak Sarbahi Publisher: Packt Publishing December 2021 230 pages

  • BOOK: Business Intelligence with Databricks SQL By Vihag Gupta September 2022 348 pages

  • VIDEO: Doing MLOps with Databricks and MLFlow - Full Course By Alfredo Deza and Noah Gift August 2022 1h 39m covers Spark Dbmlops

  • VIDEO: MLOps Platforms From Zero: Databricks, MLFlow/MLRun/SKLearn By Alfredo Deza and Noah Gift March 2022 2h 26m