Create analytics visualization dashboards pulling from Data Lakes and Delta Lakes (SaaS on Azure and AWS), coding Apache Spark SQL, Python notebooks, low-code AutoML, and MLflow
Overview
This article aims to avoid “salesy” generalizations and instead present a deep yet succinct hands-on tutorial so you become proficient quickly.
NOTE: The content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” highlights hard-won, little-known but significant facts based on my personal research and experience that I haven’t seen elsewhere on the internet.
Databricks, the company
Databricks is headquartered in San Francisco with offices around the world.
Databricks calls itself the “Data + AI” company. Databricks is on a mission to “simplify and democratize data and AI, helping data teams solve the world’s toughest problems.” Databricks boasts hundreds of global partners, including Microsoft, Amazon, Tableau, Informatica, Capgemini, and Booz Allen Hamilton.
With origins in academia and the open-source community, the company was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow. Built on a modern Lakehouse architecture in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.
The company was founded by the original creators of Apache Spark to give the data scientist and data engineer personas a cloud-based, vendor-managed platform to easily run analytics in a scalable manner. Databricks provides a managed Spark cloud service along with a platform for managing the full data analytics lifecycle: Python Jupyter notebooks for interactive data exploration, dashboards for sharing visualizations, and Jobs for scheduling and automating workflows. Added to the platform are libraries for machine learning.
- https://docs.databricks.com/en/index.html
- YouTube channel
- Databricks Essentials
- Intro by the Seattle Data Guy
- Databricks for Data Engineers
Databricks coined the term “Lakehouse” for an architecture that combines a data warehouse and a data lake to offer an open and unified cloud platform for data and AI.
- Apache Spark
- Community Edition
- Competitors
- Architecture
- Samples
- CLI
- Compute
- Dashboards (samples)
- Databricks Platform
- Databricks Runtime
- Networking (VPCs)
- Storage
- Data Pipelines Workflows
- Monitoring
- Jobs
- Queries (SQL, Scala)
- Security
- ACID
- Data Governance
- Certifications
Competitors
Databricks competes with other integrated, cloud-based, multi-region delta lakehouse offerings:
- Snowflake
- Microsoft Fabric
- AWS
- Google Cloud Platform (GCP)
- Fivetran
- Talend
Feature | Traditional Data Warehouse | Data Lake | Delta Lake (Lakehouse)
---|---|---|---
Persona emphasis | Data Engineer | Data Analyst | Citizen
Handles relational data structures | Yes | Yes | Yes
Handles semi-structured data | No | Yes | Yes
Handles unstructured data (pdf, audio, photo, video, etc.) | No | Yes | Yes
Processing | ETL | ELT | ETL & ELT
Streaming | No | Yes | Yes
Architecture
Serverless SQL compute provides elastic scaling and is automatically backed up, patched, and upgraded.
Delta Lake Z-Ordering by eventType skips data blocks that are not needed, reducing I/O and improving performance:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
Samples
Databricks claims as customers more than five thousand organizations worldwide — including Shell, Comcast, CVS Health, HSBC, T-Mobile and Regeneron.
New York City Taxi data 2009-2020+
- https://www.kaggle.com/c/nyc-taxi-trip-duration
- https://www.kaggle.com/datasets/microize/newyork-yellow-taxi-trip-data-2020-2019
- https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets
- https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/demo-data-nyctaxi-in-sql?view=sql-server-ver16
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
SELECT * FROM samples.nyctaxi.trips limit 1000
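In a Python notebook cell, the same sample can be pulled into a DataFrame and summarized. A minimal sketch, assuming the bundled samples catalog is attached and relying on the spark and display objects that Databricks notebooks provide (the column names are assumed from the sample schema):

```python
# Databricks Python notebook cell; `spark` and `display` are injected by the runtime
df = spark.sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1000")

# Quick profile: trip count and average fare by pickup ZIP code
summary = (df.groupBy("pickup_zip")
             .agg({"fare_amount": "avg", "*": "count"})
             .withColumnRenamed("avg(fare_amount)", "avg_fare")
             .withColumnRenamed("count(1)", "trip_count"))

display(summary)  # renders an interactive table/chart picker in the notebook
```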
Apache Spark
CLI
Install the Databricks CLI on your local machine to run commands against your Databricks workspace.
- brew tap databricks/tap; brew install databricks; which databricks; databricks --help
- databricks configure --token ???
- databricks clusters list # IDs
- If you instead see “Error: cannot load Databricks config file: no configuration file found at /Users/wilsonmar/.databrickscfg; please create one first”, create that file (see the sketch after this list).
- databricks clusters get 1122-123456-abc123 | jq -r .name
- databricks clusters start 1122-123456-abc123
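The `databricks configure --token` command writes the profile file that the error above complains about. A minimal sketch of what ~/.databrickscfg contains, with placeholder host and token values:

```
[DEFAULT]
host  = https://adb-1234567890123456.7.azuredatabricks.net
token = dapi0123456789abcdef0123456789abcdef
```

Commands such as `databricks clusters list` then authenticate with the DEFAULT profile unless you pass --profile.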
GUI
Community Edition
Begin with a Community Edition Public environment (for up to 3 users) at https://databricks.com/try-platform
You can run basic notebooks without collaboration on its single 15GB cluster with no worker nodes.
Start your 14-day free trial after you’re familiar with the product:
- User management, SSO, security
- Data management
- Integrations with Scikit-learn, Tensorflow, Keras, Spark, DataLake, MLFlow
- Integrations with BI tools Tableau, Qlik, Looker, PowerBI
- REST APIs
- Collaboration
- Scheduler
- Dashboards
- Configuring & scaling clusters
Confirm email.
Access
Roles - Persona-based use cases
IAM SSO
There is one metastore of metadata per deployment. Within it can be several Unity Catalog catalogs that provide fine-grained access control to data in Delta Lake tables, centralizing user management and supporting impact analysis of data access patterns and lineage.
Delta Sharing shares data with external organizations without copying it, with centralized admin control, extending to clean rooms, over an open REST protocol.
Data plane: Clusters
Control plane: Web app, Configurations, Notebooks, repos, DBSQL, Cluster manager
Apply Object Security
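As a sketch of applying object security with Unity Catalog GRANT statements from a notebook (the catalog, schema, table, and `analysts` group names are made-up examples):

```python
# Run in a notebook attached to a Unity Catalog-enabled cluster or SQL warehouse.
# Unity Catalog uses a three-level namespace: catalog.schema.table.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.default.trips TO `analysts`")

# Review what has been granted
display(spark.sql("SHOW GRANTS ON TABLE main.default.trips"))
```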
IntelliJ for Databricks: go to https://www.jetbrains.com/idea/download/#section=windows, download the Community Edition for Windows, and install it.
- go mod init sample
- go mod edit -require github.com/databricks/databricks-sdk-go@latest
- touch main.go
- (paste in code function from ???)
- go mod tidy
- go mod vendor
- go run main.go
Install Databricks plugin from https://plugins.jetbrains.com/plugin/11020-databricks
RStudio
- Get token from Databricks
- DATABRICKS_TOKEN=personal-access-token
- DATABRICKS_HOST=workspace url
Compute
Clusters
Policies
Runtimes:
- Data Engineering & ML: Standard, ML
The Lakehouse Photon execution engine and Photon Writer handle Parquet-formatted files up to 2x faster than standard Spark execution, as measured by TPC benchmarks across DBR versions.
- Analytics & BI: SQL Analytics, SQL Analytics (Delta Lake), SQL Analytics (Delta Engine)
Cluster Create, List, Start, Stop, Restart, Terminate
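The same lifecycle operations can be scripted with the Databricks SDK for Python. A sketch, assuming `pip install databricks-sdk` and credentials already configured via ~/.databrickscfg or environment variables (the cluster ID is a placeholder):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up DATABRICKS_HOST / DATABRICKS_TOKEN or ~/.databrickscfg

# List clusters with their IDs and current state
for c in w.clusters.list():
    print(c.cluster_id, c.cluster_name, c.state)

# Start a stopped cluster and block until it is running (placeholder ID)
w.clusters.ensure_cluster_is_running("1122-123456-abc123")
```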
Unity Catalog vs. Metastores: Best Practices
- Catalogs: Least Privilege
- Connections: Limit visibility, access, Tag connections
- Business Units: Dedicated sandbox for each unit. Centralize shared data. Discoverability of glossaries/hierarchy.
Federate queries
Notebooks
GUI
Repos
Languages/Libraries: Pandas, Scikit-learn, Tensorflow, Keras, Spark, DataLake, MLFlow
Keyboard shortcuts: Shift+Option+Space for auto-complete
Execute, Share
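As a minimal sketch of how those libraries come together in a single notebook cell (the bundled NYC taxi sample is used; the feature choice is purely illustrative):

```python
from sklearn.linear_model import LinearRegression

# Pull a small sample of the Delta table into pandas for local experimentation
pdf = spark.sql("""
    SELECT trip_distance, fare_amount
    FROM samples.nyctaxi.trips
    LIMIT 10000
""").toPandas()

# Fit a toy model: fare as a function of distance
model = LinearRegression().fit(pdf[["trip_distance"]], pdf["fare_amount"])
print(f"fare ≈ {model.intercept_:.2f} + {model.coef_[0]:.2f} * miles")
```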
Dashboards
(Screenshot: databricks-menu-new.png)
Networking
Storage
Catalog
Data Pipelines
Data Engineering Workflows: Job Runs, Data Ingestion, Delta Live Tables
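A minimal Delta Live Tables pipeline notebook might look like the sketch below; the `dlt` module only resolves when the notebook runs inside a DLT pipeline, and the landing path and table names are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw trip events ingested incrementally with Auto Loader")
def trips_raw():
    return (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/mnt/landing/trips/"))           # hypothetical landing path

@dlt.table(comment="Cleaned trips")
@dlt.expect_or_drop("positive_fare", "fare_amount > 0")  # data-quality expectation
def trips_clean():
    return dlt.read_stream("trips_raw").where(col("trip_distance") > 0)
```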
Monitoring
Push Logging data
Concurrency Limits
Metrics: Server load distribution
CPU utilization
Cluster Management
Driver logs
Jobs
VIDEO: Jobs are defined within the Workflows menu by specifying Task Name, Type, Source, Path, Cluster, Dependent libraries, Parameters (GB), Notifications, and Retries:
Job & All-Purpose Cluster ???
Timeout, Concurrency, Schedule (Trigger Type = Continuous)
Config Auto Loader
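A sketch of an Auto Loader configuration outside of Delta Live Tables (the paths, schema location, and target table are placeholders):

```python
# Incrementally ingest new files as they arrive, tracking progress in a checkpoint
stream = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", "/mnt/checkpoints/trips_schema")  # placeholder
            .load("/mnt/landing/trips_csv/"))                                      # placeholder

(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/trips_bronze")
       .trigger(availableNow=True)   # process available files, then stop (fits a scheduled Job)
       .toTable("bronze.trips"))
```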
Query Pipeline Events
Task Dependencies: Ingestion, etc ???
View Job History within Data Engineering Job Runs.
Handle Failures, Retries
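Those same job settings, including retries, can also be created programmatically with the Databricks SDK for Python. A sketch, where the job name, notebook path, and cluster ID are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-ingest",                          # hypothetical job name
    tasks=[jobs.Task(
        task_key="ingest",
        existing_cluster_id="1122-123456-abc123",   # placeholder cluster ID
        notebook_task=jobs.NotebookTask(notebook_path="/Repos/me/ingest"),
        max_retries=2,                              # retry failed runs twice
    )],
)
print("Created job", job.job_id)
w.jobs.run_now(job_id=job.job_id)
```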
Queries
(Screenshot: databricks-menu-sql.png) The SQL menu includes: SQL Editor, Queries, Dashboards, Alerts, Query History, SQL Warehouses.
Query Engine
Libraries
Transform Spark SQL
Catalog Explorer
Security
ACID
Transaction Logs for Time Travel restore
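The Delta transaction log makes time travel and restore one-liners; a sketch with a hypothetical table name and version:

```python
# Query the table as it existed at an earlier version recorded in the transaction log
previous = spark.sql("SELECT * FROM bronze.trips VERSION AS OF 12")
# ...or as of a point in time:
# spark.sql("SELECT * FROM bronze.trips TIMESTAMP AS OF '2024-01-01'")

# Roll the live table back to that version
spark.sql("RESTORE TABLE bronze.trips TO VERSION AS OF 12")
```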
Data Governance
Table metadata
VACUUM to garbage-collect data files no longer referenced by the transaction log
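A sketch of the corresponding maintenance commands on a hypothetical table; the default 7-day (168-hour) retention preserves recent history for time travel:

```python
# Compact small files, then remove data files no longer referenced by the transaction log
spark.sql("OPTIMIZE bronze.trips")
spark.sql("VACUUM bronze.trips RETAIN 168 HOURS")
```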
Data quality: Detect Drift
Machine Learning
- BOOK: Practical Machine Learning on Databricks by Debu Sinha, November 2023, 244 pages
- VIDEO: Assimilate Databricks ML Certification by Alfredo Deza and Noah Gift, September 2022, 0h 58m
Experiments, Features, Models, Serving
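A minimal MLflow experiment-tracking sketch in a notebook (the ML runtime can autolog most of this; column names are assumed from the bundled sample):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Small pandas sample of the bundled dataset
pdf = spark.sql(
    "SELECT trip_distance, fare_amount FROM samples.nyctaxi.trips LIMIT 10000"
).toPandas()

with mlflow.start_run(run_name="fare-vs-distance"):
    model = LinearRegression().fit(pdf[["trip_distance"]], pdf["fare_amount"])
    mlflow.log_param("features", "trip_distance")
    mlflow.log_metric("r2", model.score(pdf[["trip_distance"]], pdf["fare_amount"]))
    mlflow.sklearn.log_model(model, "model")  # appears under the run in the Experiments UI
```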
Certifications
- CLASS: Databricks Data Engineer Associate Certification Prep in 2 Weeks
- VIDEO: Databricks Certified Data Engineer Associate by Alfredo Deza and Noah Gift, Pragmatic AI Solutions, December 2023, 2h 21m
Resources
- BOOK: Distributed Data Systems with Azure Databricks by Alan Bernardo Palacio, May 2021, 414 pages
- BOOK: Azure Databricks Cookbook by Phani Raj and Vinod Jaiswal, Packt Publishing, September 2021, 452 pages
- BOOK: Optimizing Databricks Workloads by Anirudh Kala, Anshul Bhatnagar, and Sarthak Sarbahi, Packt Publishing, December 2021, 230 pages
- BOOK: Business Intelligence with Databricks SQL by Vihag Gupta, September 2022, 348 pages
- VIDEO: Doing MLOps with Databricks and MLFlow - Full Course by Alfredo Deza and Noah Gift, August 2022, 1h 39m (covers Spark Dbmlops)
- VIDEO: MLOps Platforms From Zero: Databricks, MLFlow/MLRun/SKLearn by Alfredo Deza and Noah Gift, March 2022, 2h 26m