Create analytics visualization dashboards pulling from Data Lakes and Delta Lakes (SaaS on Azure and AWS), coding Apache Spark SQL, Python notebooks, low-code AutoML, and MLflow
Overview
This article aims to avoid “salesy” generalizations and instead present a deep yet succinct hands-on tutorial so you become proficient quickly.
NOTE: The content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” highlights hard-won, little-known but significant facts based on my personal research and experience that I haven’t seen elsewhere on the internet.
Databricks, the company
Databricks is headquartered in San Francisco with offices around the world.
Databricks calls itself the “Data + AI” company. Databricks is on a mission to “simplify and democratize data and AI, helping data teams solve the world’s toughest problems.” Databricks boasts hundreds of global partners, including Microsoft, Amazon, Tableau, Informatica, Capgemini, and Booz Allen Hamilton.
With origins in academia and the open-source community, the company was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow. Built on a modern Lakehouse architecture in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.
The company was founded by the original creators of Apache Spark to give the data scientist and data engineer personas a cloud-based, vendor-managed platform to easily run analytics in a scalable manner. Databricks provides a managed Spark cloud service along with a platform for managing the full data analytics lifecycle: Python Jupyter notebooks for interactive data exploration, dashboards for sharing visualizations, and Jobs for scheduling and automating workflows. Added to the platform are libraries for machine learning.
- https://docs.databricks.com/en/index.html
- YouTube channel
- Databricks Essentials
- Intro by the Seattle Data Guy
- Databricks for Data Engineers
Databricks coined the term “Lakehouse” for an architecture that combines a data warehouse and a data lake to offer an open and unified cloud platform for data and AI.
- Apache Spark
- Community Edition
- Competitors
- Architecture
- Samples
- CLI
- Compute
- Dashboards (samples)
- Databricks Platform
- Databricks Runtime
- Networking (VPCs)
- Storage
- Data Pipelines Workflows
- Monitoring
- Jobs
- Queries (SQL, Scala)
- Security
- ACID
- Data Governance
- Certifications
Competitors
Databricks competes with other integrated, cloud-based, multi-region delta lakehouse offerings:
- Snowflake
- Microsoft Fabric
- AWS
- Google Cloud Platform (GCP)
- Fivetran
- Talend
Feature | Traditional Data Warehouse | Data Lake | Delta Lake (Lakehouse)
---|---|---|---
Persona emphasis | Data Engineer | Data Analyst | Citizen
Handles relational data structures | Yes | Yes | Yes
Handles semi-structured data | No | Yes | Yes
Handles unstructured data (pdf, audio, photo, video, etc.) | No | Yes | Yes
Processing | ETL | ELT | ETL & ELT
Streaming | No | Yes | Yes
Architecture
Serverless SQL compute provides elastic scaling and is automatically backed up, patched, and upgraded.
Delta Lake Z-Ordering by eventType skips data blocks that are not needed, reducing I/O and improving performance:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
Samples
Databricks claims as customers more than five thousand organizations worldwide — including Shell, Comcast, CVS Health, HSBC, T-Mobile and Regeneron.
New York City Taxi data 2009-2020+
- https://www.kaggle.com/c/nyc-taxi-trip-duration
- https://www.kaggle.com/datasets/microize/newyork-yellow-taxi-trip-data-2020-2019
- https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets
- https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/demo-data-nyctaxi-in-sql?view=sql-server-ver16
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
SELECT * FROM samples.nyctaxi.trips limit 1000
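In a Python notebook cell, the same sample can be pulled into a DataFrame and summarized. A minimal sketch, assuming the bundled samples catalog is attached and relying on the spark and display objects that Databricks notebooks provide (the column names are assumed from the sample schema):

```python
# Databricks Python notebook cell; `spark` and `display` are injected by the runtime
df = spark.sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1000")

# Quick profile: trip count and average fare by pickup ZIP code
summary = (df.groupBy("pickup_zip")
             .agg({"fare_amount": "avg", "*": "count"})
             .withColumnRenamed("avg(fare_amount)", "avg_fare")
             .withColumnRenamed("count(1)", "trip_count"))

display(summary)  # renders an interactive table/chart picker in the notebook
```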
Apache Spark
CLI
Install the Databricks CLI on your local machine to run commands against your Databricks workspace.
- brew tap databricks/tap; brew install databricks; which databricks; databricks --help
- databricks configure --token ???
- databricks clusters list # IDs
- If you instead see “Error: cannot load Databricks config file: no configuration file found at /Users/wilsonmar/.databrickscfg; please create one first”, create that file (see the sketch after this list).
- databricks clusters get 1122-123456-abc123 | jq -r .name
- databricks clusters start 1122-123456-abc123
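The `databricks configure --token` command writes the profile file that the error above complains about. A minimal sketch of what ~/.databrickscfg contains, with placeholder host and token values:

```
[DEFAULT]
host  = https://adb-1234567890123456.7.azuredatabricks.net
token = dapi0123456789abcdef0123456789abcdef
```

Commands such as `databricks clusters list` then authenticate with the DEFAULT profile unless you pass --profile.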
GUI
Community Edition
Begin with a Community Edition Public environment (for up to 3 users) at https://databricks.com/try-platform
You can run basic notebooks without collaboration on its single 15GB cluster with no worker nodes.
Start your 14-day free trial after you’re familiar with the product:
- User management, SSO, security
- Data management
- Integrations with Scikit-learn, Tensorflow, Keras, Spark, DataLake, MLFlow
- Integrations with BI tools Tableau, Qlik, Looker, PowerBI
- REST APIs
- Collaboration
- Scheduler
- Dashboards
- Configuring & scaling clusters
Confirm email.
Access
Roles - Persona-based use cases
IAM SSO
There is one metastore of metadata per deployment. Within it can be several Unity Catalog catalogs that provide fine-grained access control to data in Delta Lake tables, centralizing user management and supporting impact analysis of data access patterns and lineage.
Delta Sharing shares data with external organizations without copying it, with centralized admin control, extending to clean rooms, over an open REST protocol.
Data plane: Clusters
Control plane: Web app, Configurations, Notebooks, repos, DBSQL, Cluster manager
Apply Object Security
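As a sketch of applying object security with Unity Catalog GRANT statements from a notebook (the catalog, schema, table, and `analysts` group names are made-up examples):

```python
# Run in a notebook attached to a Unity Catalog-enabled cluster or SQL warehouse.
# Unity Catalog uses a three-level namespace: catalog.schema.table.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.default.trips TO `analysts`")

# Review what has been granted
display(spark.sql("SHOW GRANTS ON TABLE main.default.trips"))
```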
IntelliJ for Databricks: go to https://www.jetbrains.com/idea/download/#section=windows, download the Community Edition for Windows, and install it.
- go mod init sample
- go mod edit -require github.com/databricks/databricks-sdk-go@latest
- touch main.go
- (paste in code function from ???)
- go mod tidy
- go mod vendor
- go run main.go
Install Databricks plugin from https://plugins.jetbrains.com/plugin/11020-databricks
RStudio
- Get token from Databricks
- DATABRICKS_TOKEN=personal-access-token
- DATABRICKS_HOST=workspace url
Compute
Clusters
Policies
Runtimes:
- Data Engineering & ML: Standard, ML
The Lakehouse Photon execution engine and Photon Writer handle Parquet-formatted files up to 2x faster than standard Spark execution, as measured by TPC benchmarks across DBR versions.
- Analytics & BI: SQL Analytics, SQL Analytics (Delta Lake), SQL Analytics (Delta Engine)
Cluster Create, List, Start, Stop, Restart, Terminate
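The same lifecycle operations can be scripted with the Databricks SDK for Python. A sketch, assuming `pip install databricks-sdk` and credentials already configured via ~/.databrickscfg or environment variables (the cluster ID is a placeholder):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up DATABRICKS_HOST / DATABRICKS_TOKEN or ~/.databrickscfg

# List clusters with their IDs and current state
for c in w.clusters.list():
    print(c.cluster_id, c.cluster_name, c.state)

# Start a stopped cluster and block until it is running (placeholder ID)
w.clusters.ensure_cluster_is_running("1122-123456-abc123")
```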
Unity Catalog vs. Metastores: Best Practices
- Catalogs: Least Privilege
- Connections: Limit visibility, access, Tag connections
- Business Units: Dedicated sandbox for each unit. Centralize shared data. Discoverability of glossaries/hierarchy.
Federate queries
Notebooks
GUI
Repos
Languages/Libraries: Pandas, Scikit-learn, Tensorflow, Keras, Spark, DataLake, MLFlow
Keyboard shortcuts: Shift+Option+Space for auto-complete
Execute, Share
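As a minimal sketch of how those libraries come together in a single notebook cell (the bundled NYC taxi sample is used; the feature choice is purely illustrative):

```python
from sklearn.linear_model import LinearRegression

# Pull a small sample of the Delta table into pandas for local experimentation
pdf = spark.sql("""
    SELECT trip_distance, fare_amount
    FROM samples.nyctaxi.trips
    LIMIT 10000
""").toPandas()

# Fit a toy model: fare as a function of distance
model = LinearRegression().fit(pdf[["trip_distance"]], pdf["fare_amount"])
print(f"fare ≈ {model.intercept_:.2f} + {model.coef_[0]:.2f} * miles")
```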
Dashboards
(Screenshot: databricks-menu-new.png)
Networking
Storage
Catalog
Data Pipelines
Data Engineering Workflows: Job Runs, Data Ingestion, Delta Live Tables
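A minimal Delta Live Tables pipeline notebook might look like the sketch below; the `dlt` module only resolves when the notebook runs inside a DLT pipeline, and the landing path and table names are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw trip events ingested incrementally with Auto Loader")
def trips_raw():
    return (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/mnt/landing/trips/"))           # hypothetical landing path

@dlt.table(comment="Cleaned trips")
@dlt.expect_or_drop("positive_fare", "fare_amount > 0")  # data-quality expectation
def trips_clean():
    return dlt.read_stream("trips_raw").where(col("trip_distance") > 0)
```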
Monitoring
Push Logging data
Concurrency Limits
Metrics: Server load distribution
CPU utilization
Cluster Management
Driver logs
Jobs
VIDEO: Jobs are defined within the Workflows menu by specifying Task Name, Type, Source, Path, Cluster, Dependent libraries, Parameters (GB), Notifications, and Retries:
Job & All-Purpose Cluster ???
Timeout, Concurrency, Schedule (Trigger Type = Continuous)
Config Auto Loader
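A sketch of an Auto Loader configuration outside of Delta Live Tables (the paths, schema location, and target table are placeholders):

```python
# Incrementally ingest new files as they arrive, tracking progress in a checkpoint
stream = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", "/mnt/checkpoints/trips_schema")  # placeholder
            .load("/mnt/landing/trips_csv/"))                                      # placeholder

(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/trips_bronze")
       .trigger(availableNow=True)   # process available files, then stop (fits a scheduled Job)
       .toTable("bronze.trips"))
```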
Query Pipeline Events
Task Dependencies: Ingestion, etc ???
View Job History within Data Engineering Job Runs.
Handle Failures, Retries
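Those same job settings, including retries, can also be created programmatically with the Databricks SDK for Python. A sketch, where the job name, notebook path, and cluster ID are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-ingest",                          # hypothetical job name
    tasks=[jobs.Task(
        task_key="ingest",
        existing_cluster_id="1122-123456-abc123",   # placeholder cluster ID
        notebook_task=jobs.NotebookTask(notebook_path="/Repos/me/ingest"),
        max_retries=2,                              # retry failed runs twice
    )],
)
print("Created job", job.job_id)
w.jobs.run_now(job_id=job.job_id)
```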
Queries
(Screenshot: databricks-menu-sql.png) The SQL menu includes: SQL Editor, Queries, Dashboards, Alerts, Query History, SQL Warehouses.
Query Engine
Libraries
Transform Spark SQL
Catalog Explorer
Security
ACID
Transaction Logs for Time Travel restore
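The Delta transaction log makes time travel and restore one-liners; a sketch with a hypothetical table name and version:

```python
# Query the table as it existed at an earlier version recorded in the transaction log
previous = spark.sql("SELECT * FROM bronze.trips VERSION AS OF 12")
# ...or as of a point in time:
# spark.sql("SELECT * FROM bronze.trips TIMESTAMP AS OF '2024-01-01'")

# Roll the live table back to that version
spark.sql("RESTORE TABLE bronze.trips TO VERSION AS OF 12")
```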
Data Governance
Table metadata
VACUUM to garbage-collect data files no longer referenced by the transaction log
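A sketch of the corresponding maintenance commands on a hypothetical table; the default 7-day (168-hour) retention preserves recent history for time travel:

```python
# Compact small files, then remove data files no longer referenced by the transaction log
spark.sql("OPTIMIZE bronze.trips")
spark.sql("VACUUM bronze.trips RETAIN 168 HOURS")
```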
Data quality: Detect Drift
Machine Learning
- BOOK: Practical Machine Learning on Databricks by Debu Sinha, November 2023, 244 pages
- VIDEO: Assimilate Databricks ML Certification by Alfredo Deza and Noah Gift, September 2022, 0h 58m
Experiments, Features, Models, Serving
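A minimal MLflow experiment-tracking sketch in a notebook (the ML runtime can autolog most of this; column names are assumed from the bundled sample):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Small pandas sample of the bundled dataset
pdf = spark.sql(
    "SELECT trip_distance, fare_amount FROM samples.nyctaxi.trips LIMIT 10000"
).toPandas()

with mlflow.start_run(run_name="fare-vs-distance"):
    model = LinearRegression().fit(pdf[["trip_distance"]], pdf["fare_amount"])
    mlflow.log_param("features", "trip_distance")
    mlflow.log_metric("r2", model.score(pdf[["trip_distance"]], pdf["fare_amount"]))
    mlflow.sklearn.log_model(model, "model")  # appears under the run in the Experiments UI
```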
Certifications
- CLASS: Databricks Data Engineer Associate Certification Prep in 2 Weeks
- VIDEO: Databricks Certified Data Engineer Associate by Alfredo Deza and Noah Gift, Pragmatic AI Solutions, December 2023, 2h 21m
Resources
- BOOK: Distributed Data Systems with Azure Databricks by Alan Bernardo Palacio, May 2021, 414 pages
- BOOK: Azure Databricks Cookbook by Phani Raj and Vinod Jaiswal, Packt Publishing, September 2021, 452 pages
- BOOK: Optimizing Databricks Workloads by Anirudh Kala, Anshul Bhatnagar, and Sarthak Sarbahi, Packt Publishing, December 2021, 230 pages
- BOOK: Business Intelligence with Databricks SQL by Vihag Gupta, September 2022, 348 pages
- VIDEO: Doing MLOps with Databricks and MLFlow - Full Course by Alfredo Deza and Noah Gift, August 2022, 1h 39m (covers Spark Dbmlops)
- VIDEO: MLOps Platforms From Zero: Databricks, MLFlow/MLRun/SKLearn by Alfredo Deza and Noah Gift, March 2022, 2h 26m