Databricks – The Master Guide to Databricks Workspaces, Unity Catalog, Metastores, Storage, Compute, and Cost

Who this guide is for

This guide is for people who are asking questions like:

  • What is a Databricks workspace?
  • What is Unity Catalog?
  • What is a metastore?
  • Why do I need both a workspace and a metastore?
  • Where is my data actually stored?
  • Which parts live in Databricks, and which parts live in AWS or Azure?
  • Which things cost money, and how?
  • When should my organization choose serverless or classic compute?

This guide explains the concepts in simple language first, then goes deeper with architecture, storage, cost, ownership, and adoption patterns.


Part 1: The simple mental model

The six most important terms

1. Workspace

A Databricks workspace is the place where users log in and work.

It is the UI and working environment where people:

  • open notebooks
  • run queries
  • create dashboards
  • attach compute
  • collaborate with teammates
  • manage files and code

Think of it as your office.

2. Unity Catalog

Unity Catalog is Databricks’ centralized system for organizing and governing data and AI assets.

It gives you:

  • one shared data namespace
  • permissions and access control
  • governance across workspaces
  • auditing and data sharing support

Think of it as the company rulebook plus directory for data.

3. Metastore

A metastore is the top-level container in Unity Catalog.

It is the root of the data organization hierarchy.

Think of it as the main library index.

4. Catalog

A catalog is a top-level business area inside the metastore.

Examples:

  • finance
  • sales
  • marketing
  • sandbox

Think of a catalog like a major department folder.

5. Schema

A schema is a subfolder inside a catalog.

Examples:

  • raw
  • staging
  • analytics
  • gold

Think of a schema like a subfolder inside a department.

6. Table

A table is an actual data object.

Example:

  • sales.analytics.orders

Think of a table as the actual dataset people query.


One-line summary of each term

  • Workspace = where people work
  • Unity Catalog = the governance system
  • Metastore = the root container in Unity Catalog
  • Catalog = major data area
  • Schema = subfolder in that area
  • Table = actual dataset

Part 2: The most important relationship

A lot of confusion disappears once you separate these two ideas:

  • Workspace = user working environment
  • Metastore = data governance root

A workspace is not the same as a metastore.

A workspace gets attached to a metastore.

That means:

  • users work in the workspace
  • the workspace uses the metastore to know which catalogs, schemas, tables, and permissions exist

Diagram: workspace and metastore relationship

flowchart LR
    U[Users] --> W[Databricks Workspace]
    W --> C[Compute]
    W --> M[Unity Catalog Metastore]
    M --> CAT[Catalog]
    CAT --> SCH[Schema]
    SCH --> TBL[Table]

Key idea

The workspace is where users run things.
The metastore is what gives structure and governance to the data those users see.


Part 3: Why both are needed

Why a workspace is needed

You need a workspace because users need a place to:

  • log in
  • write notebooks
  • run compute
  • create jobs
  • view dashboards
  • collaborate

Without a workspace, there is no user-facing working environment.

Why a metastore is needed

You need a metastore because data needs:

  • a standard namespace
  • permissions
  • centralized control
  • shared definitions across teams and workspaces

Without a metastore, users may still have compute and notebooks, but data governance becomes fragmented: each workspace ends up managing its own table definitions and permissions.

Very simple analogy

Workspace is the office

It has:

  • desks
  • screens
  • tools
  • people working

Metastore is the central company library catalog

It tells you:

  • what books exist
  • what sections exist
  • who can read them
  • how everything is organized

So:

  • office != library catalog
  • but the office uses the library catalog

Part 4: The full hierarchy

Data hierarchy in Unity Catalog

flowchart TD
    MS[Metastore]
    MS --> C1[Catalog: finance]
    MS --> C2[Catalog: sales]
    C1 --> S1[Schema: raw]
    C1 --> S2[Schema: analytics]
    C2 --> S3[Schema: raw]
    C2 --> S4[Schema: analytics]
    S2 --> T1[Table: invoices]
    S2 --> T2[View: monthly_summary]
    S4 --> T3[Table: orders]
    S4 --> T4[Volume: documents]

Example names

  • metastore: company-us-east-1
  • catalog: sales
  • schema: analytics
  • table: orders

Full table name:

  • sales.analytics.orders

That full name does not include the metastore name in everyday SQL usage.
The metastore sits above the catalog as the governance root.
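
The three-level naming rule can be sketched in a few lines of Python. This is only an illustration of how a full name breaks down into catalog, schema, and table; the `TableName` type and `parse_table_name` function are hypothetical helpers, not part of any Databricks library, and the sketch ignores backtick-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three levels that appear in everyday SQL (no metastore name)."""
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("sales.analytics.orders")
print(name.catalog, name.schema, name.table)
```

Note that the metastore never appears in the parsed name: it is implied by whichever metastore the workspace is attached to.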


Part 5: Where things are actually stored

This is where many people get lost.

Important distinction

The metastore stores metadata and governance

It stores things like:

  • object definitions
  • permissions
  • governance relationships
  • references to managed or external data

The metastore does not store all the raw data bytes itself

The actual data files for tables and volumes are stored in object storage.

That storage can be:

  • Databricks default storage
  • AWS S3
  • Azure Data Lake Storage
  • Google Cloud Storage

depending on setup.

Diagram: metadata versus actual data

flowchart LR
    W[Workspace] --> M[Metastore]
    M --> META[Metadata and Permissions]
    M --> CAT[Catalog/Schema/Table Definitions]
    M --> ST[Storage Location References]
    ST --> S3[AWS S3 or ADLS or GCS or Databricks Default Storage]
    S3 --> FILES[Actual Table and Volume Files]

What is metadata?

Metadata means information about the data, such as:

  • table name
  • schema columns
  • owner
  • permissions
  • table location
  • whether it is managed or external

What is actual data?

Actual data means:

  • Parquet data files
  • Delta table files (Parquet data files plus a transaction log)
  • Iceberg table files (data files plus metadata files)
  • documents in volumes
  • data files in cloud object storage
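
To make the split concrete, here is a rough Python sketch of what a metadata record might look like next to the files it points at. The `TableMetadata` class, the bucket path, and the file names are all made up for illustration; this is not how Unity Catalog stores things internally.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """What a metastore-like service records about a table (no data bytes)."""
    full_name: str
    owner: str
    columns: dict[str, str]
    table_type: str          # "MANAGED" or "EXTERNAL"
    storage_location: str    # a reference to object storage, not the data itself
    grants: dict[str, set[str]] = field(default_factory=dict)


# Hypothetical stand-in for object storage: path -> data file names.
object_storage = {
    "s3://example-bucket/sales/analytics/orders": [
        "part-00000.parquet",
        "part-00001.parquet",
    ],
}

orders = TableMetadata(
    full_name="sales.analytics.orders",
    owner="data-platform",
    columns={"order_id": "BIGINT", "amount": "DECIMAL(10,2)"},
    table_type="EXTERNAL",
    storage_location="s3://example-bucket/sales/analytics/orders",
    grants={"analysts": {"SELECT"}},
)

# The metadata only points at the files; the files live in storage.
data_files = object_storage[orders.storage_location]
print(len(data_files), "data files behind", orders.full_name)
```

The record holds names, columns, owner, permissions, and a location. The Parquet files themselves are found only by following that location into storage.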

Part 6: Managed versus external data

This is one of the most useful ideas in Databricks.

Managed tables

A managed table means Databricks manages the table’s storage location based on the managed storage location configured at the metastore, catalog, or schema level.

Use managed tables when:

  • you want simpler lifecycle management
  • you want stronger governance
  • you want the easiest Databricks-native experience

External tables

An external table means the data already exists in your cloud storage, and you register it in Unity Catalog.

Use external tables when:

  • you already have data in S3 or ADLS
  • multiple systems share the same files
  • you want storage independence
  • your data lake existed before Databricks

Diagram: managed versus external

flowchart TD
    UC[Unity Catalog]
    UC --> MT[Managed Table]
    UC --> ET[External Table]

    MT --> ML[Managed Storage Location]
    ML --> OBJ1[Cloud Object Storage Managed by Databricks Rules]

    ET --> EX[External Location]
    EX --> OBJ2[Existing S3 or ADLS or GCS Path]

Rule of thumb

  • For greenfield adoption, managed tables are often simpler
  • For existing enterprise lake storage, external tables are common
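
Because a managed storage location can be set at the metastore, catalog, or schema level, it helps to see how one location might be picked for a new managed table. The sketch below assumes the most specific level wins (schema first, then catalog, then metastore); treat that order as an assumption to confirm against the Databricks documentation. The function name and the example paths are hypothetical.

```python
from typing import Optional


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> str:
    """Return the most specific configured managed storage location.

    Assumed precedence for this sketch: schema, then catalog, then metastore.
    """
    for location in (schema_location, catalog_location, metastore_location):
        if location:
            return location
    raise ValueError("no managed storage location configured at any level")


# No schema-level location, so the catalog-level one is used.
print(resolve_managed_location(
    None,
    "s3://example-bucket/finance",
    "s3://example-bucket/metastore-root",
))
```

With an external table there is nothing to resolve: the path is whatever existing location you registered.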

Part 7: What is created where

Workspace: where it is created

A workspace is created in the Databricks account layer.

Depending on the cloud, the workspace is one of two types:

  • classic workspace
  • serverless workspace

A workspace is associated with a region and cloud platform.

Metastore: where it is created

A metastore is created at the Databricks account level and linked to one or more workspaces in the same region.

You typically create:

  • one metastore per region

not:

  • one metastore per user
  • one metastore per notebook
  • one metastore per team unless there is a strong isolation reason

Catalog: where it is created

Catalogs are created inside the metastore.

Examples:

  • finance
  • sales
  • sandbox
  • ml

Schema: where it is created

Schemas are created inside a catalog.

Examples:

  • raw
  • curated
  • analytics

Table: where it is created

Tables are created inside a schema.

Examples:

  • finance.analytics.invoices
  • sales.raw.orders

Part 8: Why workspace creation asks about an AWS account or serverless, but metastore creation does not

This is one of the most important points.

Workspace creation asks about an AWS account or serverless because it is about infrastructure and the compute model

When creating a workspace, Databricks needs to know:

  • where the workspace runs
  • whether the environment is classic or serverless
  • who owns more of the infrastructure
  • how root/workspace/default storage should behave

That is why AWS versus serverless matters at workspace creation.

Metastore creation does not ask about an AWS account or serverless in the same way because the metastore is not compute

Metastore creation is about:

  • governance
  • metadata
  • namespace
  • storage references for managed objects

So the metastore is about how data is governed, not how compute is provisioned.


Part 9: Which parts live in Databricks versus AWS or Azure

This depends on whether you use serverless or classic patterns.

Simplified ownership model

Usually in Databricks-managed space

  • serverless compute
  • serverless workspace runtime environment
  • Databricks control plane
  • default storage features
  • metadata services and governance services

Usually in customer cloud account

  • classic compute VMs
  • customer-managed S3 or ADLS storage
  • existing data lake storage
  • networking setup for classic workspace deployments

Diagram: ownership split

flowchart LR
    subgraph Databricks_Managed[Databricks Managed]
        DP[Control Plane]
        SC[Serverless Compute]
        DS[Default Storage]
        UC[Unity Catalog Services]
    end

    subgraph Customer_Cloud[Customer Cloud Account]
        CC[Classic Compute Resources]
        CS[Customer Object Storage]
        NW[Customer Networking]
    end

    W[Workspace] --> DP
    W --> SC
    W --> CC
    M[Metastore] --> UC
    M --> DS
    M --> CS

AWS examples

Can live in Databricks

  • serverless notebook compute
  • serverless jobs compute
  • serverless SQL warehouse infrastructure
  • default storage for serverless workspace and default catalog

Can live in AWS account

  • classic all-purpose clusters
  • classic job clusters
  • S3 buckets for external or managed storage
  • IAM roles and networking for customer-managed storage access

Azure examples

Can live in Databricks

  • serverless compute
  • Databricks-managed services
  • default storage scenarios

Can live in Azure subscription

  • classic compute resources
  • ADLS storage for managed or external data
  • managed identities or service principals for storage access

Part 10: Cost model – what costs money and how

First principle

Different parts of Databricks are billed in different ways, and some are not billed directly at all.

1. Workspace

Creating a workspace by itself is not usually the main cost driver.
The main costs come from what you use inside it.

Typical cost drivers connected to workspaces:

  • compute usage
  • storage usage
  • network and cloud services
  • premium features depending on plan

2. Metastore

A metastore itself is generally not the big cost driver.

A metastore mainly represents governance and metadata organization.

Costs appear when you use:

  • managed storage
  • cloud storage
  • compute that reads and writes data
  • serverless or classic workloads against that data

3. Compute

Compute is usually the biggest cost driver.

Serverless compute

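You pay Databricks for usage, measured in DBUs (Databricks Units). The serverless DBU rate generally covers the underlying cloud machines, so there is no separate VM bill from your cloud provider for that compute.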

Examples:

  • serverless notebooks
  • serverless jobs
  • serverless SQL warehouses

Classic compute

You pay for:

  • Databricks DBUs
  • cloud infrastructure such as EC2 or Azure VMs
  • storage and networking around those compute resources
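
The difference between the two billing shapes can be written as simple arithmetic. This is a sketch only: every rate and quantity below is invented for illustration and says nothing about real Databricks or cloud prices. It also leaves out storage and networking.

```python
def serverless_cost(dbus: float, dbu_rate: float) -> float:
    """Serverless: one bill from Databricks, driven by DBUs consumed."""
    return dbus * dbu_rate


def classic_cost(
    dbus: float, dbu_rate: float, vm_hours: float, vm_hourly_rate: float
) -> float:
    """Classic: Databricks DBUs plus the cloud provider's VM charges."""
    return dbus * dbu_rate + vm_hours * vm_hourly_rate


# Hypothetical numbers, chosen only to show the two formulas.
print(serverless_cost(dbus=10, dbu_rate=0.5))
print(classic_cost(dbus=10, dbu_rate=0.25, vm_hours=4, vm_hourly_rate=0.5))
```

The point is the shape of the formula, not which one comes out cheaper: classic has two terms from two vendors, serverless has one.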

4. Storage

Storage costs depend on where the data lives.

Examples:

  • S3 charges in AWS
  • ADLS charges in Azure
  • Databricks default storage charges, where default storage is used (for example, serverless workspace defaults)

5. Data transfer and networking

Depending on setup, cloud networking and data transfer can also cost money.


Part 11: Cost by component

Cost table in plain English

Component                   | Main purpose                   | Usually costs by itself?        | Main cost type
----------------------------|--------------------------------|---------------------------------|------------------------------------
Workspace                   | User environment               | Usually not the main cost       | Platform usage around workspace
Unity Catalog               | Governance system              | Not usually the main cost       | Indirect through storage and usage
Metastore                   | Top-level governance root      | Usually low direct cost concern | Indirect through data and compute
Catalog                     | Organizing data                | No meaningful direct cost alone | None by itself
Schema                      | Organizing data                | No meaningful direct cost alone | None by itself
Table                       | Actual data object             | Yes, indirectly                 | Storage + compute to read/write
Serverless notebook compute | Interactive notebook execution | Yes                             | Serverless DBU usage
SQL warehouse               | SQL analytics compute          | Yes                             | SQL/serverless usage
Classic cluster             | Customer-cloud compute         | Yes                             | DBU + cloud VM/infrastructure
Storage location            | Holds data files               | Yes                             | S3/ADLS/GCS/default storage

Part 12: The three deployment patterns every organization should understand

Pattern A: Serverless-first organization

What it means

The organization prefers Databricks-managed serverless compute for notebooks, jobs, and SQL.

Good for

  • fast setup
  • low infrastructure burden
  • internal training
  • analytics teams that want simplicity
  • organizations with limited cloud-infra admin support

Typical characteristics

  • serverless workspace
  • Unity Catalog enabled
  • default storage for default catalog
  • optional customer cloud storage for additional catalogs

Pros

  • fastest time to value
  • minimal infra management
  • simpler operations
  • easier for many users

Cons

  • less low-level infrastructure control
  • some organizations prefer more customer-owned storage and networking
  • advanced customization may push teams toward classic resources

Pattern B: Classic customer-cloud-heavy organization

What it means

The organization uses classic compute in its own cloud account and stores data in customer-managed storage.

Good for

  • large enterprise data lake already exists
  • strong cloud platform team
  • strict networking requirements
  • high customization needs

Typical characteristics

  • classic workspace
  • Unity Catalog enabled
  • S3 or ADLS used heavily
  • external locations and customer-owned storage patterns

Pros

  • more control
  • easier alignment with existing cloud architecture
  • fits established enterprise landing zones

Cons

  • more setup complexity
  • more operations burden
  • slower onboarding for new teams

Pattern C: Hybrid organization

What it means

The organization uses both:

  • serverless for many interactive workloads
  • classic compute for special workloads
  • customer cloud storage for durable enterprise data

Good for

  • most medium and large organizations
  • teams adopting gradually
  • mixed analytics and engineering workloads

Pros

  • balance of simplicity and control
  • practical migration path
  • good for phased modernization

Cons

  • governance and FinOps must be clear
  • architecture becomes more complex if standards are weak

Part 13: A very practical “who does what” view

Account admin

Usually responsible for:

  • creating workspaces
  • creating or assigning metastores
  • enabling Unity Catalog
  • setting broad governance patterns

Workspace admin

Usually responsible for:

  • workspace-level permissions
  • compute access and policies
  • serverless usage policies
  • user onboarding inside the workspace

Data platform team

Usually responsible for:

  • storage design
  • external locations
  • catalog strategy
  • naming conventions
  • security and permissions

Data users

Usually responsible for:

  • creating notebooks
  • running queries
  • using approved catalogs/schemas
  • building dashboards and pipelines

Part 14: What gets shared across workspaces

This is a very important reason Unity Catalog exists.

If multiple workspaces use the same metastore

Then they can share the same:

  • catalogs
  • schemas
  • tables
  • volumes
  • permissions model

This means you might have:

  • Dev workspace
  • Test workspace
  • Prod workspace

all attached to the same metastore, or to separate metastores depending on your governance design.

Diagram: multiple workspaces sharing one metastore

flowchart TD
    W1[Dev Workspace] --> M[Regional Metastore]
    W2[Test Workspace] --> M
    W3[Prod Workspace] --> M

    M --> CAT1[Catalog: finance]
    M --> CAT2[Catalog: sales]
    M --> CAT3[Catalog: sandbox]

When to share one metastore

Use one metastore across multiple workspaces when:

  • they are in the same region
  • you want shared governance
  • teams need a common namespace
  • access control can be handled through permissions

When separate metastores may make sense

Use separate metastores when:

  • regions differ
  • legal or residency requirements differ
  • business units require stronger separation
  • platform governance intentionally isolates environments
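
The sharing rule can be modeled in a few lines. The sketch below is a toy model, not a Databricks API: the `Metastore` and `Workspace` classes and the `attach` function are hypothetical, and the only rule they enforce is the same-region requirement described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metastore:
    name: str
    region: str


@dataclass
class Workspace:
    name: str
    region: str
    metastore: Optional[Metastore] = None


def attach(workspace: Workspace, metastore: Metastore) -> None:
    """Attach a workspace to a metastore, enforcing the same-region rule."""
    if workspace.region != metastore.region:
        raise ValueError(
            f"{workspace.name} is in {workspace.region}, "
            f"but {metastore.name} is in {metastore.region}"
        )
    workspace.metastore = metastore


regional = Metastore("company-us-east-1", "us-east-1")
workspaces = [Workspace(n, "us-east-1") for n in ("dev", "test", "prod")]
for ws in workspaces:
    attach(ws, regional)

# All three workspaces now see the same governance root.
print(all(ws.metastore is regional for ws in workspaces))
```

A workspace in another region would be rejected by `attach`, which is the modeled version of "regions differ, so use a separate metastore".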

Part 15: Where to create what in real life

Workspace creation checklist

Create a workspace when:

  • a new team needs its own working environment
  • you want separate admin boundaries
  • you want separate dev/test/prod environments
  • you need a new regional deployment

Metastore creation checklist

Create a metastore when:

  • a region does not yet have one
  • you intentionally need a separate governance root
  • legal or organization boundaries require isolation

Do not create a new metastore just because:

  • a new analyst joined
  • a new notebook is created
  • a new project starts
  • a new schema is needed

Catalog creation checklist

Create a catalog when:

  • you need a major logical data domain
  • you want department-level ownership
  • you need separate storage or governance boundaries

Examples:

  • finance
  • sales
  • marketing
  • sandbox

Schema creation checklist

Create a schema when:

  • you want a sub-domain inside a catalog
  • you want lifecycle stages like raw/staging/analytics
  • you want a team-specific area under a catalog

Table creation checklist

Create a table when:

  • you have actual structured data to manage and query

Part 16: Where the data is stored in serverless versus classic setups

Serverless-first example

flowchart LR
    U[User] --> W[Serverless Workspace]
    W --> SN[Serverless Notebook Compute]
    W --> M[Metastore]
    M --> C[Default Catalog]
    C --> T[Managed Table]
    T --> DS[Databricks Default Storage or Customer Cloud Storage]

Typical behavior

  • workspace is serverless
  • compute is Databricks-managed
  • default catalog may use default storage
  • additional catalogs may use customer cloud storage

Classic AWS example

flowchart LR
    U[User] --> W[Classic Workspace]
    W --> CL[Classic Compute in AWS Account]
    W --> M[Metastore]
    M --> C[Catalog]
    C --> T[Managed or External Table]
    T --> S3[S3 in Customer AWS Account]

Typical behavior

  • workspace exists in Databricks environment
  • classic compute runs in customer AWS account
  • data commonly lives in S3
  • metastore governs access to that data

Azure example

flowchart LR
    U[User] --> W[Azure Databricks Workspace]
    W --> CL[Classic or Serverless Compute]
    W --> M[Metastore]
    M --> C[Catalog]
    C --> T[Managed or External Table]
    T --> ADLS[Azure Data Lake Storage]

Part 17: Common misunderstandings and the correct version

Misunderstanding 1

“Workspace stores everything, so why do I need metastore?”

Correct version

Workspace stores the working environment and some workspace assets.
Unity Catalog metastore governs the enterprise data namespace and permissions.

Misunderstanding 2

“If I create a workspace, a separate metastore must be created for it.”

Correct version

Not always.
A workspace is often assigned to an existing regional metastore.

Misunderstanding 3

“If I use serverless, then Unity Catalog is not needed.”

Correct version

Serverless is a compute model.
Unity Catalog is the governance model.
They solve different problems.

Misunderstanding 4

“Metastore contains all actual data files.”

Correct version

Metastore contains metadata and governance.
Actual data files live in object storage.

Misunderstanding 5

“If I name a usage policy 5usd, Databricks will stop at 5 USD.”

Correct version

A serverless usage policy is mainly for attribution and tagging, not an automatic hard budget cap by default.
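
One way to picture "attribution, not a cap" is a small report over tagged usage. The records, policy names, and threshold below are hypothetical, and the code does not call any Databricks API. It only shows that a tag lets you group and flag spend after the fact.

```python
from collections import defaultdict

# Hypothetical usage records: (policy tag, cost in USD).
usage_records = [
    ("5usd", 3.0),
    ("5usd", 4.5),
    ("team-finance", 2.0),
]


def cost_by_policy(records):
    """Sum cost per policy tag. This is attribution only."""
    totals = defaultdict(float)
    for policy, cost in records:
        totals[policy] += cost
    return dict(totals)


def over_threshold(totals, thresholds):
    """List policies whose spend exceeds a threshold. Reports; stops nothing."""
    return sorted(
        policy
        for policy, total in totals.items()
        if total > thresholds.get(policy, float("inf"))
    )


totals = cost_by_policy(usage_records)
print(totals)
print(over_threshold(totals, {"5usd": 5.0}))
```

In this toy data the policy named 5usd has already accumulated more than 5 USD. The name did not stop anything; the report only makes the overspend visible.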


Part 18: Cost-conscious learning guide

If you are learning Databricks and want low cost, this is the safest path.

For cheapest learning

Use:

  • one workspace already provided by your org
  • the existing metastore already assigned
  • a sandbox catalog or schema if allowed
  • tiny sample datasets
  • minimal notebook runtime

Avoid creating:

  • unnecessary workspaces
  • unnecessary metastores
  • extra SQL warehouses
  • large classic clusters

Prefer for learning

  • serverless notebook with tiny sample data
  • very short sessions
  • notebook SQL instead of starting many tools

Avoid for learning

  • large data scans
  • long-running notebooks
  • large warehouses
  • repeated run-all sessions

Part 19: Adoption guide for organizations

Stage 1: Small team adoption

Typical pattern

  • one workspace
  • one regional metastore
  • one or two catalogs
  • mostly serverless or personal compute
  • small governance model

Good catalog design

  • sandbox
  • shared
  • analytics

Why this works

  • simple
  • low friction
  • fast onboarding

Stage 2: Department adoption

Typical pattern

  • dev and prod workspaces
  • shared regional metastore
  • business-domain catalogs
  • stronger permissions
  • serverless plus some classic jobs

Good catalog design

  • finance
  • sales
  • marketing
  • ml

Why this works

  • separates ownership by business area
  • still keeps centralized governance

Stage 3: Enterprise platform adoption

Typical pattern

  • multiple workspaces by environment and team
  • one metastore per region
  • clear storage architecture
  • managed tables for some workloads, external tables for others
  • standardized compute policies
  • strong FinOps and security practices

Good catalog design

Based on data domain, environment, regulatory boundary, or platform standards.

Why this works

  • scalable governance
  • supports multiple teams
  • avoids chaos from workspace-by-workspace data silos

Part 20: Recommended decision guide

Choose serverless workspace when

  • you want fast setup
  • you want less infrastructure work
  • your workloads fit serverless-supported patterns
  • you want easy training or sandbox environments

Choose classic-heavy setup when

  • you need more infrastructure control
  • your org already has a strong cloud landing zone
  • you require customer-owned networking and storage patterns
  • special workloads need custom compute behavior

Choose one metastore per region when

  • you want standard Databricks governance practice
  • multiple workspaces in that region should share the same namespace

Create a new catalog when

  • a major data domain needs separation
  • ownership needs to be clear
  • storage or governance boundaries differ

Create a new schema when

  • you need a logical sub-area under a catalog
  • raw/curated/analytics separation is needed

Create a new table when

  • you have actual data to store or register

Part 21: Beginner tutorial path

Tutorial 1: Understand the layers

Answer these five questions in your own environment:

  1. What workspace am I using?
  2. Is Unity Catalog enabled?
  3. Which metastore is attached?
  4. Which catalogs exist?
  5. Which compute types can I use?
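
Some of these questions can be answered with SQL, others only from the workspace UI. A minimal sketch of that split is below. The SQL statements (`SELECT current_metastore()`, `SHOW CATALOGS`) are standard Databricks SQL, but the mapping itself and the helper function are just an illustration.

```python
# Each tutorial question paired with a SQL statement that may help answer it.
# None means the answer comes from the workspace UI rather than from SQL.
DISCOVERY_QUERIES = {
    "What workspace am I using?": None,
    "Which metastore is attached?": "SELECT current_metastore()",
    "Which catalogs exist?": "SHOW CATALOGS",
    "Which compute types can I use?": None,
}


def runnable_statements(queries):
    """Return only the questions that have a SQL statement behind them."""
    return [sql for sql in queries.values() if sql is not None]


for statement in runnable_statements(DISCOVERY_QUERIES):
    print(statement)
```

In a Databricks notebook you could pass each statement to `spark.sql(statement).show()`, assuming the notebook provides the usual `spark` session and is attached to Unity Catalog-enabled compute.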

Tutorial 2: Find the data hierarchy

Try to identify:

  • one catalog
  • one schema
  • one table

Example:

  • catalog = samples
  • schema = nyctaxi
  • table = trips
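
If you prefer to explore the hierarchy with SQL, the statements follow the same three levels. The helper below only builds the statement text; its name and the simple identifier check are assumptions of this sketch, and it does not handle quoted identifiers.

```python
import re

# Plain identifiers only; anything else is rejected rather than quoted.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name: str) -> str:
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def exploration_queries(catalog: str, schema: str, table: str) -> list[str]:
    """Build one statement per level: schemas, tables, then a small sample."""
    c, s, t = _check(catalog), _check(schema), _check(table)
    return [
        f"SHOW SCHEMAS IN {c}",
        f"SHOW TABLES IN {c}.{s}",
        f"SELECT * FROM {c}.{s}.{t} LIMIT 5",
    ]


for statement in exploration_queries("samples", "nyctaxi", "trips"):
    print(statement)
```

The `LIMIT 5` keeps the sample query tiny, which fits the low-cost learning advice elsewhere in this guide.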

Tutorial 3: Create a simple sandbox structure

If your permissions allow it:

  • create catalog sandbox
  • create schema rajesh
  • create a small test table
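
The three steps map to three SQL statements, created top-down. The sketch below only generates the text. The table name `test_table` and its two columns are made up for the example, and whether the statements succeed depends on your permissions and on how your metastore's storage is configured.

```python
def sandbox_ddl(catalog: str, schema: str, table: str) -> list[str]:
    """Build CREATE statements in hierarchy order: catalog, schema, table."""
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Hypothetical columns, just enough to have something to query.
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} (id INT, note STRING)",
    ]


for statement in sandbox_ddl("sandbox", "rajesh", "test_table"):
    print(statement)
```

The order matters: a schema needs its catalog to exist first, and a table needs its schema.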

Tutorial 4: Learn managed versus external

Practice with:

  • one managed table
  • one external table

Observe:

  • where each one points
  • who controls the storage path
  • how permissions are applied
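
In SQL, the visible difference between the two kinds of table is usually a `LOCATION` clause. The sketch below builds both statements plus a `DESCRIBE EXTENDED` to inspect the result. The table names, columns, and the S3 path are placeholders, and an external path generally has to be covered by an external location you are allowed to use.

```python
def managed_table_ddl(full_name: str) -> str:
    """No LOCATION: Databricks picks the path from the managed storage location."""
    return f"CREATE TABLE IF NOT EXISTS {full_name} (id INT, note STRING)"


def external_table_ddl(full_name: str, path: str) -> str:
    """Explicit LOCATION: the table is registered over a path you control."""
    return (
        f"CREATE TABLE IF NOT EXISTS {full_name} (id INT, note STRING) "
        f"LOCATION '{path}'"
    )


def inspect_ddl(full_name: str) -> str:
    """DESCRIBE EXTENDED output includes the table type and its location."""
    return f"DESCRIBE EXTENDED {full_name}"


print(managed_table_ddl("sandbox.rajesh.managed_demo"))
print(external_table_ddl(
    "sandbox.rajesh.external_demo",
    "s3://example-bucket/demo/external_demo",  # placeholder path
))
print(inspect_ddl("sandbox.rajesh.managed_demo"))
```

Running the describe statement on each table is a quick way to answer the "where does it point" and "who controls the path" questions above.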

Tutorial 5: Understand cost practically

Run:

  • one tiny serverless notebook query
  • one tiny SQL warehouse query

Then compare:

  • which compute was used
  • which billing view records it
  • what tags or usage policies appeared

Part 22: A complete end-to-end example

Imagine a company called Acme.

Their setup

  • Region: us-east-1
  • Workspaces: dev, prod
  • One regional metastore
  • Catalogs: finance, sales, sandbox
  • Schemas under sales: raw, analytics
  • Table: sales.analytics.orders

How it works

  1. Users log into the dev workspace.
  2. They attach a notebook to serverless compute.
  3. The workspace is already linked to the regional metastore.
  4. The metastore exposes the sales catalog.
  5. Inside that catalog, they query sales.analytics.orders.
  6. Unity Catalog checks permissions.
  7. The actual table files are read from object storage.
  8. Compute cost is generated by the serverless notebook run.
  9. Storage cost is generated by the cloud storage used for the table files.
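
Steps 6 to 8 can be traced in a toy simulation. Everything here is a stand-in: the grants, the storage path, the group names, and the DBU figures are invented, and real permission checks happen inside Unity Catalog rather than in your code.

```python
# Hypothetical in-memory stand-ins for what the metastore records.
GRANTS = {("analysts", "sales.analytics.orders"): {"SELECT"}}
LOCATIONS = {
    "sales.analytics.orders": "s3://example-bucket/sales/analytics/orders",
}


def run_query(group: str, table: str, dbus: float, dbu_rate: float) -> dict:
    """Walk steps 6 to 8: check permission, locate files, price the compute."""
    if "SELECT" not in GRANTS.get((group, table), set()):
        raise PermissionError(f"{group} may not SELECT from {table}")
    return {
        "read_from": LOCATIONS[table],    # step 7: files come from object storage
        "compute_cost": dbus * dbu_rate,  # step 8: cost of this serverless run
    }


result = run_query("analysts", "sales.analytics.orders", dbus=2, dbu_rate=0.5)
print(result)
```

The governance check comes before any file is read, and the compute cost is tied to the run, not to the table.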

Why this is powerful

  • workspace gives user experience
  • compute gives execution
  • metastore gives governance
  • storage gives persistence

All four layers work together.


Part 23: The shortest version possible

If you remember only this, remember this:

The core model

  • Workspace = where users work
  • Compute = where code runs
  • Unity Catalog = the governance system
  • Metastore = the root of that governance system
  • Catalog / Schema / Table = the data organization hierarchy
  • Storage = where the actual files live

The cost model

  • governance objects themselves are usually not the main cost
  • compute and storage are the real cost drivers

The cloud model

  • serverless = more Databricks-managed
  • classic = more customer-cloud-managed
  • data can live in Databricks default storage or in customer cloud storage depending on design

Part 24: Final recommended best practices

For individuals learning Databricks

  • do not create new metastores unless required
  • use the existing workspace and metastore
  • work in a sandbox catalog/schema
  • use small datasets and short notebook sessions

For small teams

  • keep one regional metastore
  • define a simple catalog strategy early
  • use serverless first where possible
  • avoid over-engineering storage on day one

For enterprises

  • standardize catalog naming
  • design storage intentionally
  • document when to use managed versus external
  • treat one metastore per region as the default starting point
  • use multiple workspaces where admin boundaries or lifecycle differences are needed
  • use FinOps tagging and usage monitoring early

Final summary

Databricks has several layers that work together:

  1. Workspace gives users a place to work.
  2. Compute runs notebooks, jobs, and queries.
  3. Unity Catalog governs data and AI assets.
  4. Metastore is the top-level root of Unity Catalog.
  5. Catalogs, schemas, and tables organize data.
  6. Storage holds the actual data files.

When people get confused, it is usually because they mix up:

  • working environment
  • governance
  • storage
  • compute
  • cloud ownership

Once you keep those separate, the whole Databricks architecture becomes much easier to understand.


One-page cheat sheet

What it is

  • Workspace = office
  • Compute = engine
  • Unity Catalog = rulebook
  • Metastore = root directory
  • Catalog = department folder
  • Schema = subfolder
  • Table = dataset

Where it lives

  • Workspace = Databricks environment
  • Metastore = Databricks account-level governance object
  • Table files = object storage
  • Compute = Databricks-managed serverless or customer-cloud classic

What costs money

  • compute
  • storage
  • networking
  • some platform usage

What usually has no meaningful direct cost by itself

  • catalog
  • schema
  • metastore object itself
