Databricks

Posted on April 5, 2026 | Updated June 13, 2026 | by rajeshkumar

Who this guide is for

This guide is for people who are asking questions like:

What is a Databricks workspace?
What is Unity Catalog?
What is a metastore?
Why do I need both a workspace and a metastore?
Where is my data actually stored?
Which parts live in Databricks, and which parts live in AWS or Azure?
Which things cost money, and how?
When should my organization choose serverless or classic compute?

This guide explains the concepts in simple language first, then goes deeper with architecture, storage, cost, ownership, and adoption patterns.

Part 1: The simple mental model

The six most important terms

1. Workspace

A Databricks workspace is the place where users log in and work.

It is the UI and working environment where people:

open notebooks
run queries
create dashboards
attach compute
collaborate with teammates
manage files and code

Think of it as your office.

2. Unity Catalog

Unity Catalog is Databricks’ centralized system for organizing and governing data and AI assets.

It gives you:

one shared data namespace
permissions and access control
governance across workspaces
auditing and data sharing support

Think of it as the company rulebook plus directory for data.

3. Metastore

A metastore is the top-level root container inside Unity Catalog.

It is the root of the data organization hierarchy.

Think of it as the main library index.

4. Catalog

A catalog is a top-level business area inside the metastore.

Examples:

finance
sales
marketing
sandbox

Think of a catalog like a major department folder.

5. Schema

A schema is a subfolder inside a catalog.

Examples:

raw
staging
analytics
gold

Think of a schema like a subfolder inside a department.

6. Table

A table is an actual data object.

Example:

sales.analytics.orders

Think of a table as the actual dataset people query.

One-line summary of each term

Workspace = where people work
Unity Catalog = the governance system
Metastore = the root container in Unity Catalog
Catalog = major data area
Schema = subfolder in that area
Table = actual dataset

Part 2: The most important relationship

A lot of confusion disappears once you separate these two ideas:

Workspace = user working environment
Metastore = data governance root

A workspace is not the same as a metastore.

A workspace gets attached to a metastore.

That means:

users work in the workspace
the workspace uses the metastore to know which catalogs, schemas, tables, and permissions exist

Diagram: workspace and metastore relationship

flowchart LR
    U[Users] --> W[Databricks Workspace]
    W --> C[Compute]
    W --> M[Unity Catalog Metastore]
    M --> CAT[Catalog]
    CAT --> SCH[Schema]
    SCH --> TBL[Table]

Key idea

The workspace is where users run things.
The metastore is what gives structure and governance to the data those users see.

Part 3: Why both are needed

Why a workspace is needed

You need a workspace because users need a place to:

log in
write notebooks
run compute
create jobs
view dashboards
collaborate

Without a workspace, there is no user-facing working environment.

Why a metastore is needed

You need a metastore because data needs:

a standard namespace
permissions
centralized control
shared definitions across teams and workspaces

Without a metastore, users may still have compute and notebooks, but data governance becomes fragmented.

Very simple analogy

Workspace is the office

It has:

desks
screens
tools
people working

Metastore is the central company library catalog

It tells you:

what books exist
what sections exist
who can read them
how everything is organized

So:

office != library catalog
but the office uses the library catalog

Part 4: The full hierarchy

Data hierarchy in Unity Catalog

flowchart TD
    MS[Metastore]
    MS --> C1[Catalog: finance]
    MS --> C2[Catalog: sales]
    C1 --> S1[Schema: raw]
    C1 --> S2[Schema: analytics]
    C2 --> S3[Schema: raw]
    C2 --> S4[Schema: analytics]
    S2 --> T1[Table: invoices]
    S2 --> T2[View: monthly_summary]
    S4 --> T3[Table: orders]
    S4 --> T4[Volume: documents]

Example names

metastore: company-us-east-1
catalog: sales
schema: analytics
table: orders

Full table name:

sales.analytics.orders

That full name does not include the metastore name in everyday SQL usage.
The metastore sits above the catalog as the governance root.
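The three-level name can be taken apart in a few lines of Python. This is only a small sketch to make the naming rule concrete: the helper names `TableName` and `parse_table_name` are made up for this guide, and the sketch assumes plain identifiers (it does not handle backtick-quoted names that contain dots).

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three parts of a Unity Catalog table name (hypothetical helper)."""
    catalog: str
    schema: str
    table: str

    def __str__(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts.

    The metastore name is not part of the string, matching everyday SQL usage.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("sales.analytics.orders")
print(name.catalog, name.schema, name.table)
```

Notice that nothing in the name says which metastore is used. That comes from the workspace the query runs in.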

Part 5: Where things are actually stored

This is where many people get lost.

Important distinction

The metastore stores metadata and governance

It stores things like:

object definitions
permissions
governance relationships
references to managed or external data

The metastore does not store the raw data bytes itself

The actual data files for tables and volumes are stored in object storage.

That storage can be:

Databricks default storage
AWS S3
Azure Data Lake Storage
Google Cloud Storage

depending on setup.

Diagram: metadata versus actual data

flowchart LR
    W[Workspace] --> M[Metastore]
    M --> META[Metadata and Permissions]
    M --> CAT[Catalog/Schema/Table Definitions]
    M --> ST[Storage Location References]
    ST --> S3[AWS S3 or ADLS or GCS or Databricks Default Storage]
    S3 --> FILES[Actual Table and Volume Files]

What is metadata?

Metadata means information about the data, such as:

table name
schema columns
owner
permissions
table location
whether it is managed or external

What is actual data?

Actual data means:

Parquet data files
Delta Lake transaction logs that sit alongside those Parquet files
Iceberg data and metadata files
documents in volumes
data files in cloud object storage

Part 6: Managed versus external data

This is one of the most useful ideas in Databricks.

Managed tables

A managed table means Databricks manages the table’s storage location based on the managed storage location configured at the metastore, catalog, or schema level.

Use managed tables when:

you want simpler lifecycle management
you want stronger governance
you want the easiest Databricks-native experience

External tables

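An external table means the data lives at a cloud storage path you control (often one that existed before Databricks), and you register that path in Unity Catalog. Dropping an external table removes the registration but leaves the underlying files in place.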

Use external tables when:

you already have data in S3 or ADLS
multiple systems share the same files
you want storage independence
your data lake existed before Databricks

Diagram: managed versus external

flowchart TD
    UC[Unity Catalog]
    UC --> MT[Managed Table]
    UC --> ET[External Table]

    MT --> ML[Managed Storage Location]
    ML --> OBJ1[Cloud Object Storage Managed by Databricks Rules]

    ET --> EX[External Location]
    EX --> OBJ2[Existing S3 or ADLS or GCS Path]

Rule of thumb

For greenfield adoption, managed tables are often simpler
For existing enterprise lake storage, external tables are common
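In SQL, the visible difference between the two is usually the LOCATION clause. The sketch below builds both statements as strings so you can compare them side by side. The helper name `create_table_sql`, the column list, and the bucket path `s3://example-bucket/sales/orders` are placeholders chosen for illustration, not values from a real environment.

```python
from typing import Dict, Optional


def create_table_sql(full_name: str,
                     columns: Dict[str, str],
                     location: Optional[str] = None) -> str:
    """Build a CREATE TABLE statement.

    Without a location the table is managed (Databricks picks the path under
    the managed storage location). With a location it is an external table.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    sql = f"CREATE TABLE {full_name} ({column_list})"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


columns = {"order_id": "BIGINT", "amount": "DOUBLE"}  # illustrative columns

managed_sql = create_table_sql("sales.analytics.orders", columns)
external_sql = create_table_sql(
    "sales.raw.orders", columns,
    location="s3://example-bucket/sales/orders",  # hypothetical path
)

print(managed_sql)
print(external_sql)
```

The external statement only works if the path is covered by an external location that you are allowed to use, which is normally set up by the data platform team.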

Part 7: What is created where

Workspace: where it is created

A workspace is created in the Databricks account layer.

Depending on the cloud, the workspace is one of two types:

classic workspace
serverless workspace

A workspace is associated with a region and cloud platform.

Metastore: where it is created

A metastore is created at the Databricks account level and linked to one or more workspaces in the same region.

You typically create:

one metastore per region

not:

one metastore per user
one metastore per notebook
one metastore per team unless there is a strong isolation reason

Catalog: where it is created

Catalogs are created inside the metastore.

Examples:

finance
sales
sandbox
ml

Schema: where it is created

Schemas are created inside a catalog.

Examples:

raw
curated
analytics

Table: where it is created

Tables are created inside a schema.

Examples:

finance.analytics.invoices
sales.raw.orders

Part 8: Why workspace creation asks about an AWS account or serverless, but metastore creation does not

This is one of the most important points.

Workspace creation asks about an AWS account or serverless because it is about infrastructure and the compute model

When creating a workspace, Databricks needs to know:

where the workspace runs
whether the environment is classic or serverless
who owns more of the infrastructure
how root/workspace/default storage should behave

That is why AWS versus serverless matters at workspace creation.

Metastore creation does not ask about an AWS account or serverless in the same way because the metastore is not compute

Metastore creation is about:

governance
metadata
namespace
storage references for managed objects

So the metastore is about how data is governed, not how compute is provisioned.

Part 9: Which parts live in Databricks versus AWS or Azure

This depends on whether you use serverless or classic patterns.

Simplified ownership model

Usually in Databricks-managed space

serverless compute
serverless workspace runtime environment
Databricks control plane
default storage features
metadata services and governance services

Usually in customer cloud account

classic compute VMs
customer-managed S3 or ADLS storage
existing data lake storage
networking setup for classic workspace deployments

Diagram: ownership split

flowchart LR
    subgraph Databricks_Managed[Databricks Managed]
        DP[Control Plane]
        SC[Serverless Compute]
        DS[Default Storage]
        UC[Unity Catalog Services]
    end

    subgraph Customer_Cloud[Customer Cloud Account]
        CC[Classic Compute Resources]
        CS[Customer Object Storage]
        NW[Customer Networking]
    end

    W[Workspace] --> DP
    W --> SC
    W --> CC
    M[Metastore] --> UC
    M --> DS
    M --> CS

AWS examples

Can live in Databricks

serverless notebook compute
serverless jobs compute
serverless SQL warehouse infrastructure
default storage for serverless workspace and default catalog

Can live in AWS account

classic all-purpose clusters
classic job clusters
S3 buckets for external or managed storage
IAM roles and networking for customer-managed storage access

Azure examples

Can live in Databricks

serverless compute
Databricks-managed services
default storage scenarios

Can live in Azure subscription

classic compute resources
ADLS storage for managed or external data
managed identities or service principals for storage access

Part 10: Cost model – what costs money and how

First principle

Different parts of Databricks are billed in different ways, so it helps to look at each component separately.

1. Workspace

Creating a workspace by itself is not usually the main cost driver.
The main costs come from what you use inside it.

Typical cost drivers connected to workspaces:

compute usage
storage usage
network and cloud services
premium features depending on plan

2. Metastore

A metastore itself is generally not the big cost driver.

A metastore mainly represents governance and metadata organization.

Costs appear when you use:

managed storage
cloud storage
compute that reads and writes data
serverless or classic workloads against that data

3. Compute

Compute is usually the biggest cost driver.

Serverless compute

You pay Databricks for usage measured in DBUs (Databricks Units). The infrastructure cost is included in the serverless rate, so there is no separate VM charge from your cloud provider.

Examples:

serverless notebooks
serverless jobs
serverless SQL warehouses

Classic compute

You pay for:

Databricks DBUs
cloud infrastructure such as EC2 or Azure VMs
storage and networking around those compute resources
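The difference between the two compute models can be written as simple arithmetic. The sketch below is a rough illustration only: the function names are made up for this guide, and every rate in it is a placeholder number, not a real Databricks or cloud price. Storage and networking are left out to keep the comparison small.

```python
def serverless_cost(dbus: float, serverless_dbu_rate: float) -> float:
    """Serverless: one bill from Databricks, infrastructure included in the rate."""
    return dbus * serverless_dbu_rate


def classic_cost(dbus: float, dbu_rate: float,
                 vm_hours: float, vm_rate_per_hour: float) -> float:
    """Classic: Databricks DBUs plus the cloud provider's VM charge."""
    databricks_part = dbus * dbu_rate
    cloud_part = vm_hours * vm_rate_per_hour
    return databricks_part + cloud_part


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
serverless_total = serverless_cost(dbus=2.0, serverless_dbu_rate=0.75)
classic_total = classic_cost(dbus=2.0, dbu_rate=0.5,
                             vm_hours=1.0, vm_rate_per_hour=0.25)

print(f"serverless: {serverless_total:.2f} USD")
print(f"classic:    {classic_total:.2f} USD")
```

Which side comes out cheaper depends entirely on the real rates and on how long classic VMs sit idle, so treat the output as a shape, not a verdict.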

4. Storage

Storage costs depend on where the data lives.

Examples:

S3 charges in AWS
ADLS charges in Azure
Databricks default storage, where supported features or serverless workspace defaults use it

5. Data transfer and networking

Depending on setup, cloud networking and data transfer can also cost money.

Part 11: Cost by component

Cost table in plain English

Component	Main purpose	Direct cost on its own?	Main cost type
Workspace	User environment	Usually not the main cost	Platform usage around workspace
Unity Catalog	Governance system	Not usually the main cost	Indirect through storage and usage
Metastore	Top-level governance root	Usually low direct cost concern	Indirect through data and compute
Catalog	Organizing data	No meaningful direct cost alone	None by itself
Schema	Organizing data	No meaningful direct cost alone	None by itself
Table	Actual data object	Yes, indirectly	Storage + compute to read/write
Serverless notebook compute	Interactive notebook execution	Yes	Serverless DBU usage
SQL warehouse	SQL analytics compute	Yes	SQL/serverless usage
Classic cluster	Customer-cloud compute	Yes	DBU + cloud VM/infrastructure
Storage location	Holds data files	Yes	S3/ADLS/GCS/default storage

Part 12: The three deployment patterns every organization should understand

Pattern A: Serverless-first organization

What it means

The organization prefers Databricks-managed serverless compute for notebooks, jobs, and SQL.

Good for

fast setup
low infrastructure burden
internal training
analytics teams that want simplicity
organizations with limited cloud-infra admin support

Typical characteristics

serverless workspace
Unity Catalog enabled
default storage for default catalog
optional customer cloud storage for additional catalogs

Pros

fastest time to value
minimal infra management
simpler operations
easier for many users

Cons

less low-level infrastructure control
some organizations prefer more customer-owned storage and networking
advanced customization may push teams toward classic resources

Pattern B: Classic customer-cloud-heavy organization

What it means

The organization uses classic compute in its own cloud account and stores data in customer-managed storage.

Good for

large enterprise data lake already exists
strong cloud platform team
strict networking requirements
high customization needs

Typical characteristics

classic workspace
Unity Catalog enabled
S3 or ADLS used heavily
external locations and customer-owned storage patterns

Pros

more control
easier alignment with existing cloud architecture
fits established enterprise landing zones

Cons

more setup complexity
more operations burden
slower onboarding for new teams

Pattern C: Hybrid organization

What it means

The organization uses both:

serverless for many interactive workloads
classic compute for special workloads
customer cloud storage for durable enterprise data

Good for

most medium and large organizations
teams adopting gradually
mixed analytics and engineering workloads

Pros

balance of simplicity and control
practical migration path
good for phased modernization

Cons

governance and FinOps must be clear
architecture becomes more complex if standards are weak

Part 13: A very practical “who does what” view

Account admin

Usually responsible for:

creating workspaces
creating or assigning metastores
enabling Unity Catalog
setting broad governance patterns

Workspace admin

Usually responsible for:

workspace-level permissions
compute access and policies
serverless usage policies
user onboarding inside the workspace

Data platform team

Usually responsible for:

storage design
external locations
catalog strategy
naming conventions
security and permissions

Data users

Usually responsible for:

creating notebooks
running queries
using approved catalogs/schemas
building dashboards and pipelines

Part 14: What gets shared across workspaces

This is a very important reason Unity Catalog exists.

If multiple workspaces use the same metastore

Then they can share the same:

catalogs
schemas
tables
volumes
permissions model

This means you might have:

Dev workspace
Test workspace
Prod workspace

all attached to the same metastore, or to separate metastores depending on your governance design.

Diagram: multiple workspaces sharing one metastore

flowchart TD
    W1[Dev Workspace] --> M[Regional Metastore]
    W2[Test Workspace] --> M
    W3[Prod Workspace] --> M

    M --> CAT1[Catalog: finance]
    M --> CAT2[Catalog: sales]
    M --> CAT3[Catalog: sandbox]

When to share one metastore

Use one metastore across multiple workspaces when:

they are in the same region
you want shared governance
teams need a common namespace
access control can be handled through permissions

When separate metastores may make sense

Use separate metastores when:

regions differ
legal or residency requirements differ
business units require stronger separation
platform governance intentionally isolates environments
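The same-region rule is easy to picture as a tiny model. This is a toy sketch, not Databricks code: the `Workspace` and `Metastore` classes and the `assign` method are invented here purely to show the rule that a workspace can only be attached to a metastore in its own region.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workspace:
    name: str
    region: str


@dataclass
class Metastore:
    name: str
    region: str
    workspaces: List[str] = field(default_factory=list)

    def assign(self, workspace: Workspace) -> None:
        """Attach a workspace, enforcing the same-region rule."""
        if workspace.region != self.region:
            raise ValueError(
                f"{workspace.name} is in {workspace.region}, "
                f"but metastore {self.name} is in {self.region}"
            )
        self.workspaces.append(workspace.name)


# Dev, test, and prod in one region can all share one metastore.
metastore = Metastore("company-us-east-1", "us-east-1")
for workspace_name in ("dev", "test", "prod"):
    metastore.assign(Workspace(workspace_name, "us-east-1"))

print(metastore.workspaces)
```

A workspace in a different region would raise an error in this model, which is the situation where a second regional metastore is needed.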

Part 15: Where to create what in real life

Workspace creation checklist

Create a workspace when:

a new team needs its own working environment
you want separate admin boundaries
you want separate dev/test/prod environments
you need a new regional deployment

Metastore creation checklist

Create a metastore when:

a region does not yet have one
you intentionally need a separate governance root
legal or organization boundaries require isolation

Do not create a new metastore just because:

a new analyst joined
a new notebook is created
a new project starts
a new schema is needed

Catalog creation checklist

Create a catalog when:

you need a major logical data domain
you want department-level ownership
you need separate storage or governance boundaries

Examples:

finance
sales
marketing
sandbox

Schema creation checklist

Create a schema when:

you want a sub-domain inside a catalog
you want lifecycle stages like raw/staging/analytics
you want a team-specific area under a catalog

Table creation checklist

Create a table when:

you have actual structured data to manage and query

Part 16: Where the data is stored in serverless versus classic setups

Serverless-first example

flowchart LR
    U[User] --> W[Serverless Workspace]
    W --> SN[Serverless Notebook Compute]
    W --> M[Metastore]
    M --> C[Default Catalog]
    C --> T[Managed Table]
    T --> DS[Databricks Default Storage or Customer Cloud Storage]

Typical behavior

workspace is serverless
compute is Databricks-managed
default catalog may use default storage
additional catalogs may use customer cloud storage

Classic AWS example

flowchart LR
    U[User] --> W[Classic Workspace]
    W --> CL[Classic Compute in AWS Account]
    W --> M[Metastore]
    M --> C[Catalog]
    C --> T[Managed or External Table]
    T --> S3[S3 in Customer AWS Account]

Typical behavior

workspace exists in Databricks environment
classic compute runs in customer AWS account
data commonly lives in S3
metastore governs access to that data

Azure example

flowchart LR
    U[User] --> W[Azure Databricks Workspace]
    W --> CL[Classic or Serverless Compute]
    W --> M[Metastore]
    M --> C[Catalog]
    C --> T[Managed or External Table]
    T --> ADLS[Azure Data Lake Storage]

Part 17: Common misunderstandings and the correct version

Misunderstanding 1

“Workspace stores everything, so why do I need metastore?”

Correct version

Workspace stores the working environment and some workspace assets.
Unity Catalog metastore governs the enterprise data namespace and permissions.

Misunderstanding 2

“If I create a workspace, a separate metastore must be created for it.”

Correct version

Not always.
A workspace is often assigned to an existing regional metastore.

Misunderstanding 3

“If I use serverless, then Unity Catalog is not needed.”

Correct version

Serverless is a compute model.
Unity Catalog is the governance model.
They solve different problems.

Misunderstanding 4

“Metastore contains all actual data files.”

Correct version

Metastore contains metadata and governance.
Actual data files live in object storage.

Misunderstanding 5

“If I name a usage policy 5usd, Databricks will stop at 5 USD.”

Correct version

A serverless usage policy is mainly for attribution and tagging, not an automatic hard budget cap by default.

Part 18: Cost-conscious learning guide

If you are learning Databricks and want low cost, this is the safest path.

For cheapest learning

Use:

one workspace already provided by your org
the existing metastore already assigned
a sandbox catalog or schema if allowed
tiny sample datasets
minimal notebook runtime

Avoid creating:

unnecessary workspaces
unnecessary metastores
extra SQL warehouses
large classic clusters

Prefer for learning

serverless notebook with tiny sample data
very short sessions
notebook SQL instead of starting many tools

Avoid for learning

large data scans
long-running notebooks
large warehouses
repeated run-all sessions

Part 19: Adoption guide for organizations

Stage 1: Small team adoption

Typical pattern

one workspace
one regional metastore
one or two catalogs
mostly serverless or personal compute
small governance model

Good catalog design

sandbox
shared
analytics

Why this works

simple
low friction
fast onboarding

Stage 2: Department adoption

Typical pattern

dev and prod workspaces
shared regional metastore
business-domain catalogs
stronger permissions
serverless plus some classic jobs

Good catalog design

finance
sales
marketing
ml

Why this works

separates ownership by business area
still keeps centralized governance

Stage 3: Enterprise platform adoption

Typical pattern

multiple workspaces by environment and team
one metastore per region
clear storage architecture
managed tables for some workloads, external tables for others
standardized compute policies
strong FinOps and security practices

Good catalog design

Based on data domain, environment, regulatory boundary, or platform standards.

Why this works

scalable governance
supports multiple teams
avoids chaos from workspace-by-workspace data silos

Part 20: Recommended decision guide

Choose serverless workspace when

you want fast setup
you want less infrastructure work
your workloads fit serverless-supported patterns
you want easy training or sandbox environments

Choose classic-heavy setup when

you need more infrastructure control
your org already has a strong cloud landing zone
you require customer-owned networking and storage patterns
special workloads need custom compute behavior

Choose one metastore per region when

you want standard Databricks governance practice
multiple workspaces in that region should share the same namespace

Create a new catalog when

a major data domain needs separation
ownership needs to be clear
storage or governance boundaries differ

Create a new schema when

you need a logical sub-area under a catalog
raw/curated/analytics separation is needed

Create a new table when

you have actual data to store or register

Part 21: Beginner tutorial path

Tutorial 1: Understand the layers

Answer these five questions in your own environment:

What workspace am I using?
Is Unity Catalog enabled?
Which metastore is attached?
Which catalogs exist?
Which compute types can I use?

Tutorial 2: Find the data hierarchy

Try to identify:

one catalog
one schema
one table

Example:

catalog = samples
schema = nyctaxi
table = trips
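Several of the questions in Tutorials 1 and 2 can be answered with a handful of SQL statements. The sketch below only collects the statements and prints them. The helper names are made up for this guide, and it assumes the `samples` catalog is visible in your workspace, which may not be true everywhere.

```python
from typing import Callable, List


def orientation_statements(catalog: str, schema: str, table: str) -> List[str]:
    """SQL statements that walk down the hierarchy from metastore to table."""
    return [
        "SELECT current_metastore()",            # which metastore is attached
        "SHOW CATALOGS",                         # which catalogs you can see
        f"SHOW SCHEMAS IN {catalog}",            # schemas inside one catalog
        f"SHOW TABLES IN {catalog}.{schema}",    # tables inside one schema
        f"SELECT * FROM {catalog}.{schema}.{table} LIMIT 5",
    ]


def run_statements(statements: List[str],
                   run_sql: Callable[[str], object] = print) -> None:
    """Pass each statement to run_sql; by default they are only printed."""
    for statement in statements:
        run_sql(statement)


statements = orientation_statements("samples", "nyctaxi", "trips")
run_statements(statements)
```

Inside a Databricks notebook you could pass something like `lambda s: spark.sql(s).show()` as `run_sql` to actually execute them, or simply paste each statement into a SQL cell.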

Tutorial 3: Create a simple sandbox structure

If your permissions allow it:

create catalog sandbox
create schema rajesh
create a small test table
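The three steps above map to three SQL statements, created from the top of the hierarchy down. The sketch below builds them as strings. The function name, the table name `test_table`, and its two columns are placeholders for illustration. On some setups, creating a catalog may also need a managed location or extra privileges, so expect to adapt the first statement.

```python
from typing import List


def sandbox_setup_sql(catalog: str = "sandbox",
                      schema: str = "rajesh",
                      table: str = "test_table") -> List[str]:
    """Return the statements in the order they must run: catalog, schema, table."""
    full_table = f"{catalog}.{schema}.{table}"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        f"CREATE TABLE IF NOT EXISTS {full_table} (id INT, note STRING)",
        f"INSERT INTO {full_table} VALUES (1, 'hello')",
    ]


for statement in sandbox_setup_sql():
    print(statement)
```

Because no LOCATION is given, the test table is a managed table, which keeps cleanup simple when you are done.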

Tutorial 4: Learn managed versus external

Practice with:

one managed table
one external table

Observe:

where each one points
who controls the storage path
how permissions are applied

Tutorial 5: Understand cost practically

Run:

one tiny serverless notebook query
one tiny SQL warehouse query

Then compare:

which compute was used
which billing view records it
what tags or usage policies appeared
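One place to look is the billing usage system table. The sketch below builds a query that sums recent usage per day and SKU. It assumes system tables are enabled for your account and that you have been granted access to `system.billing.usage`, which is often not the case for ordinary users. The function name `recent_usage_sql` is made up for this guide.

```python
def recent_usage_sql(days: int = 7) -> str:
    """Build a query that sums usage per day and SKU for the last `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return (
        "SELECT usage_date, sku_name, usage_unit, "
        "SUM(usage_quantity) AS total_usage "
        "FROM system.billing.usage "
        f"WHERE usage_date >= date_sub(current_date(), {days}) "
        "GROUP BY usage_date, sku_name, usage_unit "
        "ORDER BY usage_date, sku_name"
    )


print(recent_usage_sql(days=3))
```

The SKU names in the result usually make it clear whether a row came from serverless notebooks, a SQL warehouse, or classic compute. Usage can take some time to appear, so a query run minutes after your test may show nothing yet.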

Part 22: A complete end-to-end example

Imagine a company called Acme.

Their setup

Region: us-east-1
Workspaces: dev, prod
One regional metastore
Catalogs: finance, sales, sandbox
Schemas under sales: raw, analytics
Table: sales.analytics.orders

How it works

Users log into the dev workspace.
They attach a notebook to serverless compute.
The workspace is already linked to the regional metastore.
The metastore exposes the sales catalog.
Inside that catalog, they query sales.analytics.orders.
Unity Catalog checks permissions.
The actual table files are read from object storage.
Compute cost is generated by the serverless notebook run.
Storage cost is generated by the cloud storage used for the table files.
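The permission check in the middle of that flow can be sketched as a toy model. To read a table, a user generally needs USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table. The code below is a deliberate simplification: it ignores ownership, group membership, and privilege inheritance from parent objects, and the principal names `analysts` and `interns` are invented for the example.

```python
from typing import Set, Tuple

# (principal, privilege, securable object)
Grant = Tuple[str, str, str]


def can_select(grants: Set[Grant], principal: str, full_name: str) -> bool:
    """Return True if the principal holds every privilege needed to read the table."""
    catalog, schema, _table = full_name.split(".")
    needed = {
        (principal, "USE CATALOG", catalog),
        (principal, "USE SCHEMA", f"{catalog}.{schema}"),
        (principal, "SELECT", full_name),
    }
    return needed.issubset(grants)


grants = {
    ("analysts", "USE CATALOG", "sales"),
    ("analysts", "USE SCHEMA", "sales.analytics"),
    ("analysts", "SELECT", "sales.analytics.orders"),
}

print(can_select(grants, "analysts", "sales.analytics.orders"))  # True
print(can_select(grants, "interns", "sales.analytics.orders"))   # False
```

Only after a check like this passes does compute read the table files from object storage, which is why governance and storage stay separate layers.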

Why this is powerful

workspace gives user experience
compute gives execution
metastore gives governance
storage gives persistence

All four layers work together.

Part 23: The shortest version possible

If you remember only this, remember this:

The core model

Workspace = where users work
Compute = where code runs
Unity Catalog = the governance system
Metastore = the root of that governance system
Catalog / Schema / Table = the data organization hierarchy
Storage = where the actual files live

The cost model

governance objects themselves are usually not the main cost
compute and storage are the real cost drivers

The cloud model

serverless = more Databricks-managed
classic = more customer-cloud-managed
data can live in Databricks default storage or in customer cloud storage depending on design

Part 24: Final recommended best practices

For individuals learning Databricks

do not create new metastores unless required
use the existing workspace and metastore
work in a sandbox catalog/schema
use small datasets and short notebook sessions

For small teams

keep one regional metastore
define a simple catalog strategy early
use serverless first where possible
avoid over-engineering storage on day one

For enterprises

standardize catalog naming
design storage intentionally
document when to use managed versus external
treat one metastore per region as the default starting point
use multiple workspaces where admin boundaries or lifecycle differences are needed
use FinOps tagging and usage monitoring early

Final summary

Databricks has several layers that work together:

Workspace gives users a place to work.
Compute runs notebooks, jobs, and queries.
Unity Catalog governs data and AI assets.
Metastore is the top-level root of Unity Catalog.
Catalogs, schemas, and tables organize data.
Storage holds the actual data files.

When people get confused, it is usually because they mix:

working environment
governance
storage
compute
cloud ownership

Once you keep those separate, the whole Databricks architecture becomes much easier to understand.

One-page cheat sheet

What it is

Workspace = office
Compute = engine
Unity Catalog = rulebook
Metastore = root directory
Catalog = department folder
Schema = subfolder
Table = dataset

Where it lives

Workspace = Databricks environment
Metastore = Databricks account-level governance object
Table files = object storage
Compute = Databricks-managed serverless or customer-cloud classic

What costs money

compute
storage
networking
some platform usage

What usually has no meaningful direct cost by itself

catalog
schema
metastore object itself

Databricks – The Master Guide to Databricks Workspaces, Unity Catalog, Metastores, Storage, Compute, and Cost
