Who this guide is for
This guide is for people who are asking questions like:
- What is a Databricks workspace?
- What is Unity Catalog?
- What is a metastore?
- Why do I need both a workspace and a metastore?
- Where is my data actually stored?
- Which parts live in Databricks, and which parts live in AWS or Azure?
- Which things cost money, and how?
- When should my organization choose serverless or classic compute?
This guide explains the concepts in simple language first, then goes deeper with architecture, storage, cost, ownership, and adoption patterns.
Part 1: The simple mental model
The six most important terms
1. Workspace
A Databricks workspace is the place where users log in and work.
It is the UI and working environment where people:
- open notebooks
- run queries
- create dashboards
- attach compute
- collaborate with teammates
- manage files and code
Think of it as your office.
2. Unity Catalog
Unity Catalog is Databricks’ centralized system for organizing and governing data and AI assets.
It gives you:
- one shared data namespace
- permissions and access control
- governance across workspaces
- auditing and data sharing support
Think of it as the company rulebook plus directory for data.
3. Metastore
A metastore is the top-level container inside Unity Catalog.
It is the root of the data organization hierarchy.
Think of it as the main library index.
4. Catalog
A catalog is a top-level business area inside the metastore.
Examples:
- `finance`
- `sales`
- `marketing`
- `sandbox`
Think of a catalog like a major department folder.
5. Schema
A schema is a subfolder inside a catalog.
Examples:
- `raw`
- `staging`
- `analytics`
- `gold`
Think of a schema like a subfolder inside a department.
6. Table
A table is an actual data object.
Example:
`sales.analytics.orders`
Think of a table as the actual dataset people query.
One sentence summary
- Workspace = where people work
- Unity Catalog = the governance system
- Metastore = the root container in Unity Catalog
- Catalog = major data area
- Schema = subfolder in that area
- Table = actual dataset
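The three-level namespace above (`catalog.schema.table`) can be sketched as a toy Python helper. This is not a Databricks API, just an illustration of the naming rule:

```python
# Toy helper illustrating Unity Catalog's three-level naming:
# catalog.schema.table. Not a Databricks API, just the naming rule.

def split_table_name(full_name: str) -> dict:
    """Split 'catalog.schema.table' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3:
        raise ValueError("Expected catalog.schema.table")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

print(split_table_name("sales.analytics.orders"))
# {'catalog': 'sales', 'schema': 'analytics', 'table': 'orders'}
```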
Part 2: The most important relationship
A lot of confusion disappears once you separate these two ideas:
- Workspace = user working environment
- Metastore = data governance root
A workspace is not the same as a metastore.
A workspace gets attached to a metastore.
That means:
- users work in the workspace
- the workspace uses the metastore to know which catalogs, schemas, tables, and permissions exist
Diagram: workspace and metastore relationship
```mermaid
flowchart LR
U[Users] --> W[Databricks Workspace]
W --> C[Compute]
W --> M[Unity Catalog Metastore]
M --> CAT[Catalog]
CAT --> SCH[Schema]
SCH --> TBL[Table]
```
Key idea
The workspace is where users run things.
The metastore is what gives structure and governance to the data those users see.
Part 3: Why both are needed
Why a workspace is needed
You need a workspace because users need a place to:
- log in
- write notebooks
- run compute
- create jobs
- view dashboards
- collaborate
Without a workspace, there is no user-facing working environment.
Why a metastore is needed
You need a metastore because data needs:
- a standard namespace
- permissions
- centralized control
- shared definitions across teams and workspaces
Without a metastore, users may still have compute and notebooks, but data governance becomes fragmented.
Very simple analogy
Workspace is the office
It has:
- desks
- screens
- tools
- people working
Metastore is the central company library catalog
It tells you:
- what books exist
- what sections exist
- who can read them
- how everything is organized
So:
- office != library catalog
- but the office uses the library catalog
Part 4: The full hierarchy
Data hierarchy in Unity Catalog
```mermaid
flowchart TD
MS[Metastore]
MS --> C1[Catalog: finance]
MS --> C2[Catalog: sales]
C1 --> S1[Schema: raw]
C1 --> S2[Schema: analytics]
C2 --> S3[Schema: raw]
C2 --> S4[Schema: analytics]
S2 --> T1[Table: invoices]
S2 --> T2[View: monthly_summary]
S4 --> T3[Table: orders]
S4 --> T4[Volume: documents]
```
Example names
- metastore: `company-us-east-1`
- catalog: `sales`
- schema: `analytics`
- table: `orders`
Full table name:
`sales.analytics.orders`
That full name does not include the metastore name in everyday SQL usage.
The metastore sits above the catalog as the governance root.
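The hierarchy from the diagram above can be modeled as nested plain Python dicts: metastore, then catalogs, then schemas, then tables. This is a toy in-memory sketch, not a real Unity Catalog API:

```python
# Toy in-memory sketch of the Unity Catalog hierarchy:
# metastore -> catalogs -> schemas -> tables. Plain dicts, not a real API.

metastore = {
    "finance": {
        "raw": [],
        "analytics": ["invoices", "monthly_summary"],
    },
    "sales": {
        "raw": [],
        "analytics": ["orders"],
    },
}

def list_tables(catalog: str, schema: str) -> list:
    """Return the objects registered under catalog.schema."""
    return metastore[catalog][schema]

print(list_tables("sales", "analytics"))  # ['orders']
```

Notice that the metastore name itself does not appear in the lookup, which mirrors how everyday SQL uses only `catalog.schema.table`.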
Part 5: Where things are actually stored
This is where many people get lost.
Important distinction
The metastore stores metadata and governance
It stores things like:
- object definitions
- permissions
- governance relationships
- references to managed or external data
The metastore does not store all the raw data bytes itself
The actual data files for tables and volumes are stored in object storage.
That storage can be:
- Databricks default storage
- AWS S3
- Azure Data Lake Storage
- Google Cloud Storage
depending on setup.
Diagram: metadata versus actual data
```mermaid
flowchart LR
W[Workspace] --> M[Metastore]
M --> META[Metadata and Permissions]
M --> CAT[Catalog/Schema/Table Definitions]
M --> ST[Storage Location References]
ST --> S3[AWS S3 or ADLS or GCS or Databricks Default Storage]
S3 --> FILES[Actual Table and Volume Files]
```
What is metadata?
Metadata means information about the data, such as:
- table name
- schema columns
- owner
- permissions
- table location
- whether it is managed or external
What is actual data?
Actual data means:
- Parquet files
- Delta files
- Iceberg files
- documents in volumes
- data files in cloud object storage
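The metadata-versus-data split can be pictured as a small record type: the metastore holds a description of the table plus a pointer to object storage, while the bytes live at that pointer. A toy sketch (the bucket path and field names are hypothetical, not what Unity Catalog stores internally):

```python
from dataclasses import dataclass

# Toy illustration: the metastore holds a *record about* a table,
# including a pointer to object storage. The data bytes live elsewhere.

@dataclass
class TableMetadata:
    name: str              # three-level name
    columns: list          # column names
    owner: str
    table_type: str        # "MANAGED" or "EXTERNAL"
    storage_location: str  # where the actual files live

orders = TableMetadata(
    name="sales.analytics.orders",
    columns=["order_id", "amount", "order_date"],
    owner="data_platform_team",
    table_type="EXTERNAL",
    storage_location="s3://acme-lake/sales/orders/",  # hypothetical bucket
)

# The metadata record is tiny; the Delta/Parquet files at
# storage_location can be terabytes.
print(orders.storage_location)
```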
Part 6: Managed versus external data
This is one of the most useful ideas in Databricks.
Managed tables
A managed table means Databricks manages the table’s storage location based on the managed storage location configured at the metastore, catalog, or schema level.
Use managed tables when:
- you want simpler lifecycle management
- you want stronger governance
- you want the easiest Databricks-native experience
External tables
An external table means the data already exists in your cloud storage, and you register it in Unity Catalog.
Use external tables when:
- you already have data in S3 or ADLS
- multiple systems share the same files
- you want storage independence
- your data lake existed before Databricks
Diagram: managed versus external
```mermaid
flowchart TD
UC[Unity Catalog]
UC --> MT[Managed Table]
UC --> ET[External Table]
MT --> ML[Managed Storage Location]
ML --> OBJ1[Cloud Object Storage Managed by Databricks Rules]
ET --> EX[External Location]
EX --> OBJ2[Existing S3 or ADLS or GCS Path]
```
Rule of thumb
- For greenfield adoption, managed tables are often simpler
- For existing enterprise lake storage, external tables are common
Part 7: What is created where
Workspace: where it is created
A workspace is created in the Databricks account layer.
Depending on cloud and workspace type:
- classic workspace
- serverless workspace
A workspace is associated with a region and cloud platform.
Metastore: where it is created
A metastore is created at the Databricks account level and linked to one or more workspaces in the same region.
You typically create:
- one metastore per region
not:
- one metastore per user
- one metastore per notebook
- one metastore per team (unless there is a strong isolation reason)
Catalog: where it is created
Catalogs are created inside the metastore.
Examples:
- `finance`
- `sales`
- `sandbox`
- `ml`
Schema: where it is created
Schemas are created inside a catalog.
Examples:
- `raw`
- `curated`
- `analytics`
Table: where it is created
Tables are created inside a schema.
Examples:
- `finance.analytics.invoices`
- `sales.raw.orders`
Part 8: Why workspace creation asks AWS account or serverless, but metastore creation does not
This is one of the most important points.
Workspace creation asks about AWS account or serverless because workspace creation is about infrastructure and compute model
When creating a workspace, Databricks needs to know:
- where the workspace runs
- whether the environment is classic or serverless
- who owns more of the infrastructure
- how root/workspace/default storage should behave
That is why AWS versus serverless matters at workspace creation.
Metastore creation does not ask AWS account or serverless in the same way because the metastore is not compute
Metastore creation is about:
- governance
- metadata
- namespace
- storage references for managed objects
So the metastore is about how data is governed, not how compute is provisioned.
Part 9: Which parts live in Databricks versus AWS or Azure
This depends on whether you use serverless or classic patterns.
Simplified ownership model
Usually in Databricks-managed space
- serverless compute
- serverless workspace runtime environment
- Databricks control plane
- default storage features
- metadata services and governance services
Usually in customer cloud account
- classic compute VMs
- customer-managed S3 or ADLS storage
- existing data lake storage
- networking setup for classic workspace deployments
Diagram: ownership split
```mermaid
flowchart LR
subgraph Databricks_Managed[Databricks Managed]
DP[Control Plane]
SC[Serverless Compute]
DS[Default Storage]
UC[Unity Catalog Services]
end
subgraph Customer_Cloud[Customer Cloud Account]
CC[Classic Compute Resources]
CS[Customer Object Storage]
NW[Customer Networking]
end
W[Workspace] --> DP
W --> SC
W --> CC
M[Metastore] --> UC
M --> DS
M --> CS
```
AWS examples
Can live in Databricks
- serverless notebook compute
- serverless jobs compute
- serverless SQL warehouse infrastructure
- default storage for serverless workspace and default catalog
Can live in AWS account
- classic all-purpose clusters
- classic job clusters
- S3 buckets for external or managed storage
- IAM roles and networking for customer-managed storage access
Azure examples
Can live in Databricks
- serverless compute
- Databricks-managed services
- default storage scenarios
Can live in Azure subscription
- classic compute resources
- ADLS storage for managed or external data
- managed identities or service principals for storage access
Part 10: Cost model – what costs money and how
First principle
Different parts of Databricks cost money in different ways.
Not everything costs the same way.
1. Workspace
Creating a workspace by itself is not usually the main cost driver.
The main costs come from what you use inside it.
Typical cost drivers connected to workspaces:
- compute usage
- storage usage
- network and cloud services
- premium features depending on plan
2. Metastore
A metastore itself is generally not the big cost driver.
A metastore mainly represents governance and metadata organization.
Costs appear when you use:
- managed storage
- cloud storage
- compute that reads and writes data
- serverless or classic workloads against that data
3. Compute
Compute is usually the biggest cost driver.
Serverless compute
You pay based on what you run, usually measured in DBUs (Databricks Units) for serverless usage.
Examples:
- serverless notebooks
- serverless jobs
- serverless SQL warehouses
Classic compute
You pay for:
- Databricks DBUs
- cloud infrastructure such as EC2 or Azure VMs
- storage and networking around those compute resources
4. Storage
Storage costs depend on where the data lives.
Examples:
- S3 charges in AWS
- ADLS charges in Azure
- Databricks default storage use for supported features and serverless workspace defaults
5. Data transfer and networking
Depending on setup, cloud networking and data transfer can also cost money.
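A back-of-envelope sketch makes the "compute dominates" point concrete. All rates below are made-up placeholders; real DBU and storage prices depend on cloud, region, plan, and SKU, so check your own pricing:

```python
# Back-of-envelope cost sketch. Both rates are made-up placeholders;
# real DBU and storage prices vary by cloud, region, plan, and SKU.

DBU_RATE_USD = 0.55             # hypothetical $ per DBU (serverless)
STORAGE_RATE_USD_GB_MO = 0.023  # hypothetical $ per GB-month (S3-like)

def monthly_cost(dbus_per_day: float, storage_gb: float, days: int = 30) -> float:
    """Rough monthly estimate: compute usage plus storage at rest."""
    compute = dbus_per_day * days * DBU_RATE_USD
    storage = storage_gb * STORAGE_RATE_USD_GB_MO
    return round(compute + storage, 2)

# Small learning workload: 2 DBUs/day of compute, 50 GB of data.
# Compute (2 * 30 * 0.55 = 33.00) dwarfs storage (50 * 0.023 = 1.15).
print(monthly_cost(2, 50))
```

Even with placeholder rates, the shape of the result is the point: for most workloads, compute is the line item worth watching.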
Part 11: Cost by component
Cost table in plain English
| Component | Main purpose | Usually costs by itself? | Main cost type |
|---|---|---|---|
| Workspace | User environment | Usually not the main cost | Platform usage around workspace |
| Unity Catalog | Governance system | Not usually the main cost | Indirect through storage and usage |
| Metastore | Top-level governance root | Usually low direct cost concern | Indirect through data and compute |
| Catalog | Organizing data | No meaningful direct cost alone | None by itself |
| Schema | Organizing data | No meaningful direct cost alone | None by itself |
| Table | Actual data object | Yes, indirectly | Storage + compute to read/write |
| Serverless notebook compute | Interactive notebook execution | Yes | Serverless DBU usage |
| SQL warehouse | SQL analytics compute | Yes | SQL/serverless usage |
| Classic cluster | Customer-cloud compute | Yes | DBU + cloud VM/infrastructure |
| Storage location | Holds data files | Yes | S3/ADLS/GCS/default storage |
Part 12: The three deployment patterns every organization should understand
Pattern A: Serverless-first organization
What it means
The organization prefers Databricks-managed serverless compute for notebooks, jobs, and SQL.
Good for
- fast setup
- low infrastructure burden
- internal training
- analytics teams that want simplicity
- organizations with limited cloud-infra admin support
Typical characteristics
- serverless workspace
- Unity Catalog enabled
- default storage for default catalog
- optional customer cloud storage for additional catalogs
Pros
- fastest time to value
- minimal infra management
- simpler operations
- easier for many users
Cons
- less low-level infrastructure control
- some organizations prefer more customer-owned storage and networking
- advanced customization may push teams toward classic resources
Pattern B: Classic customer-cloud-heavy organization
What it means
The organization uses classic compute in its own cloud account and stores data in customer-managed storage.
Good for
- large enterprise data lake already exists
- strong cloud platform team
- strict networking requirements
- high customization needs
Typical characteristics
- classic workspace
- Unity Catalog enabled
- S3 or ADLS used heavily
- external locations and customer-owned storage patterns
Pros
- more control
- easier alignment with existing cloud architecture
- fits established enterprise landing zones
Cons
- more setup complexity
- more operations burden
- slower onboarding for new teams
Pattern C: Hybrid organization
What it means
The organization uses both:
- serverless for many interactive workloads
- classic compute for special workloads
- customer cloud storage for durable enterprise data
Good for
- most medium and large organizations
- teams adopting gradually
- mixed analytics and engineering workloads
Pros
- balance of simplicity and control
- practical migration path
- good for phased modernization
Cons
- governance and FinOps must be clear
- architecture becomes more complex if standards are weak
Part 13: A very practical “who does what” view
Account admin
Usually responsible for:
- creating workspaces
- creating or assigning metastores
- enabling Unity Catalog
- setting broad governance patterns
Workspace admin
Usually responsible for:
- workspace-level permissions
- compute access and policies
- serverless usage policies
- user onboarding inside the workspace
Data platform team
Usually responsible for:
- storage design
- external locations
- catalog strategy
- naming conventions
- security and permissions
Data users
Usually responsible for:
- creating notebooks
- running queries
- using approved catalogs/schemas
- building dashboards and pipelines
Part 14: What gets shared across workspaces
This is a very important reason Unity Catalog exists.
If multiple workspaces use the same metastore
Then they can share the same:
- catalogs
- schemas
- tables
- volumes
- permissions model
This means you might have:
- Dev workspace
- Test workspace
- Prod workspace
all attached to the same metastore, or to separate metastores depending on your governance design.
Diagram: multiple workspaces sharing one metastore
```mermaid
flowchart TD
W1[Dev Workspace] --> M[Regional Metastore]
W2[Test Workspace] --> M
W3[Prod Workspace] --> M
M --> CAT1[Catalog: finance]
M --> CAT2[Catalog: sales]
M --> CAT3[Catalog: sandbox]
```
When to share one metastore
Use one metastore across multiple workspaces when:
- they are in the same region
- you want shared governance
- teams need a common namespace
- access control can be handled through permissions
When separate metastores may make sense
Use separate metastores when:
- regions differ
- legal or residency requirements differ
- business units require stronger separation
- platform governance intentionally isolates environments
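The value of a shared metastore can be sketched as a single grants table consulted from every workspace. This is a toy model, not the real Unity Catalog privilege system; the group and table names are made up:

```python
# Toy illustration: with one shared metastore, the same grant answers
# the same way no matter which workspace the user logs into.
# Not a real Unity Catalog API; names are hypothetical.

grants = {("analysts", "sales.analytics.orders"): {"SELECT"}}

def can_select(group: str, table: str) -> bool:
    """Check whether a group holds SELECT on a table."""
    return "SELECT" in grants.get((group, table), set())

# Same answer from Dev, Test, or Prod, because the metastore is shared:
for workspace in ["dev", "test", "prod"]:
    print(workspace, can_select("analysts", "sales.analytics.orders"))
```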
Part 15: Where to create what in real life
Workspace creation checklist
Create a workspace when:
- a new team needs its own working environment
- you want separate admin boundaries
- you want separate dev/test/prod environments
- you need a new regional deployment
Metastore creation checklist
Create a metastore when:
- a region does not yet have one
- you intentionally need a separate governance root
- legal or organization boundaries require isolation
Do not create a new metastore just because:
- a new analyst joined
- a new notebook is created
- a new project starts
- a new schema is needed
Catalog creation checklist
Create a catalog when:
- you need a major logical data domain
- you want department-level ownership
- you need separate storage or governance boundaries
Examples:
- `finance`
- `sales`
- `marketing`
- `sandbox`
Schema creation checklist
Create a schema when:
- you want a sub-domain inside a catalog
- you want lifecycle stages like raw/staging/analytics
- you want a team-specific area under a catalog
Table creation checklist
Create a table when:
- you have actual structured data to manage and query
Part 16: Where the data is stored in serverless versus classic setups
Serverless-first example
```mermaid
flowchart LR
U[User] --> W[Serverless Workspace]
W --> SN[Serverless Notebook Compute]
W --> M[Metastore]
M --> C[Default Catalog]
C --> T[Managed Table]
T --> DS[Databricks Default Storage or Customer Cloud Storage]
```
Typical behavior
- workspace is serverless
- compute is Databricks-managed
- default catalog may use default storage
- additional catalogs may use customer cloud storage
Classic AWS example
```mermaid
flowchart LR
U[User] --> W[Classic Workspace]
W --> CL[Classic Compute in AWS Account]
W --> M[Metastore]
M --> C[Catalog]
C --> T[Managed or External Table]
T --> S3[S3 in Customer AWS Account]
```
Typical behavior
- workspace exists in Databricks environment
- classic compute runs in customer AWS account
- data commonly lives in S3
- metastore governs access to that data
Azure example
```mermaid
flowchart LR
U[User] --> W[Azure Databricks Workspace]
W --> CL[Classic or Serverless Compute]
W --> M[Metastore]
M --> C[Catalog]
C --> T[Managed or External Table]
T --> ADLS[Azure Data Lake Storage]
```
Part 17: Common misunderstandings and the correct version
Misunderstanding 1
“Workspace stores everything, so why do I need metastore?”
Correct version
Workspace stores the working environment and some workspace assets.
Unity Catalog metastore governs the enterprise data namespace and permissions.
Misunderstanding 2
“If I create a workspace, a separate metastore must be created for it.”
Correct version
Not always.
A workspace is often assigned to an existing regional metastore.
Misunderstanding 3
“If I use serverless, then Unity Catalog is not needed.”
Correct version
Serverless is a compute model.
Unity Catalog is the governance model.
They solve different problems.
Misunderstanding 4
“Metastore contains all actual data files.”
Correct version
Metastore contains metadata and governance.
Actual data files live in object storage.
Misunderstanding 5
“If I name a usage policy 5usd, Databricks will stop at 5 USD.”
Correct version
A serverless usage policy is mainly for attribution and tagging, not an automatic hard budget cap by default.
Part 18: Cost-conscious learning guide
If you are learning Databricks and want low cost, this is the safest path.
For cheapest learning
Use:
- one workspace already provided by your org
- the existing metastore already assigned
- a sandbox catalog or schema if allowed
- tiny sample datasets
- minimal notebook runtime
Avoid creating:
- unnecessary workspaces
- unnecessary metastores
- extra SQL warehouses
- large classic clusters
Prefer for learning
- serverless notebook with tiny sample data
- very short sessions
- SQL in notebooks instead of spinning up extra tools
Avoid for learning
- large data scans
- long-running notebooks
- large warehouses
- repeated run-all sessions
Part 19: Adoption guide for organizations
Stage 1: Small team adoption
Typical pattern
- one workspace
- one regional metastore
- one or two catalogs
- mostly serverless or personal compute
- small governance model
Good catalog design
- `sandbox`
- `shared`
- `analytics`
Why this works
- simple
- low friction
- fast onboarding
Stage 2: Department adoption
Typical pattern
- dev and prod workspaces
- shared regional metastore
- business-domain catalogs
- stronger permissions
- serverless plus some classic jobs
Good catalog design
- `finance`
- `sales`
- `marketing`
- `ml`
Why this works
- separates ownership by business area
- still keeps centralized governance
Stage 3: Enterprise platform adoption
Typical pattern
- multiple workspaces by environment and team
- one metastore per region
- clear storage architecture
- managed tables for some workloads, external tables for others
- standardized compute policies
- strong FinOps and security practices
Good catalog design
Based on data domain, environment, regulatory boundary, or platform standards.
Why this works
- scalable governance
- supports multiple teams
- avoids chaos from workspace-by-workspace data silos
Part 20: Recommended decision guide
Choose serverless workspace when
- you want fast setup
- you want less infrastructure work
- your workloads fit serverless-supported patterns
- you want easy training or sandbox environments
Choose classic-heavy setup when
- you need more infrastructure control
- your org already has a strong cloud landing zone
- you require customer-owned networking and storage patterns
- special workloads need custom compute behavior
Choose one metastore per region when
- you want standard Databricks governance practice
- multiple workspaces in that region should share the same namespace
Create a new catalog when
- a major data domain needs separation
- ownership needs to be clear
- storage or governance boundaries differ
Create a new schema when
- you need a logical sub-area under a catalog
- raw/curated/analytics separation is needed
Create a new table when
- you have actual data to store or register
Part 21: Beginner tutorial path
Tutorial 1: Understand the layers
Answer these five questions in your own environment:
- What workspace am I using?
- Is Unity Catalog enabled?
- Which metastore is attached?
- Which catalogs exist?
- Which compute types can I use?
Tutorial 2: Find the data hierarchy
Try to identify:
- one catalog
- one schema
- one table
Example:
- catalog = `samples`
- schema = `nyctaxi`
- table = `trips`
Tutorial 3: Create a simple sandbox structure
If your permissions allow it:
- create catalog `sandbox`
- create schema `rajesh`
- create a small test table
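The sandbox structure from Tutorial 3 maps to standard Unity Catalog DDL. In a Databricks notebook you would run each statement via `spark.sql(...)`; here they are plain strings for illustration, and the table name is a hypothetical example:

```python
# Standard Unity Catalog DDL for the Tutorial 3 sandbox structure.
# In a Databricks notebook, run each via spark.sql(stmt); shown here
# as plain strings. The test table name is a hypothetical example.

statements = [
    "CREATE CATALOG IF NOT EXISTS sandbox",
    "CREATE SCHEMA IF NOT EXISTS sandbox.rajesh",
    """CREATE TABLE IF NOT EXISTS sandbox.rajesh.test_table (
  id INT,
  note STRING
)""",
]

for stmt in statements:
    print(stmt.splitlines()[0])
```

Note the order: the catalog must exist before the schema, and the schema before the table, because each level is created inside its parent.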
Tutorial 4: Learn managed versus external
Practice with:
- one managed table
- one external table
Observe:
- where each one points
- who controls the storage path
- how permissions are applied
Tutorial 5: Understand cost practically
Run:
- one tiny serverless notebook query
- one tiny SQL warehouse query
Then compare:
- which compute was used
- which billing view records it
- what tags or usage policies appeared
Part 22: A complete end-to-end example
Imagine a company called Acme.
Their setup
- Region: `us-east-1`
- Workspaces: `dev`, `prod`
- One regional metastore
- Catalogs: `finance`, `sales`, `sandbox`
- Schemas under `sales`: `raw`, `analytics`
- Table: `sales.analytics.orders`
How it works
- Users log into the `dev` workspace.
- They attach a notebook to serverless compute.
- The workspace is already linked to the regional metastore.
- The metastore exposes the `sales` catalog.
- Inside that catalog, they query `sales.analytics.orders`.
- Unity Catalog checks permissions.
- The actual table files are read from object storage.
- Compute cost is generated by the serverless notebook run.
- Storage cost is generated by the cloud storage used for the table files.
Why this is powerful
- workspace gives user experience
- compute gives execution
- metastore gives governance
- storage gives persistence
All four layers work together.
Part 23: The shortest version possible
If you remember only this, remember this:
The core model
- Workspace = where users work
- Compute = where code runs
- Unity Catalog = the governance system
- Metastore = the root of that governance system
- Catalog / Schema / Table = the data organization hierarchy
- Storage = where the actual files live
The cost model
- governance objects themselves are usually not the main cost
- compute and storage are the real cost drivers
The cloud model
- serverless = more Databricks-managed
- classic = more customer-cloud-managed
- data can live in Databricks default storage or in customer cloud storage depending on design
Part 24: Final recommended best practices
For individuals learning Databricks
- do not create new metastores unless required
- use the existing workspace and metastore
- work in a sandbox catalog/schema
- use small datasets and short notebook sessions
For small teams
- keep one regional metastore
- define a simple catalog strategy early
- use serverless first where possible
- avoid over-engineering storage on day one
For enterprises
- standardize catalog naming
- design storage intentionally
- document when to use managed versus external
- treat one metastore per region as the default starting point
- use multiple workspaces where admin boundaries or lifecycle differences are needed
- use FinOps tagging and usage monitoring early
Final summary
Databricks has several layers that work together:
- Workspace gives users a place to work.
- Compute runs notebooks, jobs, and queries.
- Unity Catalog governs data and AI assets.
- Metastore is the top-level root of Unity Catalog.
- Catalogs, schemas, and tables organize data.
- Storage holds the actual data files.
When people get confused, it is usually because they mix:
- working environment
- governance
- storage
- compute
- cloud ownership
Once you keep those separate, the whole Databricks architecture becomes much easier to understand.
One-page cheat sheet
What it is
- Workspace = office
- Compute = engine
- Unity Catalog = rulebook
- Metastore = root directory
- Catalog = department folder
- Schema = subfolder
- Table = dataset
Where it lives
- Workspace = Databricks environment
- Metastore = Databricks account-level governance object
- Table files = object storage
- Compute = Databricks-managed serverless or customer-cloud classic
What costs money
- compute
- storage
- networking
- some platform usage
What usually does not matter as a direct cost by itself
- catalog
- schema
- metastore object itself