Tutorial 10: Fabric Lakehouse¶

Write synthetic data to a Microsoft Fabric Lakehouse as both Parquet files and Delta Lake tables.

Prerequisites¶

Python 3.10 or later
pip install sqllocks-spindle and pip install deltalake
Completed Tutorial 01: Hello, Spindle!
A Microsoft Fabric workspace with a Lakehouse (for deployment; local simulation works without one)

What You'll Learn¶

How to generate retail data and write it to Parquet files (simulating the Lakehouse Files section)
How to write the same data to Delta Lake format (simulating the Lakehouse Tables section)
How to inspect the resulting file structure and verify round-trip fidelity
How partitioning works in a medallion folder layout

Step 1: Generate Retail Data¶

Spindle's Spindle.generate() produces a complete relational dataset. We will use the small scale, which gives enough data to see realistic file sizes without waiting long.

from sqllocks_spindle import Spindle, RetailDomain

result = Spindle().generate(domain=RetailDomain(), scale="small", seed=42)

print(result.summary())

The result object holds every generated table. You can inspect table names with result.table_names and access DataFrames through result.tables["customer"].

Step 2: Write to Parquet (Lakehouse Files)¶

In a real Fabric Lakehouse, the Files section is the raw landing zone where upstream data arrives as Parquet or CSV. One call writes every table to its own Parquet file:

files = result.to_parquet("./lakehouse_demo/Files/landing/retail/")

print(f"Wrote {len(files)} Parquet files.")
for f in files:
    print(f"  {f}")

The directory structure mirrors what you would see inside a Lakehouse: Files/landing/retail/customer.parquet, Files/landing/retail/order.parquet, and so on.

Step 3: Inspect the File Structure¶

Understanding file sizes and layout helps you estimate storage costs, plan partition strategies, and verify that your data landed where expected.

from pathlib import Path

print("=== Lakehouse File Structure ===")
for f in sorted(Path("./lakehouse_demo").rglob("*.parquet")):
    print(f"  {f.relative_to('./lakehouse_demo')}: {f.stat().st_size:,} bytes")

Step 4: Write to Delta Lake (Lakehouse Tables)¶

Fabric Lakehouse Tables are Delta Lake tables. Delta adds ACID transactions, time travel, and schema enforcement on top of Parquet. Writing Delta locally lets you test downstream notebooks that rely on MERGE, UPDATE, or time-travel queries.

delta_files = result.to_delta("./lakehouse_demo/Tables/retail/")

print(f"Wrote {len(delta_files)} Delta tables.")
for f in delta_files:
    print(f"  {f}")

# Show the Delta log structure for the first table
delta_dirs = sorted(Path("./lakehouse_demo/Tables/retail/").iterdir())
if delta_dirs:
    first_table = delta_dirs[0]
    print(f"\n=== Delta structure for {first_table.name} ===")
    for item in sorted(first_table.rglob("*")):
        print(f"  {item.relative_to(first_table)}")

Each Delta table directory contains a _delta_log/ folder with JSON transaction logs alongside the Parquet data files.

Step 5: Verify Round-Trip Fidelity¶

Read the Parquet files back into DataFrames and compare row counts to the originals. If you lose rows or corrupt data types during export, downstream tests will be unreliable.

import pandas as pd
from pathlib import Path

print("=== Round-Trip Verification ===")
parquet_dir = Path("./lakehouse_demo/Files/landing/retail/")
for pf in sorted(parquet_dir.glob("*.parquet")):
    table_name = pf.stem
    df_read = pd.read_parquet(pf)
    original_rows = len(result.tables[table_name])
    read_rows = len(df_read)
    status = "MATCH" if original_rows == read_rows else "MISMATCH"
    print(f"  {table_name}: original={original_rows}, read_back={read_rows} [{status}]")

print("\nRound trip complete!")

Step 6: Medallion Folder Layout¶

When building a full medallion pipeline, organize your Lakehouse into three tiers. This is the folder structure you would create:

from pathlib import Path

base_dir = Path("medallion_demo")
bronze_dir = base_dir / "bronze" / "retail"
silver_dir = base_dir / "silver" / "retail"
gold_dir   = base_dir / "gold" / "retail"

for d in [bronze_dir, silver_dir, gold_dir]:
    d.mkdir(parents=True, exist_ok=True)

# Write raw data to bronze
paths = result.to_parquet(bronze_dir)
print(f"Bronze: wrote {len(paths)} Parquet files to {bronze_dir}/")
for p in paths:
    print(f"  {p.name} ({p.stat().st_size / 1024:.1f} KB)")

The bronze layer holds raw data as it arrived. Silver and gold tiers are covered in Tutorial 13: Medallion Architecture.

Run It Yourself

Notebook: T08_fabric_lakehouse.ipynb

Notebook: F01_medallion_architecture.ipynb

Fabric Lakehouse guide -- the condensed reference for Lakehouse writes

Next Step¶

Tutorial 11: Fabric Warehouse -- load synthetic data into a Fabric Warehouse using T-SQL DDL and INSERT statements.