Fabric Lakehouse

Load Spindle-generated data into a Microsoft Fabric Lakehouse as Delta tables via the Files API or OneLake paths.

Quick Start

from sqllocks_spindle import Spindle, RetailDomain

result = Spindle().generate(domain=RetailDomain(), scale="small", seed=42)

# Write Parquet files to the Lakehouse Files area
result.to_parquet("/lakehouse/default/Files/spindle/retail")

Once the files are written, register them as Delta tables from a Fabric notebook:

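# `spark` is the session that Fabric notebooks provide automatically.
for table_name in result.table_names:
    # Spark resolves relative paths against the default Lakehouse;
    # the /lakehouse/default mount is meant for local file APIs such as pandas.
    path = f"Files/spindle/retail/{table_name}.parquet"
    df = spark.read.parquet(path)
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)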

Writing to Lakehouse Files

Spindle's LakehouseFilesWriter handles path resolution and format selection:

from sqllocks_spindle.fabric import LakehouseFilesWriter

writer = LakehouseFilesWriter(
    lakehouse_path="/lakehouse/default",
    subfolder="spindle/retail",
    format="parquet",          # parquet | csv | jsonl
)
writer.write(result)

OneLake Paths

When running outside a Fabric notebook, use full abfss:// paths:

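# "workspace" and "lakehouse" are placeholders for your own workspace and Lakehouse names.
writer = LakehouseFilesWriter(
    lakehouse_path="abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse.Lakehouse",
    subfolder="spindle/retail",
)
writer.write(result)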
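
The abfss:// path above has the shape workspace@host/lakehouse.Lakehouse. As a minimal sketch, a small helper can assemble it from the two names so the same root is not retyped across scripts. The function name `onelake_lakehouse_path` is hypothetical and not part of Spindle.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_lakehouse_path(workspace: str, lakehouse: str) -> str:
    """Build the abfss:// root of a Lakehouse (hypothetical helper, not a Spindle API)."""
    for label, value in (("workspace", workspace), ("lakehouse", lakehouse)):
        # Reject empty names and characters that would change the URI structure.
        if not value or "/" in value or "@" in value:
            raise ValueError(f"invalid {label} name: {value!r}")
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse"


print(onelake_lakehouse_path("workspace", "lakehouse"))
# abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse.Lakehouse
```

The returned string can be passed as lakehouse_path to LakehouseFilesWriter.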

CLI

# Generate and write directly to Lakehouse Files
spindle generate retail --scale small --format parquet --output /lakehouse/default/Files/spindle/retail

# From a local machine, generate files then upload separately
spindle generate retail --scale medium --format parquet --output ./output/retail
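
For the "upload separately" step, one option is the Azure Data Lake SDK, which can write to OneLake. The sketch below is not a Spindle feature: `plan_uploads` and `upload_all` are hypothetical helper names, and the upload step assumes the azure-identity and azure-storage-file-datalake packages are installed and that your identity can write to the workspace.

```python
from __future__ import annotations

from pathlib import Path


def plan_uploads(local_dir: str, lakehouse: str, subfolder: str) -> list[tuple[Path, str]]:
    """Pair each local Parquet file with its destination path inside the workspace."""
    remote_root = f"{lakehouse}.Lakehouse/Files/{subfolder.strip('/')}"
    return [
        (path, f"{remote_root}/{path.name}")
        for path in sorted(Path(local_dir).glob("*.parquet"))
    ]


def upload_all(workspace: str, plan: list[tuple[Path, str]]) -> None:
    """Upload the planned files to OneLake, overwriting existing files."""
    # Imported lazily so the planning step works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        "https://onelake.dfs.fabric.microsoft.com",
        credential=DefaultAzureCredential(),
    )
    # In OneLake the workspace plays the role of the file system (container).
    file_system = service.get_file_system_client(workspace)
    for local_path, remote_path in plan:
        with local_path.open("rb") as handle:
            file_system.get_file_client(remote_path).upload_data(handle, overwrite=True)
```

A call such as upload_all("workspace", plan_uploads("./output/retail", "lakehouse", "spindle/retail")) would then place the files under Files/spindle/retail, where "workspace" and "lakehouse" are placeholders.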

File Organization

Spindle writes one file per table, organized by domain:

/lakehouse/default/Files/spindle/
└── retail/
    ├── customer.parquet
    ├── address.parquet
    ├── product.parquet
    ├── order.parquet
    ├── order_line.parquet
    └── ...
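
Because the layout is one file per table, it is easy to check that a domain folder is complete before registering Delta tables. The following is a minimal sketch: `missing_tables` is a hypothetical helper, and it assumes files are named after the table with the format as extension, as in the tree above.

```python
from __future__ import annotations

from pathlib import Path


def missing_tables(domain_dir: str, tables: list[str], ext: str = "parquet") -> list[str]:
    """Return the table names that have no `<table>.<ext>` file in domain_dir."""
    folder = Path(domain_dir)
    return [name for name in tables if not (folder / f"{name}.{ext}").is_file()]
```

In a Fabric notebook this could be called as missing_tables("/lakehouse/default/Files/spindle/retail", result.table_names); an empty list means every table has a file.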

Scale Recommendations

Lakehouse Scale   Spindle Scale   Approx. Rows   Approx. Size
Dev / POC         demo            ~5K            < 1 MB
Small             small           ~15K           ~5 MB
Medium            medium          ~1M            ~200 MB
Large             large           ~15M           ~3 GB
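
If you start from a target row count rather than a Lakehouse size, the table can be turned into a lookup. This is only a sketch: the row counts are the approximate figures listed above, and `smallest_scale_for` is a hypothetical helper, not a Spindle function.

```python
# Approximate row counts copied from the scale table.
APPROX_ROWS = {"demo": 5_000, "small": 15_000, "medium": 1_000_000, "large": 15_000_000}


def smallest_scale_for(min_rows: int) -> str:
    """Return the smallest scale whose approximate row count reaches min_rows."""
    for scale, rows in sorted(APPROX_ROWS.items(), key=lambda item: item[1]):
        if rows >= min_rows:
            return scale
    raise ValueError(f"no listed scale reaches {min_rows:,} rows")


print(smallest_scale_for(10_000))     # small
print(smallest_scale_for(2_000_000))  # large
```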

Tips

  • Use parquet format for best Lakehouse performance and schema preservation.
  • Set seed=42 (or any fixed seed) for reproducible datasets across environments.
  • For multi-domain loads, generate each domain into its own subfolder.
  • After loading, create shortcuts to other Lakehouses or Warehouses as needed.
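
One way to confirm that a fixed seed produced the same dataset in two environments is to compare a digest of the output folders. The sketch below uses a hypothetical helper, `dataset_digest`, and assumes the files are byte-identical when the data is identical. That is a stricter test than row equality: Parquet metadata can differ between library versions, so a mismatch is a prompt to look closer, not proof that the rows differ.

```python
import hashlib
from pathlib import Path


def dataset_digest(folder: str, pattern: str = "*.parquet") -> str:
    """Hash file names and contents, in sorted order, into one SHA-256 hex digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(folder).glob(pattern)):
        digest.update(path.name.encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Run it against the output folder in each environment and compare the two strings.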

Profiling Lakehouse Tables (Inference Depth)

LakehouseProfiler reads Delta tables directly from a Fabric Lakehouse over ABFSS and returns the same DatasetProfile type as the other profiler entry points. Use it to infer a schema from existing Lakehouse tables, then generate synthetic copies that follow the profiled statistics.

Requires the [fabric-inference] extra:

# Quote the requirement so shells such as zsh do not expand the brackets.
pip install "sqllocks-spindle[fabric-inference]"

Profile a Single Table

from sqllocks_spindle.inference import LakehouseProfiler, SchemaBuilder

lp = LakehouseProfiler(
    workspace_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    lakehouse_id="yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy",
    # token_provider=None uses DefaultAzureCredential automatically
)

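# Pick one of the following calls; the first two are equivalent.
profile = lp.profile_table("sales_orders")                       # uses the default sample of 100,000 rows
profile = lp.profile_table("sales_orders", sample_rows=100_000)  # same, stated explicitly
profile = lp.profile_table("sales_orders", sample_rows=None)     # full scan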

Profile All Tables

# Returns dict[str, DatasetProfile] — one entry per table
profiles = lp.profile_all(sample_rows=100_000)

for table_name, profile in profiles.items():
    schema = SchemaBuilder().build(profile, domain_name=table_name)
    print(f"{table_name}: {schema}")

Save and Reuse Profiles

from sqllocks_spindle.inference import ProfileIO

ProfileIO.save(profile, "sales_orders_profile.json")
profile = ProfileIO.load("sales_orders_profile.json")
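
Saving a profile pays off most when it lets you skip the Lakehouse scan on later runs. A minimal sketch of that pattern follows. `load_or_build` is a hypothetical helper written generically, so it relies only on the ProfileIO.save(profile, path) and ProfileIO.load(path) call shapes shown above.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    path: str,
    build: Callable[[], T],
    load: Callable[[str], T],
    save: Callable[[T, str], None],
) -> T:
    """Load a cached value from path, or build it once and save it there."""
    if Path(path).exists():
        return load(path)
    value = build()
    save(value, path)
    return value


# Intended use with the objects from this page (not executed here):
# profile = load_or_build(
#     "sales_orders_profile.json",
#     build=lambda: lp.profile_table("sales_orders"),
#     load=ProfileIO.load,
#     save=ProfileIO.save,
# )
```

Delete the JSON file whenever the source table changes enough that the cached profile no longer represents it.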

End-to-End: Lakehouse → Synthetic Data

from sqllocks_spindle import Spindle
from sqllocks_spindle.inference import LakehouseProfiler, SchemaBuilder

lp = LakehouseProfiler(workspace_id="...", lakehouse_id="...")
profile = lp.profile_table("sales_orders")

schema, registry = SchemaBuilder().build(
    profile,
    domain_name="sales",
    correlation_threshold=0.5,
    include_anomaly_registry=True,
)

result, fidelity = Spindle().generate(schema, seed=42, fidelity_profile=profile)
fidelity.summary()

See Also