Thursday, 18 June 2026

Microsoft Fabric Migration Series - Chapter 1

Microsoft Fabric Migration Series

From Synapse Serverless to Microsoft Fabric:
Solving the Deletion Vector Problem

Data Technet Principal Architect June 2026 Azure Synapse · Microsoft Fabric · Delta Lake
SSMS — Synapse Serverless SQL Pool
SELECT TOP 100 * FROM transactions_tbl;
 
Msg 19726, Level 16, State 2, Line 2
Feature 'Deletion Vectors' is not supported for table
'datasource_124485ee-6864-4f64/transactions_tbl'.
Msg 16513, Level 16, State 0, Line 2
Error reading external metadata.
 
Completion time: 2026-06-18T09:14:22.8410531+00:00

That error above is what our analysts and support team woke up to one morning. Not a permissions issue. Not a network problem. The tables were simply gone — inaccessible from SSMS, which is the primary tool our support team and analysts use for day-to-day data investigation and ad-hoc querying.

SSMS access to our Delta Lake tables is not a nice-to-have. It is the mechanism that allows our analysts and support teams to run their own queries, validate data, investigate incidents, and answer business questions — without raising a ticket and waiting for an engineer to spin up a notebook, run the query, and send back results. When that access broke, every data question became an engineering task. The support team was blocked. Analysts were blocked. The entire self-service data capability we had built over months collapsed overnight.

The root cause was a Synapse Spark pool runtime upgrade to Spark 3.5. A long-term migration to Microsoft Fabric was already on the roadmap — but that is a months-long process. We needed a quick win that restored SSMS access now, without waiting for a full platform migration. This post explains what broke, why it broke, and the exact steps we used to restore analyst and support team access within a day using Microsoft Fabric's SQL Analytics Endpoint as a drop-in replacement for Synapse Serverless SQL.




The Upgrade That Broke Everything

For a long time, Deletion Vectors were not a concern for Synapse Serverless SQL users. Synapse Spark pools running on Spark 3.3 (Delta Lake 2.2) or Spark 3.4 (Delta Lake 2.4) could not produce them — the feature simply did not exist in those Delta versions.

That changed when Microsoft released Azure Synapse Runtime for Apache Spark 3.5 (GA, August 2025), which ships with Delta Lake 3.2. Deletion Vectors are a Delta Lake 2.0+ feature, and Delta Lake 3.2 enables them by default on tables that receive DELETE or UPDATE operations.

The upgrade pressure was real. Synapse Spark 3.3 reached end of support in March 2025. Spark 3.4 follows in March 2026. Teams upgrading their Spark pools to remain on a supported runtime found that their Spark jobs began silently producing Delta tables with Deletion Vectors — without any deliberate action on their part.

Runtime
Delta Lake
EOL
Deletion Vectors
Spark 3.3
2.2
Mar 2025
Not possible
Spark 3.4
2.4
Mar 2026
Not possible
Spark 3.5 ← upgrade target
3.2
Current
Produced by default
Fabric Runtime 1.3
3.x (Kernel)
Read correctly
The key insight: The Synapse environment that produces the data (Spark pool, Runtime 3.5, Delta Lake 3.2) and the Synapse environment that queries it (Serverless SQL Pool, Delta Reader 1.0) are two completely separate engines that have diverged. The Spark pool writes modern Delta 3.x tables. The Serverless SQL Pool cannot read them. There is no configuration fix — it is an architectural incompatibility.

Why Synapse Serverless Cannot Handle Deletion Vectors

Synapse Serverless SQL uses the Delta Reader 1.0 protocol — the legacy standalone Delta reader. It understands the basics of the Delta format but has not been updated to support modern Delta Lake features introduced in the Delta 2.x and 3.x protocol.

✕  Synapse Serverless SQL
  • Delta Reader 1.0 (legacy protocol)
  • Deletion Vectors — hard error
  • Liquid Clustering — not supported
  • V2 Checkpoints — not supported
  • Column Mapping — limited support
  • Silently returns deleted rows on older DV tables
✓  Fabric SQL Analytics Endpoint
  • Delta Kernel 3.x (full protocol)
  • Deletion Vectors — correctly applied
  • Liquid Clustering — supported
  • V2 Checkpoints — supported
  • Column Mapping — supported
  • Correct rows returned, no data integrity risk
Silent data correctness risk: On tables where Deletion Vectors exist but the Delta log protocol requirement is not yet enforced, Synapse Serverless does not error — it silently returns logically deleted rows. The hard error only surfaces once the table protocol version is explicitly bumped. Before that point, your analysts may be querying stale, deleted data without any warning.

How Deletion Vectors Work in the Delta Log

Fabric reads the _delta_log, finds the deletion vector reference, loads the .dvd bitmap, and applies it as a row mask — returning only live rows. Synapse Serverless reads the Parquet files directly and ignores deletion_vectors/ entirely.


⚡ Quick Win

Restore SSMS Access in Under a Day — No Data Migration Required

The steps below do not require moving any data out of ADLS Gen2. Your Delta tables stay exactly where they are. Fabric reads them in place via a Shortcut — the same pattern as a Synapse external table, but with a fully modern Delta protocol reader underneath. This is the fastest path to restoring analyst and support team self-service access while the longer-term Fabric migration continues in the background.


Exposing ADLS Gen2 Delta Tables via Fabric SQL Analytics Endpoint

01

Create a Fabric Lakehouse

In your Microsoft Fabric workspace (assigned to an F2 or higher F-capacity): New item → Lakehouse. Name it to match your environment, e.g. enterprise_lakehouse.

Capacity delegated setting: After provisioning any new F capacity, go to Admin Portal → Capacity Settings → select the capacity → Delegate tenant settings → Users can create Fabric items → enable explicitly. This setting does not always inherit from the tenant-level switch and will prevent Lakehouse creation if missed.
02

Create an ADLS Gen2 Shortcut Under Tables

In the Lakehouse → Tables section → right-click → New shortcut → select Azure Data Lake Storage Gen2 (not Azure Blob Storage). Enter the DFS endpoint URL:

https://enterprisedatalake.dfs.core.windows.net

Select Workspace Identity as the authentication method, then navigate to your container and select the individual Delta table folders (each showing the Delta Table icon in the picker).

HNS endpoint matters: If your storage account has Hierarchical Namespace (HNS) enabled — standard for ADLS Gen2 — you must use dfs.core.windows.net. Using the blob.core.windows.net endpoint causes silent failures on Delta log traversal, producing a misleading "invalid credentials" error even when auth is correctly configured.
Tables/ not Files/: The shortcut must be created under the Tables section, not Files. Shortcuts created under Files are visible in the Lakehouse explorer but will not appear in SSMS or the SQL Analytics Endpoint. This is the single most common mistake in this setup.
03

Configure RBAC on the Storage Account

The Workspace Identity of your Lakehouse needs the correct role on the storage account:

-- Azure Portal: Storage Account → Access Control (IAM) → Add role assignment Role: Storage Blob Data Contributor Assign access to: Managed identity Member: Your Fabric Workspace Identity -- NOT Blob Contributor — that is a management plane role and -- does not grant read access to blob data.
Propagation delay: Azure RBAC assignments can take 5–10 minutes to propagate. If you receive 403 errors immediately after assigning the role, wait and retry before investigating further.
04

Wait for Schema Sync

Fabric automatically reads each _delta_log and infers the schema for the SQL Analytics Endpoint. With 20+ tables, allow 10–15 minutes for the sync to complete. Tables appear progressively — do not refresh repeatedly or navigate away, the sync is a background server-side process.

05

Connect in SSMS

In the Fabric workspace → find the SQL analytics endpoint item → ... → Copy SQL connection string. Open SSMS 19.1+:

Server: <guid>.datawarehouse.fabric.microsoft.com Authentication: Azure Active Directory - Universal with MFA Database: enterprise_lakehouse
SSMS version requirement: SSMS 19.1 or later is required. Older versions do not support Fabric SQL endpoint authentication.
06

Grant Read Access to Analysts and Support Teams

The SQL Analytics Endpoint is read-only by design — no INSERT, UPDATE, DELETE, or CREATE TABLE is possible through the endpoint. Schema-level SELECT grants are therefore the maximum permission available, with no risk of accidental data modification from an SSMS session.

-- Grant read access on all tables to a specific user GRANT SELECT ON SCHEMA::dbo TO [analyst@domain.com]; -- Or manage a team via a role CREATE ROLE db_analysts; GRANT SELECT ON SCHEMA::dbo TO db_analysts; ALTER ROLE db_analysts ADD MEMBER [analyst1@domain.com]; ALTER ROLE db_analysts ADD MEMBER [support1@domain.com];

Is the Data Live?

Yes. The shortcut is a live pointer to the ADLS Gen2 path. Every query reads the latest committed state of the Delta table directly from source. Deletion Vectors are applied at query time — rows deleted by a Spark 3.5 job will not appear in Fabric query results. No data is copied to OneLake. Storage costs do not increase.

Change in ADLS Gen2 Visible in Fabric SQL Endpoint
New rows inserted✓ Immediately
Rows updated✓ Immediately
Rows deleted (via Deletion Vectors)✓ Immediately — DVs applied correctly
New columns added✓ After schema sync refresh
New Delta table added to folder⚠ Requires new shortcut

Prerequisites Checklist

RequirementDetail
F2+ Fabric capacity Workspace must be assigned to an F or P capacity. PPU workspace mode is not sufficient for Fabric item creation.
Capacity delegated setting Admin Portal → Capacity Settings → Delegate tenant settings → "Users can create Fabric items" — must be explicitly enabled per capacity.
DFS endpoint (HNS) Use dfs.core.windows.net — not blob.core.windows.net — for HNS-enabled storage accounts.
Storage Blob Data Contributor Data plane role on the storage account. Not to be confused with Blob Contributor (management plane only).
Network access Storage account networking must allow Microsoft.Fabric/workspaces as a resource instance, or be set to "Enable from all networks".
Shortcut under Tables/ Not Files/. Shortcuts under Files are not exposed via the SQL Analytics Endpoint.
SSMS 19.1+ Older versions do not support Fabric SQL endpoint AAD authentication.

Synapse Serverless vs Fabric SQL Analytics Endpoint

Capability Synapse Serverless SQL Fabric SQL Analytics Endpoint
Delta protocol Reader 1.0 (legacy) Delta Kernel 3.x
Synapse Spark 3.5 output ✕ Protocol mismatch ✓ Same Delta generation
Deletion Vectors ✕ Hard error ✓ Fully supported
Liquid Clustering
V2 Checkpoints
SSMS connectivity
Data correctness on DV tables ✕ Silent corruption risk ✓ Correct
Data copy required No No
Setup time Under a day

A Quick Win on the Road to Fabric

SSMS access to Delta Lake tables is not a luxury — it is the self-service data layer that keeps analysts and support teams independent. When that access breaks, every data question becomes a notebook ticket to an engineer. The business slows down, the engineering team gets buried in ad-hoc queries, and trust in the data platform erodes.

The Deletion Vectors error in Synapse Serverless is not a configuration issue and it is not caused by a wrong setting in your Spark pool. It is a direct consequence of upgrading to Synapse Spark 3.5 (Delta Lake 3.2) — a necessary upgrade given that Spark 3.3 reached end of support in March 2025 and Spark 3.4 follows in March 2026. The two Synapse engines — Spark pools and Serverless SQL — have diverged at the protocol level. There is no patch coming.

Microsoft Fabric's SQL Analytics Endpoint restored our analyst and support team SSMS access within a day, with no data migration, no change to the ADLS Gen2 storage structure, and no disruption to existing Spark pipelines. The Fabric migration continues on its longer-term timeline. The business is unblocked today.

For teams running modern Delta Lake workloads written by Synapse Spark 3.5, this is not a workaround. It is the correct long-term architecture.

Thursday, 24 April 2025

Inside Spark Joins: A Deep Dive into Physical Join Strategies

Joins are central to transforming and analyzing data. In Apache Spark—a distributed data processing engine—the efficiency of a join can vary dramatically based on the physical execution strategy chosen. This blog demystifies Spark's physical join strategies and offers real-world comparisons with SQL Server join techniques to help engineers design better-performing data pipelines.


Understanding Broadcast and Shuffle in Spark

Before diving into the physical join types, it's important to understand the two fundamental mechanisms Spark uses to move and access data:

Broadcast

Broadcasting sends a small table to all worker nodes, avoiding shuffle. Ideal for cases where one dataset is significantly smaller than the other. (Azure Synapse Parallel: In Synapse Dedicated SQL Pools, this is equivalent to using replicated tables to minimize data movement during joins.)

Shuffle

Shuffling is the redistribution of data across nodes based on join keys so that matching keys co-locate on the same partition. This process is expensive due to the network and disk I/O it requires.


Join Strategies in Apache Spark

Spark uses its Catalyst Optimizer to select an appropriate physical join strategy based on dataset size, partitioning, and statistics.

1️⃣ Broadcast Hash Join (BHJ)

A fast, memory-efficient join used when one of the tables is small enough to be broadcast across the cluster. This avoids the need to shuffle the larger table, making execution significantly faster.

When to Use:

  • When one side of the join is small (default: <10MB).

How It Works:

  • Spark broadcasts the smaller dataset to every executor.
  • Each executor builds a hash map from the broadcasted table.
  • The larger dataset is scanned locally against the hash map.

Performance:

  • Fastest join method.
  • Avoids shuffles but relies on memory availability.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
SELECT /*+ BROADCAST(dim_table) */ * FROM fact_table JOIN dim_table ON ...

SQL Server Analogy: Like an indexed nested loop join using an in-memory hash structure.

️2️⃣ Shuffle Hash Join (SHJ)

A suitable join strategy when both datasets are large and unsorted. Spark redistributes data by the join key and builds in-memory hash tables on each partition.

When to Use:

  • Both datasets are large.
  • Sorting is unnecessary or disabled.

How It Works:

  • Spark shuffles both datasets by the join key.
  • On each partition, it builds a hash table on one side and probes it with the other.

Performance:

  • Involves full cluster shuffle.
  • Risk of out-of-memory errors if hash tables grow too large.

SQL Server Analogy: Resembles a standard hash join but distributed.

3️⃣ Sort-Merge Join (SMJ)

An efficient join for large datasets that can be sorted. It relies on sorted inputs and merges them like a traditional merge-sort algorithm, making it suitable for massive volumes.

When to Use:

  • Both sides of the join are large and sortable.

How It Works:

  • Spark shuffles and sorts both sides by the join key.
  • Sorted datasets are merged similarly to merge-sort.

Performance:

  • CPU intensive due to sort operation.
  • More stable than SHJ for large datasets.
spark.conf.set("spark.sql.join.preferSortMergeJoin", true)

SQL Server Analogy: Like a merge join on sorted indexes.

4️⃣ Cartesian Join

A full cross-product join with no join condition provided. Every row from the first table is paired with every row from the second, generating a large result set.

When to Use:

  • When no join condition is specified.

How It Works:

  • Every row of the first dataset is joined with every row of the second.

Performance:

  • Extremely costly (O(n*m) complexity).
  • Use with caution.

SQL Server Analogy: Equivalent to CROSS JOIN.

 

Real-World Join Scenarios

  • BHJ: Lookup product details (small) for each sales transaction (large).

  • SHJ: Joining large raw logs from different systems without indexes.

  • SMJ: Merging time-series data from two high-volume sources.

  • Cartesian Join: Creating all pair combinations for recommender systems.


Common Developer Pitfalls

  • Skewed keys causing large shuffles → Use .repartition() or salting.

  • Unexpected Cartesian joins due to missing conditions → Always validate join keys.

  • Broadcast tables exceeding memory → Monitor sizes, set thresholds.

  • Poor stats lead to bad plans → Use ANALYZE TABLE regularly.


Delta Lake Join Optimizations

Delta Lake enhances Spark joins with features like:

  • Z-Ordering to cluster data for efficient filtering/joining:

OPTIMIZE sales ZORDER BY (customer_id)
  • OPTIMIZE: Compacts small files to reduce overhead.

  • VACUUM: Removes obsolete data:

VACUUM my_table RETAIN 168 HOURS;
  • Data Skipping: Automatically skips files that don’t match the filter/join key.


The Importance of Statistics

The Catalyst Optimizer uses table statistics to choose the best join plan.

Why it Matters:

  • Guides decisions like broadcast vs. shuffle.

  • Poor stats can result in inefficient execution.

ANALYZE TABLE my_table COMPUTE STATISTICS;
spark.conf.set("spark.sql.cbo.enabled", "true")


Optimization Tips

  • Use EXPLAIN to check your physical plan.

  • Reduce data skew with partitioning or salting.

  • Align partitions using .repartition().

  • Monitor joins using Spark UI to detect excessive shuffles.

Inside Spark Joins: A Deep Dive


Spark vs SQL Server – Reference Table

ConceptSQL ServerApache Spark
Indexed JoinNested LoopBroadcast Hash Join
Hash JoinHash JoinShuffle Hash Join
Sorted JoinMerge JoinSort-Merge Join
Cross ProductCartesian JoinCartesian Join
Strategy ControlOptimizer + StatsCatalyst + CBO + Hints


Conclusion

Choosing the right join in Apache Spark can drastically affect performance. By understanding how joins are executed and how Spark makes optimization decisions, developers can take control over performance-sensitive operations. Whether you come from a SQL Server background or are optimizing Spark pipelines directly, the key is in understanding your data and the available strategies.

Stay tuned for upcoming examples and walkthroughs of join tuning in Spark production environments!

Tuesday, 15 April 2025

From Synapse to Fabric: Embracing Cost Flexibility with Autoscale Billing for Spark

 

As a Principal Architect evaluating enterprise data platforms, one of the most appreciated features for customers using Azure Synapse Analytics has been the flexibility and cost efficiency of its Spark pools. With dynamic scaling and a true pay-as-you-go billing model, Synapse Spark enabled workloads to spin up only when needed—without any ongoing costs during idle periods.

This model provided the ideal balance for teams running bursty, exploratory, or scheduled Spark workloads—especially where usage patterns were unpredictable or seasonal.

However, when assessing a transition to Microsoft Fabric, many Synapse customers quickly ran into a cost-related roadblock: Fabric’s Capacity Unit (CU) model required reserving resources upfront, introducing a fixed cost—even if Spark was used occasionally. This shift challenged the financial efficiency Synapse Spark users had come to rely on.

At a recent Big Data conference, discussions with the Microsoft Fabric product team revealed that, at that time, managing costs in Fabric required implementing external processes to scale up and down or pause the Fabric Capacity Units (CUs) when not in use. While this approach could mitigate some costs, it introduced additional overhead and administrative complexity, making it less than ideal for certain operational models.


Comparative Analysis: Synapse Spark Pools vs. Fabric Shared Capacity



The Turning Point: Autoscale Billing for Spark

The introduction of Autoscale Billing for Spark in Microsoft Fabric marked a significant advancement. This feature reintroduced the flexibility found in Synapse by allowing Spark jobs to run on dedicated, serverless resources, billed independently from Fabric capacity. It effectively brought back the pay-as-you-go model, enabling dynamic scaling of Spark workloads without the constraints of reserved capacity.

Key Benefits:

  • Cost Efficiency: Pay only for the compute used during Spark job execution, eliminating idle costs.

  • Independent Scaling: Spark workloads scale separately from other Fabric services, ensuring optimal performance.

  • Resource Isolation: Dedicated serverless resources prevent resource contention with other workloads.

  • Quota Management: Set maximum CU limits to control budget and resource allocation.

This model aligns perfectly with our operational patterns, allowing us to run ad-hoc and bursty Spark jobs without overcommitting resources.


Implementing Autoscale Billing: A Step-by-Step Guide

Enabling Autoscale Billing for Spark in Microsoft Fabric is straightforward:

  1. Navigate to the Microsoft Fabric Admin Portal.

  2. Under Capacity settings, select your desired capacity.

  3. In the Autoscale Billing for Fabric Spark section, enable the toggle.

  4. Set the Maximum Capacity Units (CU) limit according to your requirements.

  5. Click Save to apply the settings.

Note: Enabling or adjusting Autoscale Billing settings will cancel all active Spark jobs running under Autoscale Billing to prevent billing overlaps.

Monitoring and Cost Management

Post-implementation, we utilized Azure's Cost Management tools to monitor compute usage effectively:

  • Access the Azure portal and navigate to Cost Analysis.

  • Filter by the meter "Autoscale for Spark Capacity Usage CU" to view real-time compute spend for Spark workloads.

This transparency allowed us to track expenses accurately and adjust our strategies as needed.

The introduction of Autoscale Billing for Spark in Microsoft Fabric addresses a critical concern for Synapse Spark customers—maintaining cost flexibility while transitioning to a modern, unified analytics platform. By allowing Spark jobs to run on dedicated serverless compute, billed independently from reserved Fabric capacity, it brings back the on-demand model that many teams have relied on for years.

This feature, currently in Preview, represents a major step forward in making Microsoft Fabric more accessible and cost-efficient for diverse Spark workloads. I’m looking forward to seeing this capability move into General Availability soon, unlocking its full potential for broader adoption in production-grade environments.

For a detailed walkthrough on configuring Autoscale Billing for Spark, refer to the official documentation here.

Tuesday, 21 February 2023

Powershell to Restore Azure SQL database PITR backup

Today there seems to be a bug on Portal resource Manager API not allowing any SQL databases pitr restores from portal and the only way to do it at this point is using scripts, The below script is an example to restore the database on the same server with Feb 20th 3 PM backup

Copyright © 2023 Vinoth N Manoharan.The information provided in this post is provided "as is" with no implied warranties or guarantees.

Power BI DAX for Start of Year To End of Prior Month Sales

We had a requirement for Power BI Measure Start of the year To End of Prior Month Sales sum for one of the report which was failed for January as the original Measure was just query Previous Month and going to a negative Data Diff value obviously for January Eg:DATEDIFF(1jan2023,31DEC2022). To avoid this we had to introduce a check on the measure if the month is january then we need SUM of Sales of January rather than previous Month. I used the below logic to solve this issue
Copyright © 2023 Vinoth N Manoharan.The information provided in this post is provided "as is" with no implied warranties or guarantees.

Bug In Azure API creating Data Factory Default AutoResolveIntegrationRuntime on Managed Virtual Network

Recently our terraform Infrastructure as code effort to create a datafactory resource with default AutoResolveIntegrationRuntime in Managed Virtual Network failed to create one.

We had managed_virtual_network_enabled flag on terraform and tried to use native azure cli too as a backup but both failed to create the default integration runtime on Managed Virtual Network and created a default public integration runtime. Only work around it was to create a Data Factory using ARM Template. I exported the ARM template and created the below Powershell to create the Data factory. I stored the Template and parameter file on a Fileshare and used it in the powershell to reuse and create a parameter file each time for different DF creation.
Copyright © 2023 Vinoth N Manoharan.The information provided in this post is provided "as is" with no implied warranties or guarantees.

Powershell Azure Devops REST API to create/update a variable group based on json variable input

 I had a requirement to automate creation and updating of Azure devops variable group based on a configuration variable json file proved to me by the application team for CICD process, the below script will look for the variable group if it exists with the project mentioned if not it will create a new variable group, If the variable group exists it will update the variable group with json provided. The idea is to maintain the json file on git for version and see the history of the changes on the variable group variables provided to the pipeline.

The below is the format of the Variable json passed on to the script

Copyright © 2023 Vinoth N Manoharan.The information provided in this post is provided "as is" with no implied warranties or guarantees.