Skip to main content

What is Azure Synapse Dedicated SQL Pool?

Azure Synapse Dedicated SQL Pool (formerly Azure SQL Data Warehouse) is a massively parallel processing (MPP) database similar to other columnar-based scale-out technologies such as Snowflake, Amazon Redshift, and Google BigQuery. To end users it behaves much like traditional SQL Server, but behind the scenes it distributes storage and processing across multiple nodes. This can dramatically improve performance for data warehouses larger than a few terabytes, though it may not be the right choice for smaller implementations. Because the underlying architecture differs significantly from traditional SQL Server, syntax and development methodologies also differ — however, when using Jet Analytics Data Integration with Azure Synapse Analytics, all required code is generated automatically.

See Use Azure Synapse Dedicated SQL Pool for Prepare Instance Storage for configuration instructions.

Compute vs storage costs

Azure Synapse Analytics separates compute and storage costs. The service can be paused, releasing compute resources back to Azure. While paused, charges apply only to the storage in use (approximately $125 USD/month/TB at time of writing). Data remains intact but cannot be queried during this time. Resuming the SQL Pool re-allocates compute resources and makes data available again.

Computation in a Synapse SQL Pool is measured in Data Warehouse Units (DWUs), similar to Azure SQL Database's DTUs. Adjusting DWUs increases or decreases the number of available compute nodes and the relative performance and cost of the service.

Architecture

Processing in a Synapse SQL Pool is distributed across multiple nodes of different types:

  • Control node: Accepts end-user queries, then optimises and coordinates them to run in parallel across the compute nodes.
  • Compute nodes: A Synapse SQL Pool can distribute processing across up to 60 compute nodes depending on the service level. The more DWUs assigned, the more compute nodes are used. Traditional SQL databases are limited to the resources of a single machine.
  • Azure Storage: Data is maintained in Azure Storage separate from the control and compute nodes, ensuring data integrity while scaling. Tables are always spread across 60 distributions to further optimise processing of large data sets.
  • Data Movement Service (DMS): Manages the movement of data across compute nodes.

Table distribution

To balance processing across nodes, tables are split across 60 distributions — a process also known as sharding. The distribution method determines how rows are organised across nodes.

  • Round-robin (default): Distributes rows randomly and evenly across nodes. Not optimised for query performance, as there is no logic to how data is split. Joining to round-robin tables often requires data shuffling, which increases query time. Ideal for staging tables.
  • Replicated: Replicates all rows of the table on every node. Load times are slower, but queries are fast because data shuffling is never required. Ideal for dimension tables smaller than 2 GB.
  • Hash: Distributes rows using a designated distribution column. All rows with the same value in the distribution column are placed on the same node. Delivers the highest query performance for joins and aggregations on large tables. Ideal for fact tables and dimension tables larger than 2 GB.

See Set distribution for tables in Jet Analytics Data Integration for instructions on setting the distribution type and column.

Choosing a distribution column

Use the following guidelines to select a distribution column that maximises query performance:

  • No updates: Distribution columns cannot be updated once set.
  • Even distribution of values: For best performance, all distributions should contain roughly the same number of rows — query speed is limited by the largest distribution. Aim for columns with:
    • Many unique values — more unique values increases the likelihood of an even distribution.
    • Few or no null values.
    • Not a date column — if all users filter on the same date, only one node processes the work.
  • JOIN or GROUP BY column: Selecting a column commonly used in joins or GROUP BY clauses reduces data movement when processing queries.
  • No good single option: Create a composite column from multiple join columns to use as the distribution key.

When should I consider Azure Synapse Analytics?

  • Database larger than 1 TB: Tables in Azure Synapse are always spread across 60 distributions, so performance gains are not typically realised until the data warehouse exceeds 1–5 TB. As a general rule, data warehouses smaller than 1 TB will perform better on Azure SQL Database than on a Dedicated SQL Pool.
  • Tables with more than 100 million rows: Database size alone is not the only consideration. Since distribution happens at the table level, if all tables contain fewer than 100 million rows, the performance boost from Azure Synapse may not be significant.
  • Fewer than 128 concurrent queries: When the SQL Pool receives more than 128 concurrent queries, it queues them on a first-in, first-out basis. Azure SQL Database and Analysis Services support significantly higher concurrent query loads. For high-concurrency workloads, feeding data into an Analysis Services server is recommended.
  • Willingness to tune: Distribution strategy should be based on how users query the data, and getting it right can significantly affect performance. Like any data warehouse, a Synapse SQL Pool requires monitoring and tuning of distribution keys, indexes, caching, and partitions to achieve optimal performance. It is not a universal fix for query performance problems.

Was this article helpful?

We're sorry to hear that.