
Synapse Syntax Wish List

goal

maximize the amount of common TSQL syntax across:

  • SQL Server 2016+
  • Azure SQL
  • Azure Synapse dedicated pool (ASDP)

In doing this, the Azure data platform delivers:

  • lower friction for moving
    • between Azure data products, and
    • to an Azure data product from a competitor's product; and
  • a better
    • overall DW dev experience via ASDP (and dbt), and
    • open-source developer experience, by reducing the amount of code necessary to work with all Azure SQL products.

rationale

The syntactic differences between ANSI-SQL dialects like PostgreSQL and TSQL are a barrier to entry for customers new to Azure.

Additionally, the intra-TSQL syntax differences are barriers to entry for users coming to ASDP from on-prem SQL Server. Making matters worse, there is no published document or cheat sheet that details the differences and includes workarounds.

Fortunately for dbt users, the dbt-msft adapters already handle these syntactic differences. This means that a dbt project created on Azure SQL can be "migrated" to work on Azure Synapse by changing literally a single YAML parameter (provided the raw source tables already exist).

Nevertheless, a more standard API would greatly reduce the friction of moving to and from the above products for ALL TSQL products. Our team already charged through these differences when moving our warehouse to ASDP, before we learned of dbt.

Increasingly, data engineering tools are standardizing on SQL as the lingua franca for working with data. We can see this with:

  • Synapse's serverless pool, which allows querying data lakes with a SQL API, and
  • Databricks' release of SQL Analytics -- which is effectively a SQL DW API on top of Delta Lake (pyspark and notebooks? never heard of them...)

One final benefit is that closer API alignment would drastically reduce the code footprint of this adapter. Instead, virtually all macros could be defined once in dbt-sqlserver, then imported. This makes long-term maintenance easier, which, in open source, can be a sink-or-swim factor.

Syntax Differences

1) Table Valued Constructor

relevant Uservoice Idea

This is critical for the dbt seed command, which loads a local CSV into the database.

This difference has brought the dev team all kinds of problems.

Azure SQL

insert into dbo.ceo (id, name, nerd_cred)
values
    (1, 'Bill Gates', 9),
    (2, 'Steve Ballmer', 5),
    (3, 'Satya Nadella', 7);

Synapse

insert into dbo.ceo (id, name, nerd_cred)
SELECT 1, 'Bill Gates', 9 UNION ALL
SELECT 2, 'Steve Ballmer', 5 UNION ALL
SELECT 3, 'Satya Nadella', 7;

2) MERGE

relevant Uservoice Idea

This is already in public preview, but we would love to know when it becomes GA. When it does, we can drop the synapse__snapshot_merge_sql macro with its UPDATE->INSERT workaround and rely on the global project's implementation.

{% macro synapse__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}
    EXEC('
        BEGIN TRANSACTION

        update {{ target }}
        set dbt_valid_to = TMP.dbt_valid_to
        from {{ source }} TMP
        where {{ target }}.dbt_scd_id = TMP.dbt_scd_id
          and TMP.dbt_change_type = ''update''
          and {{ target }}.dbt_valid_to is null;

        insert into {{ target }} (
            {{ insert_cols_csv }}
        )
        select {{ insert_cols_csv }}
        from {{ source }}
        where dbt_change_type = ''insert'';

        COMMIT TRANSACTION;
    ');
{% endmacro %}
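For reference, once MERGE is GA, the workaround above could collapse into a single statement along these lines. This is only a hedged sketch, not dbt's actual implementation; the table names and the (id, name) columns are hypothetical:

MERGE INTO dbo.my_snapshot AS DEST
USING dbo.my_snapshot_staging AS SRC
    ON DEST.dbt_scd_id = SRC.dbt_scd_id
-- close out records that changed, mirroring the UPDATE leg above
WHEN MATCHED
    AND DEST.dbt_valid_to IS NULL
    AND SRC.dbt_change_type = 'update'
    THEN UPDATE SET dbt_valid_to = SRC.dbt_valid_to
-- add brand-new records, mirroring the INSERT leg above
WHEN NOT MATCHED
    AND SRC.dbt_change_type = 'insert'
    THEN INSERT (id, name, dbt_scd_id, dbt_valid_to)
         VALUES (SRC.id, SRC.name, SRC.dbt_scd_id, SRC.dbt_valid_to);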

3) Drop ... if exists

DROP [TABLE/VIEW/SCHEMA/INDEX] ... IF EXISTS

relevant Uservoice idea ("suggested" 2016, "started" 2018, "planned" 2019)

This one is a lower priority because, IMHO, using the first statement below works on all the products, including SQL Server <2016. Though, the simplicity of the second statement is alluring.

Azure Synapse & SQL Server <2016

if object_id('dbo.clippy', 'V') is not null
begin
    DROP VIEW dbo.clippy
end

Azure SQL & SQL Server >=2016

DROP VIEW IF EXISTS dbo.clippy

4) Accessing data from blob or an external table

OPENROWSET() vs CREATE EXTERNAL TABLE() vs COPY INTO

relevant Uservoice idea

These API discrepancies are painfully confusing: it's a grab bag of nouns. The asks are:

  • can there be common patterns for creating external tables across these TSQL products?
  • can Synapse support RDBMS External tables and Azure SQL support HADOOP External Tables?

The impact on the dbt-msft project comes via dbt-external-tables, which currently supports rows 1 & 4 of the table below.

There are huge UX gains left on the table with such disparate syntax across products. We'll add more specific samples later; a first sketch follows the table below.

| product | table type | type | statement | requires explicit columns | objects required |
| --- | --- | --- | --- | --- | --- |
| Azure SQL | true external | RDBMS (i.e. another database) | CREATE EXTERNAL TABLE | yes | External Datasource, Scoped Database Credential |
| Azure SQL | materialized external | RDBMS (i.e. another database) | INSERT INTO … OPENROWSET | yes | all defined in statement |
| Azure SQL | materialized external | HADOOP? | INSERT...SELECT * FROM OPENROWSET(BULK...) | no | External File Format, External Datasource, Scoped Database Credential |
| ASDP | true external | HADOOP (i.e. blob or datalake) | CREATE EXTERNAL TABLE | yes | External Datasource, External File Format, Scoped Database Credential |
| ASDP | materialized external | HADOOP? | COPY INTO [blob path] | no | all defined in statement |
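As a first sample, here's a hedged sketch of rows 4 and 5 (the ASDP flavors). Every object name, path, and file format below is made up, and the real statements take many more options:

-- row 4: a true external table over blob/data lake (ASDP)
CREATE EXTERNAL DATA SOURCE my_lake
WITH (TYPE = HADOOP, LOCATION = 'abfss://raw@myaccount.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT my_parquet
WITH (FORMAT_TYPE = PARQUET);

-- explicit columns required
CREATE EXTERNAL TABLE dbo.ceo_ext (id int, name varchar(100), nerd_cred int)
WITH (LOCATION = '/ceo/', DATA_SOURCE = my_lake, FILE_FORMAT = my_parquet);

-- row 5: materialized ingestion; everything is defined in the statement itself
COPY INTO dbo.ceo
FROM 'https://myaccount.blob.core.windows.net/raw/ceo/'
WITH (FILE_TYPE = 'PARQUET');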

5) Nested CTEs

For a deep dive, see this entire treatise on nested CTEs elsewhere on the site.

As of now, dbt data tests that are defined with CTEs fail, because the user-defined data test is itself wrapped into another CTE when executed. There's a way to fix this, but it isn't pretty. See #25.

To clarify, this isn't an ask for recursive CTEs, just nested ones. Here's an issue opened on the SQL docs repo.

can't be done in ASDP

WITH dbt__CTE__INTERNAL_test AS (
    WITH cte_test AS (
        SELECT * FROM "dbo"."clippy"
    )
    SELECT TOP 0 * FROM cte_test
)
SELECT COUNT(*) FROM dbt__CTE__INTERNAL_test
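The ASDP-compatible equivalent flattens the nesting into sequential CTEs, which is essentially what the fix has to do:

WITH cte_test AS (
    SELECT * FROM "dbo"."clippy"
),
dbt__CTE__INTERNAL_test AS (
    SELECT TOP 0 * FROM cte_test
)
SELECT COUNT(*) FROM dbt__CTE__INTERNAL_test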

6) sp_rename unsupported in Synapse serverless pools

You can't rename views and external tables in Synapse serverless SQL pools. It is, however, supported in dedicated SQL pools. See MicrosoftDocs/sql-docs/pull/6793.

Other Differences

1) tempdb.INFORMATION_SCHEMA.COLUMNS

relevant Uservoice idea (suggested Mar 31, 2020)

This introduces challenges for the macro get_columns_from_relation(). It works fine for normal relations because INFORMATION_SCHEMA.COLUMNS covers them, but temp tables live in tempdb, and ASDP doesn't expose tempdb.INFORMATION_SCHEMA.COLUMNS, so nothing is returned.

You can see the discussion in this PR, in which we're trying to come up with a robust solution as part of our release of v0.19.0.

This breaks the adapter's ability to use the snapshot (i.e. SCD type 2) materialization.

The current theory for the best workaround is what's suggested in this Stack Overflow post (sketched end-to-end after the list), which is:

  1. if get_columns_from_relation() is called and the relation is a temp table (i.e. its name starts with #), then:
  2. call CTAS to select one row from the temp table into a non-temp table
    CREATE TABLE dbo.temp_table_hack WITH (DISTRIBUTION = ROUND_ROBIN) AS SELECT TOP(1) * FROM #actual_temp_table
  3. recursively call get_columns_from_relation() on this new table
  4. drop the non-temp table
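Spelled out in TSQL, the whole dance might look like the following. All names are hypothetical, and we assume ROUND_ROBIN since CTAS in dedicated pools requires a distribution option:

-- step 2: CTAS one row of the temp table into a real table
CREATE TABLE dbo.temp_table_hack
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT TOP(1) * FROM #actual_temp_table;

-- step 3: the real table's metadata IS visible to INFORMATION_SCHEMA
SELECT column_name, data_type, character_maximum_length
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'dbo' AND table_name = 'temp_table_hack';

-- step 4: clean up
DROP TABLE dbo.temp_table_hack;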

This isn't very pretty, but at least it will let dbt users have a frictionless experience moving their dbt projects between TSQL products.

2) Recursive CTEs

relevant Uservoice idea (suggested: Oct 2016, under review: May 2018)

We don't have to explain this one too much; it's covered clearly in the docs, and there are also documented workarounds.
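For completeness, a minimal example (ours, not from the docs) of the kind of self-referencing CTE that ASDP rejects:

-- builds the numbers 1 through 10 by referencing itself
WITH nums AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 10
)
SELECT n FROM nums;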

3) External tables of type: RDBMS

relevant Uservoice idea (suggested: Mar 2020)

The limitation of only allowing 8 persistent connections at 100 DWU requires us to have downstream Azure SQL marts that we expose to end-users. We're using RDBMS/"Elastic Query" External Tables in these Azure SQL dbs to get data from the ASDP.
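For context, the Elastic Query plumbing in those marts looks roughly like this; the server, database, and credential names are invented:

-- point the Azure SQL mart at the ASDP database
CREATE EXTERNAL DATA SOURCE asdp_dw
WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'my_dedicated_pool',
    CREDENTIAL = my_scoped_credential
);

-- a "true external" table whose data actually lives in the ASDP (row 1 of the earlier table)
CREATE EXTERNAL TABLE dbo.ceo (id int, name varchar(100), nerd_cred int)
WITH (DATA_SOURCE = asdp_dw);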

However, we also often want to ingest from various applications' transactional databases. An Elastic Query connection from ASDP to these transactional dbs would save us from having to set up ADF pipelines and introduce an intermediary blob layer between the dbs.

The other benefit would be more streamlined syntax between the Azure SQL products.