Azure Data Factory — How To Upsert Data During Copy

Azure Data Factory recently released an extremely helpful feature to upsert data using copy activity. In this article, let’s explore how to utilize this feature in your next data pipeline.

I will copy data from a sample csv spreadsheet that looks like below into my Azure Synapse Data Warehouse.

ADF Upsert Sample dataset
(ADF Upsert Sample dataset)

In ADF, drag copy activity to the blank canvas. In the source dataset, I’ll provide the sample csv file. At the sink dataset, I’ll select the Azure Synapse Data Warehouse and select Auto create table for the first run. This option is to ensure that my copy activity creates the table first and then I can use the upsert feature.

Note: When you use auto create table option, all the columns are created with nvarchar(max) option. Before using the upsert option, change the data type from max to a fixed length such as nvarchar(100) in order for upsert to work.

ADF Copy Auto Create table
(ADF Copy Auto Create table)

After the first run, the table is created in the Azure Synapse Data Warehouse.

Scenario 1: Testing The Upsert (Update Only)

In the sample csv file, I am going to update the value of Article column from Synapse Data Warehouse to Synapse DWh where ID=20.

ADF Upsert Use Case 1
(ADF Upsert Use Case 1)

Back to ADF, go to the sink dataset and select the Upsert radio button.

ADF Copy Activity Upsert
(ADF Copy Activity Upsert)

The following steps need to be undertaken to perform the upsert operation:
👉 Select user DB schema: Here, select the schema of the table we need to perform the upsert upon. In my case, it’s dbo.
👉 Key columns: Here, select the columns that ADF will use to compare the data between the source and the sink. In my example, I will select ID as the column. What this tells ADF is to compare the associated values for IDs 10–50 between the sample csv file and the Azure SQL table. If ADF finds that for an existing ID, its associated value is different in source and sink, it will mark that for upsert and the data will be updated in the Azure Synapse Data Warehouse table.

Going back to my example, I have updated the value for ID = 20. I should see the updated value in the Azure Synapse Data Warehouse table without a change in other values.

ADF Upsert Key Columns
(ADF Upsert Key Columns)

✔️ After running the pipeline, I can see that the updated value for Article column is now shown in my Azure Synapse Data Warehouse.

If I check the details of the copy activity above, I can see that ADF had written rows to an internal table while it checked the data between source and sink for an upsert.

Scenario 2: Testing The Upsert (Insert Only)

What we tested above was a scenario of updating an existing data in the source. How does the upsert work when instead of an update, we have an extra row in the sample csv dataset?

ADF Upsert Use Case 2

In the same sample csv file, I have added a new row with ID = 60. The upsert option should only pick the new row and add it to the table without uploading the entire dataset again.

⏰ After running the pipeline again, let’s check the details of the run again.

✔️ We can see that now 6 rows were written in the internal table that ADF maintains during the upsert copy activity. In the Azure Synapse table, we can see the new record as follows:

📌 Tip: If your use case involves checking multiple columns before performing an upsert, you can select multiple columns from the drop-down option of Key columns.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store