How To Use Fuzzy Matching In Azure Data Factory
This article explains how to use Fuzzy Matching in Azure Data Factory Data Flow.
One of the most amazing features Microsoft could have unveiled has been released in its Data Flow capacity — Fuzzy Match.
To give you a quick background of what a fuzzy match is, consider the below scenario —

You have 2 tables both containing addresses. As a Data Engineer, you need to match the data from Table A with Table B and assign the corresponding city to the result. Upon analyzing, we can clearly see that the strings are NOT exactly equal (emphasis on exactly). We know that they are the same address but a computer algorithm doesn’t know that. If you performed a SQL join on the two addresses, you won’t get any results.
Hence, the art of approximate string comparison comes into the picture here which we call as fuzzy matching.
In ADF’s Data Flow Activity, this feature is provided when you select a Join transformation. In the image below, I have selected a .csv file as a data source and then picked Join transformation.

When we click the checkbox of fuzzy matching, we see additional information.

- Combine text parts: If we look at the problem above, Newark Street 1800 can be joined with Newark St 1800 or both the strings can be converted to a continuous string such as NewarkStreet1800 and NewarkSt1800. This is especially helpful if your data source contains uneven spaces between the words. Thus, a good idea is to trim all the space so that the matching algorithm has a better shot at properly comparing the two strings.
- Similarity score column: When the algorithm compares addresses from Table A with Table B, it will provide a percentage match to each combination. A similarity score column is that additional column which tells a user how much percentage was allocated by the algorithm to each pair of strings.
Take as an example the table below. I assigned a dummy match percentage to each combination pair between Table A and Table B. As we can see, the highest percentage match is between the strings that the algorithm identifies to be the closest match with each other whereas the other matches fall down on the percentage ladder (between 0 and 100 where 0 is no match and 100 is complete match).

3. Join conditions: As we can see, there’s a slider for similarity threshold with range between 60 to 100. The purpose of this similarity threshold is for the user to set a cut-off percentage. Any value that falls below this percentage will be considered as false matches and any value above this percentage will be treated as a correct match.
📌 Things To Remember
- In order to use fuzzy matching, select the columns whose data types are string. Fuzzy match will not work on any other data type.
- Ensure the broadcast feature is off. To read more about broadcasting, here’s a good documentation from Microsoft.
- For now, fuzzy match is only available for inner, full and left outer joins.
- When we uncheck the fuzzy matching option, we can see that the join condition is provided as shown below. The options are removed in fuzzy matching as the algorithm will only use the equality operator to process the fuzzy logic.

📢 Tip — In my experience of working with fuzzy string matches, the threshold is by far the most step in tagging matches as true matches. It usually is an iterative process where as a Data Engineer, I need to play around with different thresholds to ensure the algorithm is not discarding any true matches.
My recommendation is to start with a conservative similarity threshold between 60% — 70% and see the end results and compare those with the original data. it’s better if you can plot the final matches on a bar graph. This helps in visualizing where most of the match percentages lie. If a majority of the matched data is leaning towards the 80% or higher mark, your threshold is good. If it’s lower, you might want to toggle the threshold to see which baseline value gives you the true matches.
And Voila. This is a great feature to use if your work involves any sort of fuzzy matching or you are not in the mood to write out an entire algorithm on your own to implement a similar functionality.