One of the use cases I came across while working on a project was comparing two addresses. While this may sound straightforward, I assure you it's not. There is a lot to take into consideration here.
Say you have two addresses and you need to find out whether they are the same. Another form of this scenario is checking whether two names on two or more different invoices belong to the same person.
While you may use the same words everywhere, people tend to make spelling mistakes or abbreviate addresses and names. Unless you have an ML model set up, it's fairly difficult to compare, for example, 122 Regent Street, IL with 123 Rgent St, IL. See the problem?
Let me first talk about the technical aspects. Since this post is not about the nitty-gritty of the algorithms, I will just give a brief introduction. Please look them up on Google for a more detailed explanation. Several algorithms are used to tackle this sort of problem. A few of them are:
1. Levenshtein: This distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
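2. N-Gram: For text matching, this approach splits each text into overlapping sequences of n characters and scores how many of those sequences the two texts share. (N-gram language models, which predict the most probable word to follow a sequence, are a different use of the same idea.)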
3. Jaro-Winkler: This score combines the proportion of matched characters in each text with the number of transposed characters (the Jaro similarity), then adds a bonus when the two texts start with the same prefix.
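To make the first two ideas a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not the package I use later in this post. The function names `levenshtein`, `bigrams`, and `bigram_similarity` are my own, and using the Dice coefficient over character bigrams is just one common way to turn n-grams into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def bigrams(text: str) -> set:
    """Set of overlapping two-character sequences in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 1.0 means identical bigram sets."""
    ga, gb = bigrams(a.lower()), bigrams(b.lower())
    if not ga and not gb:
        return 1.0  # nothing to compare, treat as identical
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    print(levenshtein("Regent", "Rgent"))
    print(bigram_similarity("Regent Street", "Rgent St"))
```

Levenshtein gives a distance (lower means closer), while the bigram score is a similarity (higher means closer), so they need different thresholds.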
To create a bot or a sequence to do it, first download the following package:
Then use one of the algorithms provided by the package. I use Jaro-Winkler here.
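If you want to see roughly what a Jaro-Winkler score computes, here is a plain-Python sketch. To be clear, this is not the package's API. The function names `jaro` and `jaro_winkler` are my own, and the defaults `prefix_scale=0.1` and `max_prefix=4` are the commonly quoted values, assumed here rather than taken from the package.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 (nothing in common) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order; mismatched pairs are half-transpositions.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix (up to max_prefix)."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


if __name__ == "__main__":
    print(jaro_winkler("122 Regent Street, IL", "123 Rgent St, IL"))
```

In practice you would probably lowercase and trim both strings before scoring, since the comparison here is case-sensitive.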
If you look at the results, you will see a score between 0 and 1 that tends towards 1 as the two texts become more similar. This is the part where you may need to test with more data to decide what threshold makes sense for your scenario. For me, I set it at 0.84.
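One way to pick a threshold is to hand-label a few pairs as "same" or "different" and check how often each candidate threshold agrees with your labels. The sketch below is only an illustration. The pairs in `LABELLED` are made up, the helper names are my own, and `similarity` uses `difflib.SequenceMatcher` from the Python standard library as a stand-in scorer (it is not Jaro-Winkler), so swap in whichever scorer you actually use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in scorer from the standard library (not Jaro-Winkler).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.84, scorer=similarity) -> bool:
    """Treat two texts as the same when their score reaches the threshold."""
    return scorer(a, b) >= threshold


def accuracy_at(threshold: float, labelled_pairs, scorer=similarity) -> float:
    """Fraction of labelled pairs that the threshold classifies correctly."""
    correct = sum(
        (scorer(a, b) >= threshold) == expected
        for a, b, expected in labelled_pairs
    )
    return correct / len(labelled_pairs)


# Hypothetical hand-labelled pairs: (text_a, text_b, are_they_the_same)
LABELLED = [
    ("12 Oak Avenue, TX", "12 Oak Ave, TX", True),
    ("9 Pine Road, NY", "9 Pine Rd, NY", True),
    ("12 Oak Avenue, TX", "77 Lake Drive, FL", False),
    ("9 Pine Road, NY", "410 Hill Street, CA", False),
]

if __name__ == "__main__":
    for candidate in (0.70, 0.80, 0.84, 0.90):
        print(candidate, accuracy_at(candidate, LABELLED))
```

The more labelled pairs you have from your real data, the more you can trust the threshold you end up with.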
While this may seem like a very small process, it fits into a lot of larger puzzles and could add a lot of value to your work!
I hope you enjoyed this post! If you are interested in learning how to build a handwriting recognition bot, check out my post here.