Levenstein distance, editing algorithm
In the daily life, we all have to face with computer applications that have a huge amount of information and resources. To satisfy the user in a very short time these applications have developed various systems to facilitate access to the information held. It's very easy to search on Google (or on any other search engine) what we want to find instead of taking a list of a few trillions of sites to make the same operation.
One of the major problems with which search engines face is the carelessness of the end-user, it frequently happens to spell a word wrong. The search engine Google, for example, has a suggestion system that offers alternatives to words written by you, an option that is proving a very useful feature of the engine. For example, if we write "prgrammer" it would be very useful for a system to detect the error and correct it, giving us the "programmer " option.

What is the editing distance or Levenstein distance?
Levenstein distance is an algorithm that can help determine the similarity between the phrases or words used in searches. The function can be used to compare various differences between two words. Through Levenstein distance we understand the number of characters that need to be replaced, inserted, removed from the first expression to get the seconde xpression.
Application
- Search Engines
- Primitive AI
- Dictionaries synonyms, antonyms, etc..
- Longest common subsequence
- Damerau-Levenshtein Distance
- Hamming Distance
Issues
1. the problem
The Levenshtein distance between two strings is equal to the minimum number of operations needed to transform the first string in the other. Operations that are allowed are to insert a character, deleting a character or replace a character with another character.
Requirement
Knowing the two strings, determine which is the minimum number of operations needed to transform the first string in the second string.
Solution
Obviously, this issue is resolved by applying the classical algorithm Levenstein distance.The solution is based on dynamic programming. It should be noted that we should not distinguish between uppercase and the rest of the characters.
2. The problem
The Arhirel dragon decides to teach biology, so he wants to buy a tenth grade manual. Unfortunately, it is no longer available on the market, but the dragon is able to find a copy at a friend. Once he begins reading, dragon Arhirel notes that there are some mistakes in his friends book and because of a energy boost, he decides to correct manual. He has been provided a dictionary of M words from which he needs to retrieve variants of the wrong word. On the wrong word the dragon can make the following changes so that it reaches a variant from the dictionary:
- delete a letter;
- insert a letter;
- change a letter into another letter.
However, Dragon Arhirel is lazy, so he doesn't wants to operate more than K changes to the wrong word to bring it to the correct form (found in the dictionary).
Requirement
Help the dragon Arhirel to discover which of the words in the dictionary could be variants of the wrong word.
Solution
Note that a complexity as in the first solution will not lead to a result of 100 points. The classical algorithm for determining the Levenstein distance is O (n * m), where n and m represents the length of the two words compared. Therefore, O (n * m * k) will certainly come out of the execution time. For this, you have to notice that we are interested only in the main diagonal (with the elements around them) and the first k steps (k is the minimum increments we need to decide if we can or can't transform the word in maximum k steps). This solution also uses Levenstein editing algorithm, dynamic programming default.
Improvements
- Note that we need memory O (m * n), but the dynamics requires only keeping the last two lines from the array so we can reduce it to O (2 * n).
- We can save the deletion, editing, inserting separately.
- If we are interested only in determining the editor distances that are smaller than a value k, we use a similar complexity as from the problem 2 of this article.
- We normalize the results in the interval [0,1]
CODE:
double _levenshtein(char* s1, char* s2) {
return 1-levenshtein(s1, s2)/max(strlen(s1),strlen(s2));
} - We can give different values of penalty for insertion, replacement, deletion.
Additions
You can take a look at the PHP Levenstein function
Levenstein Distance from Wikipedia.
| There are no comments on this article. Be the first to say your opinion. |






