Web Tutorials     [RO]  [EN]  
| HOME | Tutorials | News | SERVICES | Directory | Tools | FORUM | About | SITE MAP | CONTACT | SEARCH |
.....................................................
.....................................................
User happy birthdayToday we celebrate one day of birth.
(dannyb0y)
.....................................................
Login
Register
I forgot my password
.....................................................
Put your ad here
.....................................................
Online
In total there is
12 visitors online,
of which:
2 we have
10 are bots
.....................................................
Put your ad here
.....................................................
.....................................................
.....................................................
.....................................................
.
Home - Levenstein distance, editing algorithm

<< Creating a simple captcha system with PHP   -   Useful Opera widgets for web developers >>
Rate this article(Members only)
1 2 3 4 5
A - A Announced this way the site administrator for any problems observed on this page.  Print this page as PDF  Email  

Levenstein distance, editing algorithm


Publishing date: 21-02-2011 - Copyright © Andrei Avadanei

In the daily life, we all have to face with computer applications that have a huge amount of information and resources. To satisfy the user in a very short time these applications have developed various systems to facilitate access to the information held. It's very easy to search on Google (or on any other search engine) what we want to find instead of taking a list of a few trillions of sites to make the same operation.

One of the major problems with which search engines face is the carelessness of the end-user, it frequently happens to spell a word wrong. The search engine Google, for example, has a suggestion system that offers alternatives to words written by you, an option that is proving a very useful feature of the engine. For example, if we write "prgrammer" it would be very useful for a system to detect the error and correct it, giving us the "programmer " option.

levenstein

What is the editing distance or Levenstein distance?

Levenstein distance is an algorithm that can help determine the similarity between the phrases or words used in searches. The function can be used to compare various differences between two words. Through Levenstein distance we understand the number of characters that need to be replaced, inserted, removed from the first expression to get the seconde xpression.

Application

  • Search Engines
  • Primitive AI
  • Dictionaries synonyms, antonyms, etc..
  • Longest common subsequence
  • Damerau-Levenshtein Distance
  • Hamming Distance

Issues

1. the problem

The Levenshtein distance between two strings is equal to the minimum number of operations needed to transform the first string in the other. Operations that are allowed are to insert a character, deleting a character or replace a character with another character.

Requirement

Knowing the two strings, determine which is the minimum number of operations needed to transform the first string in the second string.

Solution

Obviously, this issue is resolved by applying the classical algorithm Levenstein distance.The solution is based on dynamic programming. It should be noted that we should not distinguish between uppercase and the rest of the characters.

2. The problem

The Arhirel dragon decides to teach biology, so he wants to buy a tenth grade manual. Unfortunately, it is no longer available on the market, but the dragon is able to find a copy at a friend. Once he begins reading, dragon Arhirel notes that there are some mistakes in his friends book and because of a energy boost, he decides to correct manual. He has been provided a dictionary of M words from which he needs to retrieve variants of the wrong word. On the wrong word the dragon can make the following changes so that it reaches a variant from the dictionary:
- delete a letter;
- insert a letter;
- change a letter into another letter.
However, Dragon Arhirel is lazy, so he doesn't wants to operate more than K changes to the wrong word to bring it to the correct form (found in the dictionary).

Requirement

Help the dragon Arhirel to discover which of the words in the dictionary could be variants of the wrong word.

Solution

Note that a complexity as in the first solution will not lead to a result of 100 points. The classical algorithm for determining the Levenstein distance is O (n * m), where n and m represents the length of the two words compared. Therefore, O (n * m * k) will certainly come out of the execution time. For this, you have to notice that we are interested only in the main diagonal (with the elements around them) and the first k steps (k is the minimum increments we need to decide if we can or can't transform the word in maximum k steps). This solution also uses Levenstein editing algorithm, dynamic programming default.

Improvements

  • Note that we need memory O (m * n), but the dynamics requires only keeping the last two lines from the array so we can reduce it to O (2 * n).
  • We can save the deletion, editing, inserting separately.
  • If we are interested only in determining the editor distances that are smaller than a value k, we use a similar complexity as from the problem 2 of this article.
  • We normalize the results in the interval [0,1]
CODE:
double _levenshtein(char* s1, char* s2) {
   return 1-levenshtein(s1, s2)/max(strlen(s1),strlen(s2));
}
  • We can give different values of penalty for insertion, replacement, deletion.

Additions

You can take a look at the PHP Levenstein function
Levenstein Distance from Wikipedia.

Publishing date: 21-02-2011 - Copyright © Andrei Avadanei   
Click here if you want to see other articles by the same author
There are no comments on this article. Be the first to say your opinion.

Add a comment on this article (members only - login on the site):
Put your ad here