Back during my sophomore year of university (about six years ago now), I wrote some code to compute edit distances. This was unfortunately before I'd learned that fine art of commenting, so those classes are a little difficult to understand. I'm pretty sure though I understand what's going on.
Th basic idea is edit distance which is a metric of the difference between two strings. There are four possible transformation operations:
Any string may be transformed to any other through a combination of these four operations. A value is assigned to each and the summation of those values is used to come up with the edit distance. For these examples we will use a uniform value of one for insert, remove and replace and zero for match. Some examples:
Another method of visialising the process is in a table. The source is on the y-axis and the destination on the x-axis. The process starts at the upper left corner and moves to the lower left. For each position there are each of the four possible operations as options:
f | r | o | m | ||
---|---|---|---|---|---|
→ | ↘ | ||||
r | ↘ | ||||
o | ↘m↘ | ||||
l | ↓ | ||||
l | • |
There are three non-match operations there for a cost of three. There are other possible pathes, like:
f | r | o | m | ||
---|---|---|---|---|---|
↓ | |||||
r | ↓ | ||||
o | ↓ | ||||
l | ↓ | ||||
l | → | → | → | → | • |
That is a cost of eight (four deletes and four inserts). When speaking of edit distance however, one is really speaking of minimum edit distance, or the lowest cost path throught the table.
This code finds that path, but it is primarily concerned with another problem. Often, in situations like spell checking, you want the edit distance relative to a group of strings (to find the possible corrections). To find this fee thousands of times is too computationally expensize, so tables of counts of characters are built to estimate the possible edit distance. For example, for the word "test" this is the table:
t | e | s | t | |
---|---|---|---|---|
t | 2 | 1 | 1 | 1 |
e | 1 | 1 | 0 | 0 |
s | 1 | 1 | 1 | 0 |
Each cell for column i and row j means that for the letter j there are n occurences of that letter at or after i in the string. For example the first cell is 2. The row begins with 't'. This means that there are two t's at or after the beginning of "test." The cell to the right is 1 because there is only one 't' after the first letter of "test."
This table can be used to put a lower bound on the edit distance. If the source has three t's after the current position and the destination has none then there will have to be at least three inserts or replaces to get those t's.