Paul Heckel's Diff Algorithm

7/23/2012 - 8:33 PM
Paul Heckel's Diff Algorithm

[Isolating Differences Between Files][paper]
============================================

Advantage over Other Algorithms
-------------------------------
The diff output is more specific:

>[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.

>The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.

Compare to Python's [`difflib`](http://blog.doughellmann.com/2007/10/pymotw-difflib.html). We get an in-line diff instead of a block-level diff. Instead of this:

~~~diff
-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.
~~~

... you get this:

I <del>did not have</del> <ins>may have had</ins> sexual relations with that woman.

Running Time
------------
Linear time and space for all cases.

The Algorithm
-------------
>To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:

### Preliminary Observations ###
1. If a line occurs only once in each file, then it must be the *same* line, although it may have been moved.

    We use this observation to **locate** unaltered lines that we subsequently **exclude** from further treatment.

2. If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the *same* line. This information can be used to find blocks of unchanged lines.

### Input, Output, and Data Structures ###

* Files:
    1. `O`, the old file
    2. `N`, the new file
* Symbol table:
    * `table`
* Arrays:
    1. `OA`
    2. `NA`

#### The Symbol Table `table` ####

Each `line` works as the key in the `table` look-up, i.e. as `table[line]`.

The symbol table stores three entries for each line:

1. Counters:
    1. `OC`
    2. `NC`
2. Line reference:
    * `OLNO`

In Python:

~~~python
table = {
    line: {
        "OC":   ocurrences_of_line_in_o
        "NC":   ocurrences_of_line_in_n
        "OLNO": line_number_in_o
    }
}
~~~

##### Counters `OC` and `NC` #####

The value entry for each line in `table` has two counters. They specify the line's number of occurrences in `O` and `N`: `OC` and `NC`.

`OC` and `NC` can assume three values: 1, 2, and many.

##### The Line Reference `OLNO` #####

Aside from the two counters, the line's entry also includes a reference to the line's line number in `O`: `OLNO`.

`OLNO` is only interesting, if `OC == 1`. Alternatively, `OLNO` would have to assume multiple values or none at all.

#### The Arrays `OA` and `NA` ####

The arrays `OA` and `NA` have one entry for each line in their respective files, `O` and `N`. The arrays contain either:

* a pointer to the line's symbol table entry, `table[line]`
* the line's number in the *other* file (`N` for `OA`, `O` for `NA`)

... and some code to determine which. (See Third Pass part c.)

In other words, you would have for each array:

* `OA`
    * one entry for each line of file `O` containing either
        * a pointer to `table[line]`
        * the line's number in file `N`

* `NA`
    * one entry for each line of file `N` containing either
        * a pointer to `table[line]`
        * the line's number in file `O`

In Python:

~~~python
OA = []
NA = []
~~~

### The Passes ###

The algorithm has six passes:

#### First Pass ####

* a. Each line `i` of file `N` is read in sequence
* b. An entry for each line `i` is created in the table, if it doesn't already exist
* c. `NC` for the line's `table` entry is incremented
* d. `NA[i]` is set to point to the `table` entry of line `i`

>a. each line `i` of file `N` is read in sequence

~~~python
i = 0
for line in N:
~~~

>b. An entry for each line `i` is created in the table, if it doesn't already exist

~~~python
if not line in table:
    table[line] = {}
#   (...)
# else: 
#   (...)
~~~

>c. `NC` for the line's `table` entry is incremented

~~~python
if not "NC" in table[line]:
    table[line]["NC"] = 1
else:
    if table[line]["NC"] == 1:
        table[line]["NC"] = 2  # Why not just have 1 and "many"?
    else:
        table[line]["NC"] = "many"
~~~
    
>d. `NA[i]` is set to point to the `table` entry of line `i`

~~~python
NA[i] = table[line]
i += 1
~~~

All steps combined:

~~~python
i = 0

for line in N:
    if not line in table:
        table[line] = {}
        table[line]["NC"] = 1
    else:
        if table[line]["NC"] == 1:
            table[line]["NC"] = 2
        else:
            table[line]["NC"] = "many"
    NA[i] = table[line]
    i += 1
~~~


#### Second Pass ####

Similar to first pass, except it acts on files
* `O`
* `OA`
* `OC`
* `OLNO` which is set to the line's number

>`O`

~~~python
j = 0
for line in O:
~~~

>`OA`

~~~python
OA[j] = table[line]
j += 1
~~~

>`OC`

~~~python
if not "OC" in table[line]:
    table[line]["OC"] = 1
else:
    if table[line]["OC"] == 1:
        table[line]["OC"] = 2
    else:
        table[line]["OC"] = "many"
~~~

>`OLNO`

~~~python
# for line in O:
table[line]["OLNO"] = j
~~~

All steps combined:

~~~python
j = 0

for line in O:
    if not line in table:
        table[line] = {}
        table[line]["OC"] = 1
    else:
        if not "OC" in table[line]:
            table[line]["OC"] = 1
        elif table[line]["OC"] == 1:
            table[line]["OC"] = 2
        else:
            table[line]["OC"] = "many"
    table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
    # Check to see if this is the intended implementation.
    # Maybe only relevant for "OC" == "NC" == 1
    OA[j] = table[line]
    j += 1
~~~

#### Third Pass ####

* a. We use *Observation 1*:

    >If a line occurs only once in each file, then it must be the *same* line, although it may have been moved.
    >
    >We use this observation to **locate** unaltered lines that we subsequently **exclude** from further treatment.

* b. Using this, we only process the lines where `OC == NC == 1`.

* c. As the lines between `O` and `N` "must be the *same* line, although it may have been moved", we alter the `table` pointers in `OA` and `NA` to the number of the line in the other file.

* d. <strong>We also locate unique virtual lines 

    * immediately before the first and
    * immediately after the last

    lines of the files</strong> (???)

>b. Using this, we only process the lines where `OC == NC == 1`.

~~~python
if "OC" in NA[i] and "NC" in NA[i]:
    if NA[i]["OC"] == NA[i]["NC"] == 1:
~~~

>c. As the lines between `O` and `N` "must be the *same* line, although it may have been moved", we alter the `table` pointers in `OA` and `NA` to the number of the line in the other file.

~~~python
"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]

NA[i]    = olno
OA[olno] = i
~~~

>d. <strong>"Find" unique virtual lines immediately before the first and immediately after the last lines of the files</strong> (???)

All steps combined:

~~~python
i = 0

for i in NA:
    if "OC" in NA[i] and "NC" in NA[i]:
        if NA[i]["OC"] == NA[i]["NC"] == 1:
            olno = NA[i]["OLNO"]
            NA[i], OA[olno] = olno, i
    i += 1
~~~

#### Fourth Pass ####

* a. We use *Observation 2*:

    >If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the *same* line. This information can be used to find blocks of unchanged lines.

* b. Using this, we process each entry in **a**scending order.

* c. If 

    * `NA[i]` points to `OA[j]`, and
    * `NA[i+1]` and `OA[j+1]` contain identical `table` entry pointers

    then

    * `OA[j+1]` is set to line `i+1`, and
    * `NA[i+1]` is set to line `j+1`

>b. Using this, we process each entry in **a**scending order.

~~~python
# ascending
for i, j in NA, OA:
~~~

>c. If (...) then (...)

~~~python
if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
    OA[j+1] = table[O[i+1]]
    NA[i+1] = table[N[j+1]]
~~~

#### Fifth Pass ####

Similar to fourth pass, except:

* It processes each entry in **de**scending order
* It uses `j-1` and `i-1` instead of `j+1` and `i+1`

>It processes each entry in **de**scending order

~~~python
# descending
for i, j in NA, OA:
~~~

>It uses `j-1` and `i-1` instead of `j+1` and `i+1`

~~~python
if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
    OA[j-1] = table[O[i-1]]
    NA[i-1] = table[N[j-1]]
~~~

#### Final Pass ####

At this point following our five passes, we have the necessary information contained in `NA` to tell the differences between `O` and `N`.

This pass uses `NA` and `OA` to tell when a line has changed between `O` and `N`, and how far the change extends.

##### Determining a New Line #####

Recall our initial description of `NA` in which we said that the array has either:

>one entry for each line of file `N` containing either
>    * a pointer to `table[line]`
>    * the line's number in file `O`

Using these two cases, we know that if `NA[i]` refers to an entry in `table` (case 1), then line `i` must be new.

We know this, because otherwise, `NA[i]` would have contained the line's number in `O` (case 2), if it existed in `O` and `N`.

##### Determining the Boundaries of the New Line #####

We now know that we are dealing with a new line, but we have yet to figure where the change ends.

Recall *Observation 2*:

>If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

If `NA[i]` points to `OA[j]`, but `NA[i+1]` does not point to `OA[j+1]`, then line `i` is the boundary for the alteration.

You can look at it this way:

~~~
i  : The quick brown fox      | j  : The quick brown fox
i+1: jumps over the lazy dog  | j+1: jumps over the loafing cat
~~~

Here, `NA[i] == OA[j]`, but `NA[i+1] != OA[j+1]`. This means our boundary is between the two lines.

##### Output Diff #####

At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between `O` and `N`.


[paper]: http://scholar.google.dk/scholar?q=10.1145%2F359460.359467&btnG=&hl=da&as_sdt=0
Cacher is the code snippet organizer for pro developers

We empower you and your team to get more done, faster

Paul Heckel's Diff Algorithm