NoSQL Zone is brought to you in partnership with:

Tom is a DZone MVB and is not an employee of DZone and has posted 1 posts at DZone. View Full User Profile

Programmable Diff

01.21.2014
| 6615 views |
  • submit to reddit

Programmable Diff

From time to time, I get down deep inside the NuoSQL compiler and give it a good hard shake. That's generally good for the compiler, but it can make comparing the new compiler to the old compiler's results an exercise in wading through a sea of small differences, some consequential, some not. Computers are much better at that than I am, so I would like to use a diff tool that's extremely configurable. PHP has array-udiff, and Python has difflib, but I wanted a tool written in Java, and I also wanted to learn more about how diff works. I develop dynamic programming algorithms building compilers, and I know many diff algorithms are based on dynamic programming. Should be fun!

A Quick Tour

The DiffEngine<T extends Comparable> code in our github repository is the foundation of this tool. DiffEngine<T> compares two List<T> instances; its getDifferences() method returns a List<Difference> of the element-by-element edit operations that will transform the first list into the second list. DiffEngine.coalesceRegions() transforms the element-by-element list into output that looks more like a traditional diff.

The FilterByRegex tool takes paths to two files to be compared, and a path to a pattern file:

java -cp classes com.nuodb.diff.FilterByRegex testResults.expected testResults.actual -p patternFile

The pattern file is a list of regular expressions with capturing groups, such as:

\s*(?:\w)+\s+(?:\||#)\s*(.*)\s*\(\d+\)\s*

That bit of gobbledygook finds lines from a test that look like this:

 main |           This is the interesting part (2)

and captures "This is the interesting part" in the capturing group, (.*) in the pattern. If a line in either file to be compared matches a pattern, FilterByRegex constructs an abstract of the original line that only contains the contents of the regex's capturing groups -- abstracting the line above to "This is the interesting part". (If the line doesn't match any regex, then the line's abstract is the line's original contents.) The diff engine then compares lines based on their abstract, so these lines would all be considered equvalent:

 main |           This is the interesting part (2)
 main #           This is the interesting part (2)
 main |           This is the interesting part (12)
 zort #     This is the interesting part (943344983)

The Diff Algorithm

The diff engine is surprisingly simple. Like most diff algorithms, it is based on a solution for the Longest Common Subsequence problem, a.k.a. "LCS." The LCS of two sequences is the longest collection of elements that can be found in order in both sequences. For example, for the two sequences:

ABCDEFG ACXDKEVG

the longest common subsequence is A-C-D-E-G. As you can see, the elements in the LCS are not necessarily adjacent. There may be more than one LCS: for example, for the two sequences: ABCD ACBD A-B-D and A-C-D are both valid "longest" common subsequences. The Wikipedia article on Longest Common Subsequence has a good overview.

Finding the LCS of the two inputs gets us a diff because the LCS is a close mathematical relative of the "edit distance," a measure of how many discrete changes are required to transform the first sequence into the second. That's starting to sound pretty diff-like.

Computing the LCS

The algorithm to compute the LCS works by processing successive prefixes of the input sequences. In brief, for any two prefixes A(1..i) and B(1..j) of the original sequnces A and B, we compare the final elements, A[i] and B[j]. If they are equal, then the LCS of these two prefixes is one element longer than the LCS of the prefixes A(1..i-1) and B(1..j-1); if the final elements are not equal, then the LCS is the larger of the LCS of the prefixes A(1..i-1),B(1..j) or A(1..i),B(1..j-1). Like many string algorithms, this algorithm uses 1-based indexing; the LCS is defined to be 0 when i or j is 0. The many overlapping computations of the prefixes' LCS can be recorded in an array of subsolutions, which yields an algorithm based on dynamic programming:

    lcs = new int[sequenceA.size()+1][sequenceB.size()+1];

    for (int i = 0; i < sequenceA.size(); i++) {

        for (int j = 0; j < sequenceB.size(); j++) {

            if (sequenceA.get(i).compareTo(sequenceB.get(j)) == 0) {

                // The subsequence extends to the ith and jth 
                // positions of the respective strings.
                lcs[i+1][j+1] = lcs[i][j] + 1;

            } else {

                // The longest subsequence at the ith and jth
                // positions is the longest of the two subsequences
                // of the prefixes sequenceA[1..i-1], sequenceB[1..j-1].
                lcs[i+1][j+1] = Math.max(lcs[i][j+1], lcs[i+1][j]);
            }
        }
    }

Converting the LCS Into a Diff

Converting the LCS into a diff is done by starting at the end of the lcs array and backtracing through the array to construct the series of element-by-element editing operations that transform sequence A into sequence B. The algorithm starts with

    int i = sequenceA.size();   // Recall that this algorithm uses 1-based addressing
    int j = sequenceB.size();

and works back through the sequences and the lcs array. For any values of i and j between their initial values and 0, one of these conditions is true:

  • A[i] = B[j].
    • In this case, the element is common to both sequences and the algorithm continues on to examine A[1..i-1] and B[1..j-1].
  • j is greater than zero, and i is zero.
    • This means sequence B has a prefix that did not appear in the common subsequence. The algorithm adds an "insert B[j]" editing operation to the series of editing operations, and continues by examining A[1..i] and B[1..j-1].
  • Similarly, if i is greater than zero and j is zero, then sequence A has a prefix that did not appear in the common subsequence. The algorithm adds a "delete A[i]" editing operation, since its goal is to generate editing operations to transform sequence A into sequence B, and continues by examining A[1..i-1] and B[1..j].
  • If i and j are both greater than zero, then the algorithm has to emit an editing operation and continue with one of the pairs of prefixes A[1..i],B[1..j-1] or A[1..i-1],B[1..j]. The algorithm picks its prefixes based on the lcs matrix; the prefix pair with the longest lcs is the prefix pair that will eventually produce the least number of editing operations. If both pairs of prefixes have the same lcs, then the choice is arbitrary.
    • The editing operation emitted depends on which prefix pair was chosen.
      • Choosing the prefix pair A[1..i],B[1..j-1] means the algorithm has discarded an element of sequence B; since the goal is to emit editing operations to transform sequence A into sequence B, this adds a "insert B[j]" editing operation.
      • Likewise, continuing with A[1..i-1],B[1..j] adds a "delete A[i]" editing operation.
Published at DZone with permission of Tom Harwood, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)