Rob Williams is a probabilistic Lean coder of Java and Objective-C. Rob is a DZone MVB and is not an employee of DZone and has posted 170 posts at DZone.

Embedding JGit: A First Look

09.26.2011

A couple of years ago I needed a spider to do some excavation of data for an analytics project. I found a project called JSpider that seemed great, and hey, the open source credo is 'use what's there, don't reinvent the wheel.' Well, that didn't turn out so well: the thing was a total hassle. I downloaded the source and to my shock and horror found that it was pre-Java 5. So I wrote my own spider. Since then, I have had many occasions to consider the question: what does a spider actually do? It has to extract links from pages to keep going, so it has its own needs, but what else is its job? Recently, I have been working on a scraper, knowing that I would marry it to my spider when I was done. My conclusion is that, in general, spiders should focus on the discovery portion of the problem: following threads and unearthing the underlying topological logic. The work of actually getting things from the page should be done elsewhere, either by scrapers (if we have specific items we mean to remove from the tangled web catacombs) or by simple indexers (if we are just going to expose our findings to search). I will probably blog some more about the architecture of spidering later; it's an interesting case of being able to design for clear distinctions of responsibilities that can be extended.

Meanwhile, some time ago when I was thinking about my spider, it dawned on me that perhaps a missing piece in the spider landscape is versioning. Conceptually speaking, an interesting question is: how can a spider purport to discover things if it doesn't know what it has seen before? In fact, in implementing spiders, a seen list is a must. Versioning takes that idea into the diachronic dimension: there's really no reason to reindex or rescrape a page if nothing has changed on it.

At first I was thinking about just implementing something like a checksum, but then I thought, that's pretty stupid. I also thought that there could be real value in maintaining diffs and history for all the pages on the site.
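For comparison, the checksum-only approach that was set aside here might look roughly like the sketch below. It is an illustration, not code from the spider: `PageChangeTracker` and its methods are hypothetical names, and SHA-256 is just one reasonable digest choice. It can say that a page changed, but it keeps no diffs and no history, which is the gap a real repository fills.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers one SHA-256 digest per URL to detect changed pages. */
class PageChangeTracker {

    // URL -> hex digest of the content seen on the previous visit (the "seen list")
    private final Map<String, String> seen = new HashMap<>();

    /** Returns true if the URL is new or its content differs from the previous visit. */
    boolean hasChanged(String url, String content) {
        String digest = sha256Hex(content);
        String previous = seen.put(url, digest); // null on the first visit
        return !digest.equals(previous);
    }

    /** Hex-encoded SHA-256 of the UTF-8 bytes of the given content. */
    static String sha256Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on the Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

A spider would call `hasChanged(url, body)` before handing a page to the scraper or indexer and skip the page when it returns false.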

So I went and got JGit.

The acquisition part was fairly smooth, though the downloads page makes it seem like they have a Maven repository when in fact they don't. (That was a good way to start my open source adoption journey. It had kind of the feeling of Vor dem Gesetz to it: you know the thing exists, and it's there, but you are sent to a door that doesn't exist.) Then my trusty friend Nexus intervened, and it turned out that the JGit source was in the JBoss public repo. The version numbers were different, and long and ugly, but hey, the dependency inclusion went pretty quickly and I was able to import the classes.

As is often the case with libraries like this, there is a low-level version of the API and then a higher-level one (named porcelain). I wanted to write a unit test that would show that I could create a repository, check a file in, and then get a log showing that these events really happened.

One of the funniest discoveries while working on this piece was the @Rule annotation in JUnit. First off, it is incredibly stupidly named, because it really does not make you think of a way to remove a temporary file. But also, how stupid is it that you can't tell it not to delete the temp folder, so that you can get your code working while inspecting what you have, then take off the delete=false or whatever, and have the test pass? So I ended up having to write these tests twice: once to a folder off the project root where I could see what it was doing, and then again to the temporary folders. The calls are not that different, but there are enough differences that you can't run the same code on both.

Overall, this was pretty painless. Now that there is zero question about the consensus winner in the repository space, using git for all kinds of other things (and, by the way, through a pure Java implementation of it) is a no-brainer.

package com.ontometrics.spider.repository;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.notNullValue;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Iterator;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.ConcurrentRefUpdateException;
import org.eclipse.jgit.api.errors.JGitInternalException;
import org.eclipse.jgit.api.errors.NoFilepatternException;
import org.eclipse.jgit.api.errors.NoHeadException;
import org.eclipse.jgit.api.errors.NoMessageException;
import org.eclipse.jgit.api.errors.WrongRepositoryStateException;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.revwalk.RevWalk;
import org.eclipse.jgit.storage.file.FileRepository;
import org.eclipse.jgit.storage.file.FileRepositoryBuilder;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RepositoryTest {

	private static final Logger log = LoggerFactory.getLogger(RepositoryTest.class);

	@Rule
	public TemporaryFolder fileFolder = new TemporaryFolder();

	private File repositoryFolder;

	@Before
	public void setup() throws IOException {
		// newFolder() declares IOException in newer JUnit 4 releases
		repositoryFolder = fileFolder.newFolder(".git");
	}

	@Test
	public void canCreateNewRepository() throws IOException, NoHeadException, NoMessageException,
			ConcurrentRefUpdateException, JGitInternalException, WrongRepositoryStateException, NoFilepatternException {

		FileRepository repository = new FileRepositoryBuilder().setGitDir(repositoryFolder).build();
		log.info("dir: {}", repository.getDirectory());

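		// create() writes the on-disk repository structure. The static Git.init() call is
		// not needed here: it would initialize a second, unrelated repository in the
		// current working directory.
		repository.create();
		Git git = new Git(repository);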

		RevWalk walk = new RevWalk(repository);
		RevCommit commit = null;

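		// Build the path from parent + child; plain string concatenation drops the
		// separator and would place the file outside the temporary work tree.
		File exampleHtml = new File(fileFolder.getRoot(), "examplePage.html");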
		exampleHtml.createNewFile();
		FileWriter out = new FileWriter(exampleHtml);
		out.write("<html>");
		out.write("<table>");
		out.write("</table>");
		out.write("</html>");
		out.close();

		git.add().addFilepattern(".").call();
		git.commit().setMessage("Simple html file.").call();

		Iterable<RevCommit> logs = git.log().call();
		Iterator<RevCommit> i = logs.iterator();

		while (i.hasNext()) {
			commit = walk.parseCommit(i.next());
			log.info(commit.getFullMessage());

		}

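		// The log must contain the commit made above, not just a non-null repository.
		assertThat(commit.getFullMessage(), is("Simple html file."));
		assertThat(repository, is(notNullValue()));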

		repository.close();
	}

}


(Insane that we can't simply format code on here a decade later...)

The main thing to note is that you want the repository to live in a different directory from the folder that contains the files you want to version.

The next step will be to have the spider's page processor do a diff on each page that it has encountered before and, if there are differences, enter a new version. Then the question becomes how upstream consumers get notified. The natural response would be that they could subscribe to be notified of changes (Observer). I read an interesting if showy and opaque article the other day from some Scala cat about how Observer ought to be deprecated. Thought it was pretty weak, really. My Butthead reading was: 'words, words, words, "when used wrong, the results with Observer are suboptimal," words, words...'
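A minimal sketch of that subscribe-and-notify idea, under stated assumptions: `PageChangeListener` and `VersionedPageStore` are hypothetical names, the store is in-memory, and a plain string comparison stands in for the JGit diff and commit that the real page processor would perform.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback for consumers that want to hear about new page versions. */
interface PageChangeListener {
    void pageChanged(String url, int version);
}

/** Hypothetical store: records a new version only when a page's content differs. */
class VersionedPageStore {

    private final Map<String, String> latestContent = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();
    private final List<PageChangeListener> listeners = new ArrayList<>();

    /** Registers a consumer to be told about every newly recorded version. */
    void subscribe(PageChangeListener listener) {
        listeners.add(listener);
    }

    /** Returns true and notifies subscribers if a new version was recorded. */
    boolean submit(String url, String content) {
        if (content.equals(latestContent.get(url))) {
            return false; // unchanged: nothing to version, nobody to notify
        }
        latestContent.put(url, content);
        // First submission of a URL becomes version 1, later changes increment it
        int version = versions.merge(url, 1, Integer::sum);
        for (PageChangeListener listener : listeners) {
            listener.pageChanged(url, version);
        }
        return true;
    }
}
```

A scraper or indexer would register with `store.subscribe((url, version) -> ...)` and be called only for pages that actually changed, so unchanged pages cost it nothing.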

 

From http://www.jroller.com/robwilliams/entry/embedding_jgit_a_first_look

Published at DZone with permission of Rob Williams, author and DZone MVB.
