Tay Ray Chuan

potential git-diff speed up by skipping is-binary check

Sat, 21 Jan 2012 21:09:45 +0800 | Filed under git

While working on GSoC 2011, I flagged the is-binary check as a potential slowdown in my personal notes. (Yes, it's done for every file.) I tried to disable it with a command-line switch to git-diff, but that was ugly and hackish.

Recently I was looking at the diff codepath again (due to a patch for --word-diff that turned out to be misjudged), so I gave this another shot.

In diff.c, we find diff_filespec_is_binary():

int diff_filespec_is_binary(struct diff_filespec *one)
{
	if (one->is_binary == -1) {
		diff_filespec_load_driver(one);
		if (one->driver->binary != -1)
			one->is_binary = one->driver->binary;
		else {
			if (!one->data && DIFF_FILE_VALID(one))
				diff_populate_filespec(one, 0);
			if (one->data)
				one->is_binary = buffer_is_binary(one->data,
						one->size);
			if (one->is_binary == -1)
				one->is_binary = 0;
		}
	}
	return one->is_binary;
}

On line 11 of the extract, we call xdiff's buffer_is_binary(), which tries to find a NUL with memchr(). Looking at the extract, we see that the call is only reached when the driver is undecided (driver->binary == -1), so it is possible to skip the is-binary check by defining a diff driver and setting diff.<driver>.binary.
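To make the cost concrete, here is a minimal sketch of what a NUL-scanning check of this kind looks like. It is not copied from git: the name sketch_buffer_is_binary and the 8000-byte cap are assumptions for illustration, on the premise that only a fixed-size prefix of the buffer is scanned.

```c
#include <string.h>

/* Assumed cap on how many leading bytes get scanned (illustrative value). */
#define SKETCH_FIRST_FEW_BYTES 8000

/*
 * Hypothetical stand-in for the is-binary heuristic: a buffer counts as
 * binary if a NUL byte appears within its first few bytes.
 */
static int sketch_buffer_is_binary(const char *ptr, unsigned long size)
{
	if (size > SKETCH_FIRST_FEW_BYTES)
		size = SKETCH_FIRST_FEW_BYTES;
	return memchr(ptr, 0, size) != NULL;
}
```

If the scan really is bounded like this, each check is a single short memchr() over data that has already been loaded, which would hint at why skipping it might not buy much.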

(You might have noticed the dozen-or-so diff drivers for programming languages that git has built in, but those won't do, because they are "undecided" about binary-ness, i.e. driver.binary=-1.)

It turns out that we don't have to define our own custom diff driver. In userdiff.c, we have the built-in "dummy" driver driver_true that has driver.binary=0 (false). We can turn this on by setting the diff attribute in .gitattributes, like this:

# glob [attr1 [attr2 [...]]]
* diff
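The effect of the attribute can be modelled in a few lines. The sketch below is a simplified, standalone imitation of the decision in diff_filespec_is_binary(), not git's code: struct model_driver, model_is_binary() and the scan counter are hypothetical names that exist only to make the skipped scan observable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a diff driver: -1 = undecided, 0 = text, 1 = binary. */
struct model_driver {
	int binary;
};

/* Counts how often the content scan ran, so the skip can be observed. */
static int model_scan_count;

static int model_scan_for_nul(const char *data, size_t size)
{
	model_scan_count++;
	return memchr(data, 0, size) != NULL;
}

/* Same shape as the extract: trust a decided driver, otherwise scan the content. */
static int model_is_binary(const struct model_driver *drv,
			   const char *data, size_t size)
{
	if (drv->binary != -1)
		return drv->binary;
	if (data)
		return model_scan_for_nul(data, size);
	return 0;
}
```

With a driver whose binary field is 0, model_is_binary() returns 0 without ever touching the data. The flip side is that a file that does contain NULs is then treated as text too, so in this model the attribute trades the check for trust in the attribute.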

However, some light testing shows the gains are not worth the trouble. I ran time git log -p v0.99 >/dev/null in the git repo itself (which has 1075 commits); here are the best-of-5 timings on a Solaris machine at NUS:

Oh well.
