Thursday, March 14, 2013

Measuring Project Activities (1)

Earlier, I showed a handful of metrics to view the level of activities in Git project, grouped by its release cycle, and promised to expllain how you can compute similar numbers for your projects.

This is the first of such posts. This post covers the very basics.

How long did a cycle last?

Each release is given a release tag. The latest I tagged for Git project was v1.8.2 and the release before that was v1.8.1. The release cycle began when I tagged v1.8.1 and ended when I tagged v1.8.2. As each commit in Git records commit timestamp and author timestamp, we can use diffrence between the commit timestamps of the two release.

We can ask git log to give us the timestamp for one commit:

  git log -1 --format=%ct $commit

The --format= option lets us ask for various pieces of information, and %ct requests the committer timestamp, expressed as number of seconds since midnight of January 1, 1970 (epoch). You can use %ci for the same information but in ISO 8601 format, i.e. YYYY-MM-DD HH:MM:SS; see git log --help and look for "PRETTY FORMATS" for other possibilities.

So the part, given two commits, that computes the number of days between them, becomes something like this:

handle () {
  old="$1^0" new="$2^0"
  oldtime=$(git log -1 --format=%ct "$old")
  newtime=$(git log -1 --format=%ct "$new")
  days=$(( ($newtime - $oldtime) / (60 * 60 * 24) ))
  ...
}

We ask the commit timestamps for the two commits in seconds since epoch, take the difference, and divide that by number of seconds in a day.

How many commits do we have in the cycle?

This is a single-liner.
git log has a way to list commits in a specified range, and the range we want can be expressed as: "We are interested in commits that are ancestors of v1.8.2, but we are not interested in commits that are ancestors of v1.8.1" (as the latter is the set of commits that happened before v1.8.1).

In a merge-heavy project like Git, however, merge commits make up a significant part of the history. A logical change that consists of three patches may start its life as three commits on a topic branch to be tested, and later when it proves to be sound gets merged to the mainline with a merge commit, at which point the mainline gains 4 commits (the original three plus the merge commit). That means the real change is only 75% of the history in the example.

Of course, merging other people's work is an important part of the work done in the project, so you may want to count merge commits as well. The choice is up to you.

When I counted commits for Git project in the earlier article, I chose not to include merges, so the part that computes the number of commits between two given commits becomes:

handle () {
  old="$1^0" new="$2^0"
  ...
  commits=$(git log --oneline --no-merges "$old..$new" | wc -l)
  ...
}

Drop --no-merges if you want to count your merges. The --oneline option is to show a single line of output per commit; by counting the lines in the output from that command with wc -l, we can count the number of commits.

As we are not interested in the contents of the output (we are just counting the number of lines), we can also use git rev-list that only shows the commit object name, if you want.

  commits=$(git rev-list --no-merges "$old..$new" | wc -l)

How many contributors did we have in the cycle?

You can list the names and e-mails of people who authored commits in a specified range in two ways.

Using the git log --format we saw earlier, we can ask the name %an and e-mail %ae of the author, i.e.

  git log --no-merges --format="%an <%ae>" "$old..$new"

You can count the unique lines in the output from this command. That is the list of your contributors. The end result will become something like this:

  authors=$(git log ...the same as above... | sort -u | wc -l)

The other way is to use the git shortlog command designed specifically for this purpose.

  git shortlog --no-merges -s -e "$old..$new"

The command without -e option only shows the names (and with it, names and e-mails). It lists commits made by each author along with the author name when run without -s option (and with it, the number of commits and the author's name on the same line). So the number of lines in the output from the above command is the number of your contributors.

  authors=$(git shortlog --no-merges -s -e "$old..$new" | wc -l)

Again, if you want to count merges, drop --no-merges from the command line.

How many new contributors have we added during the cycle?

This is a bit trickier than the previous one. The idea is to list contributors we already had in the entire history before v1.8.1, and subtract that from the list of contributors in the entire history up to v1.8.2. The remainder are the newcomers you want to welcome when writing your release notes.

The contributors in the entire history leading to a commit can be listed with a helper function:

authors () {
  git log --no-merges --format="%an <%ae>" "$@" | sort -u
}

and we can write the entire thing using the helper function like so:

handle () {
  old="$1^0" new="$2^0"
  ...
  authors "$old" >/tmp/old
  authors "$new" >/tmp/new
  new_authors=$(comm -13 /tmp/old /tmp/new | wc -l)
  ...
}

The authors helper function will write the authors for the old history and new history into two temporary files, both in sorted order, and using comm -13, we list lines that only appear in /tmp/new to see who are the new contributors.

In the next installment of this series, let's count the changes made to the codebase by these commits we counted in this article.

2 comments:

Anonymous said...

Is there some reason not to use the --count option like so:

commits=$(git rev-list --no-merges --count "$old..$new")

instead of:

commits=$(git rev-list --no-merges "$old..$new" | wc -l)

Junio C Hamano said...

It is just inertia.