Sunday, March 13, 2011

Merging Branches with SVN

When I read things like "the world's most popular open source version control system" and (paraphrased) "designed to fix CVS's problems", I don't think "took important steps backwards", but I find that aspects of SVN have done just that. In particular, merging tags/branches becomes a challenge when the merge source has multiple revision anchors.

As expected, one can accomplish simple merges quickly using the explanations of commands in the SVN Red-bean Book. However, if you're working with multiple committers, or even one or two folks who liberally tag/branch, you'll run into unexpected results quickly. The problem stems from the fact that SVN's merge command does not automatically resolve the previous branch points for a merge on a file-by-file basis. As such, when you give a
svn merge -r<source>:<target> ... 
command, you may not be giving SVN enough information to do the correct thing. This is particularly the case if you or your fellow committers do things like merging in bulk from the project's root directory.

Consider an example:

(*)         - r1 User A
 |
(*)         - r2 User A
 |  \
 |    \ 
 |     (*)  - r3 User B
(*)     |   - r4 User C
 |   \  |
 |     (*)  - r5 User A 
 |     /
 |   /
(?)

In the case visualized above, we have three users committing code to two lines of development (the trunk and one branch). Remember, SVN uses a single, incrementally increasing value, global to the repository rather than the project, to represent revision. In the case represented, user A commits twice (r1 and r2) and then a second user (B) branches (r3).

Subsequent to that branch, A continues developing while C commits to what we'll refer to as the 'trunk'. Then, maybe because of a bug fix or lunch discussion, A commits to the branch. What we have, in commit #5, is a situation in which the files/changes (r4) were drawn from the trunk before modifications were made.

The question: how does A merge trunk and branch 1) with the least amount of pain and 2) while preserving the highest resolution of change meta-data (thus making subsequent situations like the r4-->r5 merge less painful)?

While SVN supports multiple approaches to branching/merging, I've found the solution to this problem that optimizes #1 and #2 from above involves a crucial extra step. Consider the following method:
  1. Make sure your own branch is up to date
  2. Determine the revision at which the source material forked from the target
  3. Construct a merge command based on the computed source revision

The step people skip is #2. In our example above, merging from the branch represented by commits r3-4 requires you to specify r3 as the source of merge material, while merging files changed by commit r5 from the branch to the trunk requires specifying r5 to preserve the later merge meta-data (*1). To make this more concrete, imagine that A has a checkout of the trunk, currently at r1, that (s)he intends to update to r5 to reflect the branch's changes. Let's follow the process.
  1. Conducting an update will pull down files from r4, changed in the trunk
  2. Iterating through remaining files in the source tree, two sets of source anchors will be reported: Set X (r2) & Set Y (r5)
  3. Conducting the merge requires issuing two merge commands, each on its respective set from the previous bullet:
    svn merge -r2:HEAD <'branch' URL> <'trunk' path to Set X>
    svn merge -r3:HEAD <'branch' URL> <'trunk' path to Set Y>
    
If committers confine Set X to a single sub-directory, then the commands in the last list item can be issued as they're parametrized: as single <'branch' src> <'trunk' target> tuples. However, if changed files spread across sub-directories, developers conducting merges will have to issue multiple commands, each naming a specific 'branch' source URL and target path. Yes, this frustrates everybody involved. The upshot? As the person conducting the merge, you move slowly and methodically, understanding the changes to each file / directory explicitly (especially where changes have fractured themselves across directories). This has also caused my development lead and me to merge between branches more often than on previous projects, which keeps us more in sync with each other.
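
The bookkeeping described above (one merge command per path, each with its own anchor revision) can be sketched in Python. This is only an illustration: the helper name build_merge_commands, the anchors mapping, and the example.org URL are all made up here, and the function only composes argument lists; it does not run svn.

```python
def build_merge_commands(branch_url, anchors):
    """Return one `svn merge` argument list per path.

    `anchors` maps a path (relative to the trunk working copy) to the
    revision at which that path's branch material forked. Both the mapping
    and this helper are hypothetical, not part of SVN itself.
    """
    base = branch_url.rstrip("/")
    commands = []
    # Order by anchor revision first so paths sharing an anchor stay together.
    for path, revision in sorted(anchors.items(), key=lambda kv: (kv[1], kv[0])):
        source = "{0}/{1}".format(base, path)
        commands.append(["svn", "merge", "-r{0}:HEAD".format(revision), source, path])
    return commands


# Hypothetical layout: two directories anchored at r2, one at r3.
example = build_merge_commands(
    "https://svn.example.org/repos/myapp/branches/feature",
    {"src/ui": 2, "docs": 3, "src/core": 2},
)
for argv in example:
    print(" ".join(argv))
```

Printing the commands before running any of them gives you a chance to review each source/target pair, which fits the slow, methodical approach described above.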

I provide two tools to make the process easier. First, a simple script to accomplish merge step #2: determine the revision at which the source material forked:

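#!/bin/sh
# Print the revision at which the given file / directory (default: .) was last branched.
svn log --stop-on-copy "${1:-.}" | grep '^r[0-9]' | tail -n 1 | cut -f 1 -d ' ' | cut -f 2- -d 'r'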

Name this file something like SVN_determine_revision_anchor and pass it the file / directory of which you desire to know the last branch point.
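
For readers who would rather do the text processing in Python, here is a rough equivalent of the pipeline's parsing step. It assumes the usual `svn log` header layout ("r<N> | author | date | n lines") and works on captured output text rather than calling svn; the function name and the sample log are invented for illustration.

```python
import re

# Header lines of `svn log` start with "r<digits> |".
_HEADER = re.compile(r"^r(\d+) \|")


def branch_point_from_log(log_text):
    """Return the last (oldest) revision number in `svn log` text, or None."""
    revisions = []
    for line in log_text.splitlines():
        match = _HEADER.match(line)
        if match:
            revisions.append(int(match.group(1)))
    # svn log lists newest first, so the final header is the branch point
    # when the log was produced with --stop-on-copy.
    return revisions[-1] if revisions else None


# Made-up output resembling `svn log --stop-on-copy` on the example branch.
SAMPLE_LOG = """\
------------------------------------------------------------------------
r5 | usera | 2011-03-10 09:15:00 -0500 (Thu, 10 Mar 2011) | 1 line

fix parser bug
------------------------------------------------------------------------
r3 | userb | 2011-03-08 14:02:00 -0500 (Tue, 08 Mar 2011) | 1 line

create branch
------------------------------------------------------------------------
"""

print(branch_point_from_log(SAMPLE_LOG))
```
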
If you don't like calling the merge command manually (long URL paths can make this a pain), use something like the following, which I named merge.sh:

 
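#!/bin/sh
# Usage: merge.sh <source-branch> <target-branch>
# Run from a working copy of the target; the merge applies to the current directory.

URL_PREFACE="https://svn.myorg.org/svn/repos/dev/myapp"
URL_SUFFIX="current"

TO_MERGE="${URL_PREFACE}/$1/${URL_SUFFIX}"
TARGET="${URL_PREFACE}/$2/${URL_SUFFIX}"

# Revision at which the source was branched: the oldest entry that
# 'svn log --stop-on-copy' reports.
SUPPLYING_BRANCH_REV=$(svn log --stop-on-copy "${TO_MERGE}" | grep '^r[0-9]' | tail -n 1 | cut -f 1 -d '|' | cut -f 2 -d 'r' | sed 's/ //')

# Current revision of the target, as reported by 'svn info'.
TARGET_BRANCH_REV=$(svn info "${TARGET}" | grep '^Revision' | cut -f 2- -d ':' | sed 's/^[ ]*//')

echo "Merging ${TO_MERGE}@${SUPPLYING_BRANCH_REV} with ${TARGET}@${TARGET_BRANCH_REV}"
svn merge -r"${SUPPLYING_BRANCH_REV}:${TARGET_BRANCH_REV}" "${TO_MERGE}"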

Call this script with two parameters: first the source of the merge information, and second the target tag. You'll note that, in essence, this second script encapsulates the functionality of the first.
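
The two remaining pieces of merge.sh, reading the target's revision from `svn info` output and assembling the final command, might look like this in Python. Treat it as a sketch: the function names, the branch names "feature" and "trunk", and the sample `svn info` text are assumptions for illustration, and nothing here invokes svn.

```python
def revision_from_info(info_text):
    """Return the number on the 'Revision:' line of `svn info` text, or None."""
    for line in info_text.splitlines():
        if line.startswith("Revision:"):
            return int(line.split(":", 1)[1].strip())
    return None


def merge_command(url_preface, url_suffix, source_name, source_rev, target_rev):
    """Compose the argument list for the merge that merge.sh would run."""
    to_merge = "/".join([url_preface, source_name, url_suffix])
    return ["svn", "merge", "-r{0}:{1}".format(source_rev, target_rev), to_merge]


# Made-up `svn info` output for a hypothetical target named "trunk".
SAMPLE_INFO = """\
Path: current
URL: https://svn.myorg.org/svn/repos/dev/myapp/trunk/current
Revision: 4
Node Kind: directory
Last Changed Author: userc
Last Changed Rev: 4
"""

target_rev = revision_from_info(SAMPLE_INFO)
print(" ".join(merge_command(
    "https://svn.myorg.org/svn/repos/dev/myapp", "current", "feature", 3, target_rev
)))
```

Matching on the "Revision:" prefix keeps the "Last Changed Rev:" line from being picked up by mistake.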

Summary
When conducting merges with SVN, please consider the source revision and forks carefully (on a file-by-file basis if necessary). While other mechanisms may work to merge things without this step, combining multiple forks later will likely create unexpected conflicts and difficulty.

(*1) To me, this represents defeat on the part of the version control system. What purpose should a version control system serve if not to keep track of this very branch information for use in resolving merge scenarios?

Tuesday, March 1, 2011

Ensuring Super Class Initialization

Seemingly a very simple concept: how do you guarantee that your constructor ( __init__() ) runs when someone subclasses your Python class? The straightforward method contains a trap:
  

class Base(object):
    def __init__(self):
        print("foo")

class Sub(Base):
    def __init__(self):
        print("bar")

When Sub's __init__() runs, it replaces Base's rather than extending it, so "foo" is never printed. What we want is for both to be called:

  
class Sub(Base):
    def __init__(self):
        Base.__init__(self)
        print("bar")
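
To see the difference without relying on printed output, here is a small sketch in Python 3 syntax (the post itself targets Python 2.6). The class names Forgetful and Careful and the calls list are invented for illustration.

```python
calls = []  # records which initializers actually ran


class Base:
    def __init__(self):
        calls.append("Base")


class Forgetful(Base):
    def __init__(self):
        # No call to the parent initializer, so Base.__init__ never runs.
        calls.append("Forgetful")


class Careful(Base):
    def __init__(self):
        super().__init__()  # Python 3 spelling of Base.__init__(self)
        calls.append("Careful")


Forgetful()
print(calls)  # ['Forgetful']

calls.clear()
Careful()
print(calls)  # ['Base', 'Careful']
```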


Unfortunately, we can't make the person extending our class call the superclass initializer. A poor man's factory pattern removes the possibility of subtype implementers forgetting initialization. Consider:

class AbstractClass(object):
    '''Abstract base class template, implementing factory pattern through
       use of the __new__() initializer. Factory method supports trivial,
       argumented, & keyword argument constructors of arbitrary length.'''

    __slots__ = ["baseProperty"]
    '''Slots define [template] abstract class attributes. Instances of a
       subclass still get a __dict__ unless that subclass declares its
       own __slots__. '''

    def __new__(cls, *args, **kwargs):
        '''Factory method for base/subtype creation. Simply creates an
           (new-style class) object instance and sets a base property. '''
        instance = object.__new__(cls)

        instance.baseProperty = "Thingee"
        return instance

This base class can be extended trivially, using only three (3) lines of code sans comments, as follows:

class Sub(AbstractClass):
    '''Subtype template implements AbstractClass base type and adds
       its own 'foo' attribute. Note (though poor style) that __slots__
       and __dict__ style attributes may be mixed.'''

    def __init__(self):
        '''Subtype initializer. Sets 'foo' attribute. '''
        self.foo = "bar"


Note that though we didn't call the super-class' constructor, the baseProperty will be initialized:


Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from TestFactory import *
>>> s = Sub()
>>> s.foo
'bar'
>>> s.baseProperty
'Thingee'
>>> 

As its comment indicates, the base class AbstractClass need not use slots; it could just as easily 'implicitly' define attributes by setting them in its __new__() method. For instance:

instance.otherBaseProperty = "Thingee2"

would work fine. Also note that the base class' __new__() supports trivial (no-arg) initializers in its subtypes, as well as variable-length positional and keyword argument initializers. I recommend always using this form, as it imposes no extra syntax in the simplest (trivial constructor) case yet allows the more complex forms without extra maintenance.
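
As a rough check of that claim, the sketch below re-creates the pattern in Python 3 syntax with three subtypes whose initializers take no arguments, positional arguments, and keyword arguments. The names AbstractBase, NoArgs, Positional, Keyword, and base_property are illustrative stand-ins, not the classes from the post.

```python
class AbstractBase:
    """Illustrative Python 3 rendering of the factory-style base class."""

    __slots__ = ["base_property"]

    def __new__(cls, *args, **kwargs):
        # Arguments are accepted but unused here; Python hands the same
        # arguments to the subclass __init__() once __new__() returns.
        instance = super().__new__(cls)
        instance.base_property = "Thingee"
        return instance


class NoArgs(AbstractBase):
    def __init__(self):
        self.foo = "bar"


class Positional(AbstractBase):
    def __init__(self, first, second):
        self.pair = (first, second)


class Keyword(AbstractBase):
    def __init__(self, **options):
        self.options = options


# None of the subtypes calls a parent initializer, yet each instance
# carries the base property set in __new__().
for obj in (NoArgs(), Positional(1, 2), Keyword(depth=3)):
    print(type(obj).__name__, obj.base_property)
```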