Friday, May 6, 2011

Private GIT Repository Over SSH w/ SSH-KEY and git-shell

There's documentation out there on using git, on git over SSH, and a wealth of information on using SSH keys for single-command purposes in Chapter 8 of O'Reilly's SSH book. However, if you're trying to build a private git repository:
  • on a Mac (OS X) server,
  • where the user keeps a normal login shell (i.e., you don't want git-shell as the user's shell), and
  • where you can't install gitosis,
you're going to need to do things slightly differently than is documented elsewhere. (If you read further, I presume you have a good reason to forsake gitosis, because all things considered it's a good solution: it's very easy to install, customize, or even extend as source code, and once set up it doesn't impose much in the way of dependencies, maintenance, or overhead.)

Challenges
Apple seems to change its command-line user-management commands between minor OS X versions, so existing blog entries lead to silent failures when you follow their instructions. The reason is subtle: failing to create the correct user-directory elements for a new user will cause clients attempting remote git [fetch|pull] commands to receive only this error message:

bash-3.2$ fatal: The remote end hung up unexpectedly

The following will be written to the OS X server system.log:

sshd(28534) deny mach-per-user-lookup

I've seen these errors result from the user lacking a home directory, from the git repository not having the correct group, and from other circumstances. Because no further error/debug information is given, this can be very frustrating to debug.
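If you hit this wall, two generic diagnostics at least reveal whether key authentication itself is succeeding (the mySource host alias used here is the one defined in the client set-up below):

bash-3.2$ ssh -vvv mySource      # git-shell will refuse the interactive session, but the verbose output shows which key was offered and whether it was accepted
bash-3.2$ GIT_TRACE=1 git fetch  # traces the ssh command git actually runs

Neither pinpoints the mach-per-user-lookup denial, but together they narrow the problem to either the SSH side or the git side.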

Solution
Supporting SSH-based access to your private git repository through a single-use key requires work on both client and server; the server comes first. git-shell allows you to support clone, push, pull, and fetch to and from a server, using an SSH key, while still supporting normal user login on your server for other purposes. Additionally, you need not open ports beyond what you've already carved out for SSH on your server.

Server Set-up

  1. Create (if necessary) and change into the server user's ~/.ssh directory:
     [ -d ~/.ssh ] || mkdir ~/.ssh; chmod 700 ~/.ssh && cd ~/.ssh/
  2. Create a new SSH key (set the parameters as your needs warrant):
     ssh-keygen -b 2048 -f "`whoami`@`hostname`-GIT-SHELL-`date "+%Y-%m-%d"`"
  3. Add this new key as a 'single-use' key to the authorized_keys file:

     [ -e ./authorized_keys ] || touch authorized_keys && chmod 600 authorized_keys
     echo -n "command=\"/usr/bin/git-shell -c \\\"\$SSH_ORIGINAL_COMMAND\\\"\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty " >> authorized_keys
     cat ./`whoami`@`hostname`-GIT-SHELL-`date "+%Y-%m-%d"`.pub >> authorized_keys
    

In case you have trouble resolving the escaped characters in the above lines: the output should look something like this:

bash-3.2$ more authorized_keys 
command="/usr/bin/git-shell -c \"$SSH_ORIGINAL_COMMAND\"",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3...KdeQ== bach@Sanguine.local

Client Setup
Remember to use your public key in the server-side authorized_keys file, and then securely transport the private key to your client users/machines. This represents a departure from what some administrators are used to in other circumstances, and I've seen keys compromised on more than one occasion as a result.

  1. Validate that the user's ~/.ssh directory has permissions 700 (i.e., drwx------)
    1. If not, chmod 700 ~/.ssh
  2. Copy the private key created above into the user's ~/.ssh directory
  3. Verify the private key has permissions 600 (i.e., -rw-------)
    1. If not, chmod 600 the key file
  4. Create a ~/.ssh/config file, if one doesn't exist:
    1. [ -e ~/.ssh/config ] || touch ~/.ssh/config && chmod 600 ~/.ssh/config
       
  5.  Create a host configuration: (configure each host parameter to your needs)
    1. host mySource
              HostName        myHost.org
              Port            22
              User            git
              Compression     yes
              Protocol        2
              IdentityFile    ~/.ssh/git@myHost.org-GIT-SHELL-2011-05-02
  6. Execute your clone, fetch, or pull commands, using the host specifier as the host name in the URL:
     bash-3.2$ git clone ssh://mySource/repository_path/repository.git

The configuration file appears necessary for correct client operation. Specifying the username, host, and port directly on the command line of a git command, as follows:

bash-3.2$ git clone ssh://git@mySource:22/repository_path/repository.git

does not work, and produces the otherwise silent fatal error discussed earlier. The git remote set-url or git remote add commands can be used as well; the hostname element of their URL must simply match the host specifier in the ~/.ssh/config file (in our case, mySource). For example, a user might execute the following command:

git remote add repo ssh://mySource/repository_path/repository.git

so that they can subsequently issue commands such as:


git fetch repo master


Considerations
The above solution allows you to grant git-shell access, through a single-use SSH key, to an otherwise normal system user. You can add separate SSH keys for different client users, though each will adopt the UID of the user under which you install them for local, server-side operations.
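For illustration, an authorized_keys file granting git-shell-only access to two different client users might contain two forced-command entries like the following (the key material and user names are placeholders); whichever key authenticates, git-shell runs under the same server account:

command="/usr/bin/git-shell -c \"$SSH_ORIGINAL_COMMAND\"",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAA...KdeQ== alice@laptop
command="/usr/bin/git-shell -c \"$SSH_ORIGINAL_COMMAND\"",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAA...KdeQ== bob@workstation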

If you need the capability to do directory-, repository-, or user-based access control, ssh-keys will not provide sufficient fine-grained resolution/control. In this case, you'll want to go directly to gitosis.  

Friday, April 15, 2011

Adding Maven Dependencies (for Google App-Engine)

Maven provides a decent system for managing builds involving gnarly dependencies. If you're starting from scratch with an App Engine project, just go here. However, newer developers and clients with whom I work complain consistently that the difficulty of using Maven increases substantially once an archetype has been chosen but developers want to absorb another largish development or deployment framework into the same build. The prognosis for static analysis isn't good in these cases either.

Challenges
Though many SpringSource frameworks work well on Google's App Engine without change, I found blog postings on adding GAE support to an existing Maven pom.xml lacking. I had to field emails when "stuff didn't just work". In particular, two nagging problems arose: 1) which framework jars builds actually depended on, and 2) the blog entries providing clean cut-and-paste solutions were all out of date.

No sooner had I shown someone how to augment a maven POM than Google upgraded their app-engine. Sure enough, I passed them code and it didn't work. They had two versions of gae on their system but maven was building against the old version of the app-engine.

Another individual fought an issue because they cut and pasted a scheme from a well-formulated (but broken) blog entry on the topic. Specifically, when a groupId, artifactId tuple is repeated for multiple files, Maven's behavior is to silently overwrite the existing repository data with the newly supplied file. This means that the following:

mvn install:install-file -Dfile=${LIB}/appengine-tools-api.jar \
  -DgroupId=com.google \
  -DartifactId=appengine-tools \
  -Dversion=${VERS} \
  -Dpackaging=jar \
  -DgeneratePom=true
 
mvn install:install-file -Dfile=${LIB}/shared/appengine-local-runtime-shared.jar \
  -DgroupId=com.google \
  -DartifactId=appengine-tools \
  -Dversion=${VERS} \
  -Dpackaging=jar \
  -DgeneratePom=true

does not do what you want. Instead of having two jars included in your build as part of the com.google:appengine-tools dependency, you get only the runtime-shared jar (the last file installed wins).
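One way around the collision is to give each jar its own artifactId so that both survive in the local repository (a sketch; the artifactId values below are one workable naming choice, not an official Google convention):

mvn install:install-file -Dfile=${LIB}/appengine-tools-api.jar \
  -DgroupId=com.google \
  -DartifactId=appengine-tools-api \
  -Dversion=${VERS} \
  -Dpackaging=jar \
  -DgeneratePom=true

mvn install:install-file -Dfile=${LIB}/shared/appengine-local-runtime-shared.jar \
  -DgroupId=com.google \
  -DartifactId=appengine-local-runtime-shared \
  -Dversion=${VERS} \
  -Dpackaging=jar \
  -DgeneratePom=true

Each jar then gets its own <dependency> element in the POM.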

Finally, those beset by the need to do things themselves often need to look up several XSLT syntactic constructs to get correct transforms capable of scraping a POM for all the dependencies that may need to be upgraded, removed, or otherwise tweaked. This is particularly the case because matching a POM dependency requires a multi-element match, governed in the case of my code by parameters passed to the stylesheet.
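As a minimal sketch of that kind of multi-element match (assuming a pom.xml without the Maven XML namespace declared; a namespaced POM needs a prefix such as pom: bound to http://maven.apache.org/POM/4.0.0 in the stylesheet), the following prints the version of the one dependency selected by two stylesheet parameters:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="groupId"/>
  <xsl:param name="artifactId"/>

  <!-- XSLT 1.0 forbids variable references in match patterns, so match every
       dependency and filter with xsl:if instead -->
  <xsl:template match="dependency">
    <xsl:if test="groupId = $groupId and artifactId = $artifactId">
      <xsl:value-of select="version"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- suppress the built-in rule that would otherwise copy every text node -->
  <xsl:template match="text()"/>
</xsl:stylesheet>

Driven by xsltproc (the stylesheet file name is arbitrary): xsltproc --stringparam groupId com.google --stringparam artifactId appengine-api scrape-version.xsl pom.xml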

Solution
On the POM side of things, you want to look for scenarios where an old version of a dependency exists, such as below:

<dependency>
 <groupId>com.google</groupId>
 <artifactId>appengine-api</artifactId>
 <version>1.4.2</version>
 <scope>compile</scope>
</dependency>


Note the explicit version number. What's preferable is managing your dependencies in a parametrized fashion. Comparing x.y.z-format version numbers is obnoxious; so is chasing down each related dependency in a file to check its version number. For these (amongst other) reasons, a more maintainable approach parameterizes the version as a property reference and changes that single instance on upgrade.

For instance:

<dependency>
        <groupId>com.google</groupId>
        <artifactId>appengine-api</artifactId>
        <version>${google.app-engine.version}</version>
        <scope>compile</scope>
</dependency>


Followed, of course, by:


  <properties>
    <google.app-engine.version>1.4.3</google.app-engine.version>
  </properties>


Since I'm building kit to help security folk work with and inject secure snippets into existing development projects, I decided to build a utility that helps folk get things up and running on GAE too (the average familiarity with Maven in security isn't as high as it typically is among developers) by doing the above. The utility I wrote to handle these situations does the following:
  1. Parses an existing pom.xml looking for existing out-dated or half-working dependencies
  2. Iterates through specified dependencies, modifying the POM
    • Adds unmentioned groupId, artifactId tuples
    • Modifies existing groupId, artifactId tuples
    • Comments out replaced collisions for later inspection
    • Points out collisions for further inspection
    • Controls version references using properties, which are also added/modified as necessary
  3. Installs listed dependency files in the user's local maven repository
Ideally, proper version parsing, comparison, and collision detection would be possible, but that would take real time. The script supplied here can be used with any dependency, not just Google's App Engine; in fact, use it to manage any dependency you expect to change with regularity, even if your project already uses a different archetype. The pre-requisites below show how.
    Pre-Requisites
    Pre-requisites are part of the problem here: a snag people run into when they're new to build configuration has always been that they're missing some tool they need to get things compiling. This was true of gmake and autoconf and it's true of maven builds nearly as often in my experience. So, when I wrote this utility, I purposefully used only bash, mvn, xsltproc, javac, and sed. All of these utilities are available on Linux and OS X out of the box.

    When gearing up to use inject-gae-dependency, simply:
    • Use an existing pom.xml
    • Modify the lists of dependencies (by editing the script's enumerations)
    • Download Google's App Engine SDK (or whatever framework is to be managed secondarily)
    • Optionally salt-and-pepper to taste
    Running
    Using the tool is straightforward. Usage works as follows:

    
    ./bin/install-google-appengine-dependencies.sh <path to Google app-engine Java SDK>

    Status is printed to STDOUT and maven logging occurs to ./mvn.log

    Download
    Download the current code base from its Google Code Repository or use SVN to do a checkout using the URL:

    svn checkout http://code-poetic.googlecode.com/svn/trunk/mvn-inject-gae mvn-inject-gae-read-only

    Sunday, March 13, 2011

    Merging Branches with SVN

    When I read things like "the world's most popular open source version control system" and (paraphrased) "designed to fix CVS's problems", I don't think "took important steps backwards", but I find aspects of SVN have done just that. In particular, merging tags/branches becomes a challenge when the merge source has multiple revision anchors.

    As expected, one can accomplish simple merges quickly using the explanations of commands available in the SVN Red-bean Book. However, if you're working with multiple committers, or even one or two folk who liberally tag/branch, you'll run into unexpected results quickly. The problem stems from the fact that SVN's merge command does not automatically resolve the previous branch points for a merge on a file-by-file basis. As such, when you give a
    svn merge -r<source>:<target> ... 
    command, you may not be giving SVN enough information to do the correct thing. This is particularly the case if you or your fellow committers do things like merging in bulk from the project's root directory.

    Consider an example:

    (*)         - r1 User A
     |
    (*)         - r2 User A
     |  \
     |    \ 
     |     (*)  - r3 User B
    (*)     |   - r4 User C
     |   \  |
     |     (*)  - r5 User A 
     |     /
     |   /
     (?)
    

    In the case visualized above, we have multiple users committing code to a trunk and a branch. Remember, SVN uses a single, global (to the repository, not the project), incrementally increasing value to represent revision. In the case represented, a user (A) commits twice (r1 and r2) and then a second user (B) creates a branch (r3).

    Subsequent to that branch, A continues developing while C commits to what we'll refer to as the 'trunk'. Then, maybe because of a bug fix or a lunch discussion, A commits to the branch. What we have in commit r5 is a situation in which files/changes from the trunk (r4) were drawn into the branch before modifications were made.

    At question: how does A merge trunk and branch with 1) the least amount of pain and 2) the highest resolution of change meta-data preserved (thus making subsequent situations like the r4-->r5 merge less painful)?

    While SVN supports multiple approaches to branching/merging, I've found the solution to this problem that optimizes #1 and #2 from above involves a crucial extra step. Consider the following method:
    1. Make sure your own branch is up to date
    2. Determine the revision from which source material forked from target
    3. Construct a merge command based on the computed source revision

    The step people skip is #2. In our example above, merging from the branch represented by commits r3-r4 requires you to specify r3 as the source of merge material, while merging files changed by commit r5 from the branch to the trunk requires specifying r5 to preserve the later merge meta-data (*1). To make this more concrete, imagine that A has a checkout of the trunk, currently at r1, that (s)he intends to update to r5 to reflect the branch's changes. Let's follow the process.
    1. Conducting an update will pull down files from r4, changed in the trunk
    2. Iterating through remaining files in the source tree, two sets of source anchors will be reported: Set X (r2) & Set Y (r5)
    3. Conducting the merge requires issuing two merge commands, each on its respective set from the previous bullet:
      svn merge -r2:HEAD <'branch' URL> <'trunk' path to Set X>
      svn merge -r3:HEAD <'branch' URL> <'trunk' path to Set Y>
      
    If committers confine Set X to a single sub-directory, then the commands indicated in the last list item can be issued as they're parametrized: as single <'branch' src> <'trunk' target> tuples. However, if changed files are spread across sub-directories, developers conducting merges will have to issue multiple commands, each specifying specific 'branch' source URLs and target paths. Yes, this frustrates everybody involved. The up-shot? As the person conducting the merge, you move slowly and methodically through a merge, understanding changes to each file / directory explicitly (especially where changes have fractured themselves across directories). This has also caused me and my development lead to merge between branches more often than on previous projects--causing us to remain more in sync with each other.

    I provide two tools to make the process easier. First, a simple script to accomplish merge step #2: determine the revision at which the source material forked:
    
    #!/bin/sh
    # Print the revision at which the given file/directory was last branched
    # (its copy point): the anchor to use as the merge source revision.
    svn log --stop-on-copy "$@" | grep '^r' | tail -n 1 | cut -f 1 -d ' ' | cut -f 2- -d 'r'

    Name this file something like SVN_determine_revision_anchor and pass it the file or directory whose last branch point you want to know.
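    For example (the path and the reported revision below are illustrative only):

    bash-3.2$ ./SVN_determine_revision_anchor branches/bugfix/src
    3
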
    If you don't like calling the merge command manually (long URL paths can make this a pain), use something like the following, which I named merge.sh:

     
    #!/bin/sh
    # merge.sh: merge one branch/tag into a working copy of another, computing
    # the supplying branch's fork revision automatically (merge step #2 above).

    URL_PREFACE="https://svn.myorg.org/svn/repos/dev/myapp"
    URL_SUFFIX="current"

    TO_MERGE="${URL_PREFACE}/$1/${URL_SUFFIX}"
    TARGET="${URL_PREFACE}/$2/${URL_SUFFIX}"

    # Revision at which the supplying branch was copied (its fork point)
    SUPPLYING_BRANCH_REV="`svn log --stop-on-copy ${TO_MERGE} | grep "^r[0-9]" | tail -n 1 | cut -f 1 -d '|' | cut -f 2 -d 'r' | sed 's/ //'`"

    # Current (HEAD) revision of the merge target
    TARGET_BRANCH_REV="`svn info ${TARGET} | grep Revision | cut -f 2- -d ':' | sed 's/^[ ]*//'`"

    echo "Merging ${TO_MERGE}@${SUPPLYING_BRANCH_REV} with ${TARGET}@${TARGET_BRANCH_REV}"

    svn merge -r${SUPPLYING_BRANCH_REV}:${TARGET_BRANCH_REV} ${TO_MERGE}

    Call this script with two parameters: the first is the source of the merge information and the second is the target branch/tag. You'll note, in essence, that this second script encapsulates the functionality of the first.
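    For example (branch name and revisions illustrative only), run from the root of a working copy of the merge target, since the final svn merge applies its changes to the current directory:

    bash-3.2$ ./merge.sh branches/bugfix trunk
    Merging https://svn.myorg.org/svn/repos/dev/myapp/branches/bugfix/current@3 with https://svn.myorg.org/svn/repos/dev/myapp/trunk/current@5

    followed by svn's usual merge output.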

    Summary
    When conducting merges with SVN, please consider the source revision and forks carefully (on a file-by-file basis if necessary). While other mechanisms may work to merge things without this step, combining multiple forks later will likely create unexpected conflicts and difficulty.

    (*1) To me, this represents defeat on the part of the version control system. What purpose should a version control system serve if not to keep track of this very branch information for use in resolving merge scenarios?

    Tuesday, March 1, 2011

    Ensuring Super Class Initialization

    Seemingly a very simple concept: how do you guarantee that, when someone sub-classes your Python class, your constructor ( __init__() ) runs? The straightforward method contains a trap:
      
    
    class Base(object):  
       def __init__(self):
          print "foo"
    
    class Sub(Base):
       def __init__(self): 
          print "bar"
    When the Sub class's __init__() executes, it squashes the Base class's. What we want is for both to be called:

      
    class Sub(Base):
       def __init__(self):
          Base.__init__(self)
          print "bar"


    Unfortunately, we can't make the person extending our class call the super class's initializer. A poor man's factory pattern alleviates the possibility of subtype implementers forgetting initialization. Consider:

    class AbstractClass(object):
        '''Abstract base class template, implementing factory pattern through
           use of the __new__() initializer. Factory method supports trivial,
           argumented, & keyword argument constructors of arbitrary length.'''

        __slots__ = ["baseProperty"]
        '''Slots define [template] abstract class attributes. No instance
           __dict__ will be present unless subclasses create it through
           implicit attribute definition in __init__() '''

        def __new__(cls, *args, **kwargs):
            '''Factory method for base/subtype creation. Simply creates an
               (new-style class) object instance and sets a base property. '''
            instance = object.__new__(cls)

            instance.baseProperty = "Thingee"
            return instance

    This base class can be extended trivially, using only three (3) lines of code sans comments, as follows:

    class Sub(AbstractClass):
        '''Subtype template implements the AbstractClass base type and adds
           its own 'foo' attribute. Note (though poor style) that __slots__
           and __dict__ style attributes may be mixed.'''

        def __init__(self):
            '''Subtype initializer. Sets 'foo' attribute.'''
            self.foo = "bar"
    
    

    Note that though we didn't call the super-class' constructor, the baseProperty will be initialized:


    Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from TestFactory import *
    >>> s = Sub()
    >>> s.foo
    'bar'
    >>> s.baseProperty
    'Thingee'
    >>> 
    

    As its comment indicates, the base class AbstractClass need not use slots; it could just as easily 'implicitly' define attributes by setting them in its __new__() initializer. For instance:

    instance.otherBaseProperty = "Thingee2"

    would work fine. Also note that the base class's initializer supports trivial (no-arg) initializers in its subtypes, as well as variable-length argumented and keyword-argument initializers. I recommend always using this form, as it doesn't impose syntax in the simplest (trivial constructor) case but allows for the more complex functionality without imposing maintenance.
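    For instance, a hypothetical subtype with an argumented initializer still gets baseProperty set by the factory __new__() before its own __init__() runs:

    class ArgSub(AbstractClass):
        '''Hypothetical subtype taking positional and keyword arguments.'''
        def __init__(self, foo, bar="default"):
            self.foo = foo
            self.bar = bar

    s = ArgSub("one", bar="two")
    print s.baseProperty   # 'Thingee' -- set by AbstractClass.__new__()
    print s.foo, s.bar     # one two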

    Monday, January 31, 2011

    Pythonic Extrinsic Visitor

    Regardless of the language chosen to express it, the Visitor pattern remains one of the more basic behavioral patterns described by the "Gang of Four's" Design Patterns book. Herein I describe a Pythonic Visitor implementation that overcomes common language-based limitations of C, C++, Java, and C# Visitors and present a usable implementation that goes well beyond the available samples and snippets. In particular, the implementation provided addresses two commonly omitted but crucial features "left as an exercise to the reader": correct method resolution/dispatch of types within an inheritance hierarchy, and caching of method-resolution results for performance's sake. More on those and other features later.

    Challenges
    Implementing a visitor pattern in practice poses challenges:
    1. Applying visitors to existing type hierarchies demands refactoring potentially large sets of visited types to include an accept() method
    2. The refactoring described by the previous bullet, in effect, couples the visitor and visited hierarchy through the accept()/visit() interface. Additions to the visited type hierarchy thus commonly demand commensurate changes to the visitor API and potentially its implementation
    3. The behavior of language-specific polymorphic dispatch (in this case double-dispatch) often eludes junior and senior programmers alike.
    Visitors take different forms, actually. Spurred by problems resulting from bullets #1 and #2 above, some design and implement Extrinsic Visitors. Informally, an extrinsic visitor makes decisions about how to 'visit' a visitable object without relying on double-dispatch. Put another way: the visitor need not call a method on the target visitable object itself to 'reclaim' its specific type before selecting the applicable visit() method.

    Challenges to Extrinsic Visitation
    When Developers choose languages such as Java or C# to implement an extrinsic visitor they must rely on reflection or introspection (respectively) in order to reclaim the visitable object's specific type without invoking a method of that object's instance. During code review, I've observed sloppy implementations revert to a switch/case statement failure mode--again introducing brittle dependence between the visitor implementation and the visited type hierarchy (#2 above). Bullet #3 (above) remains a problem, but challenges represented by bullets one and two are replaced with the following:
     
    1. Sandbox security policy (Applets, etc.) may prevent use of reflection/introspection in some environments
    2. Reflection/introspection can cause a costly speed hit depending on the language, and VM implementation (not nearly as much a problem with modern Java 1.5+)
    3. Depending on visitor design, visitor API may remain rigidly tied to visited hierarchy demanding an explicit concrete method for each visitable type (as in #2 above)
    Search Google for "Extrinsic Visitor in Python" and you will get plenty of results. Often, the implementations provided exhibit the problems described by the bullets above or leave their solutions as exercises to the reader. Unfortunately, many programmers I encounter simply do not possess the familiarity with Python's object model or method-resolution mechanics (mro) to quickly accomplish the goal. Having fought my own lack of Python knowledge (it's hard to un-seat an understanding of Java), I can sympathize.

    Pythonic Extrinsic Visitor: Design
    The design of my Pythonic visitor bears the following properties:
    • Extrinsic
      • The visitor and its sub-classes require no invocation of visitable type methods (double-dispatch) in order to reclaim specific target type
      • The visitable type requires no refactoring (IE: no accept() method necessary)
    • The visitor employs Python mro for "C++-style" best-match visit() method dispatch
    • The first method-resolution result is cached to avoid the lookup hit on subsequent calls to the visit() method for a particular type
    • Support for NoneType (IE: visitor.visit(None))
    Sections below describe each design element above in turn.

    Extrinsic
    At its heart, the visitor class has only two public methods:
    • def visit(self, visitable)
    • def defaultVisitor(self, visitable)
    Along with the class' constructor, these make up the visitor API. As with any extrinsic visitor, developers enjoy the following simplified calling convention:

    fruit = grocer.pick("Banana")
    visitor = MyVisitor()
    visitor.visit(fruit)
    Snippet - A - Simplest Visitor Test Driver

    Presume here that grocer.pick() represents a factory method capable of returning a variety of fruit sub-types based on their string name. The defaultVisitor() method provides developers the ability to specify a 'catch-all' visitation function, capable of handling any type passed to visit() as its parameter. Developers simply override the defaultVisitor() method in their subclass. Thus, the following represents a "Hello World" attempt at extending the visitor provided:
    class MyVisitor(visitor):
        def defaultVisitor(self, visitable):
             print '%s' %(visitable)
    The visitor base class provides a de facto implementation of the defaultVisitor() method, so Developers need not override it. The de facto implementation throws TypeError when method resolution cannot find a specific handler match.

    The key feature of this visitor implementation is that those extending only need to implement those visit (handler) methods they desire explicitly. Developers do this by adding methods to their subclass in the form:
    def visit<TypeName>(self, visitable): pass
    So,  from the example above, a Developer might choose to do the following:


    class MyVisitor(visitor):
        def visitBanana(self, visitable):
             print '%s' %(visitable) 

    Because this visitor employs a 'best match' dispatch, the following would have the same effect in our test driver captioned "Snippet A".

    class MyVisitor(visitor):
        def visitFruit(self, visitable):
             print '%s' %(visitable)

    More on how the implementation accomplishes this in the next section.

    Best Match
    The best-match property is the key to overcoming brittle coupling and tedious visitor-implementation updates. As designed, the visitor described herein employs a lookup method (_lookupMethod(), shown in the caching section below) to accomplish this. Roughly, the algorithm employed follows:

    def visit(self, visitable):
        clazz = visitable.__class__

        for _type in clazz.mro():
            handler = 'visit%s' % (_type.__name__)
            visit_method = getattr(self, handler, None)
            if visit_method is not None:
                return visit_method(visitable)
        return self.defaultVisitor(visitable)

    Roughly, this algorithm iterates through the visitable object's class hierarchy and, for each class, looks for a method on the visitor object named visit<TypeName>(). Remember, mro() returns an ordered list (most specific to most generic).
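    For example, with a hypothetical Fruit/Banana hierarchy like the one behind grocer.pick() in Snippet A:

    class Fruit(object): pass
    class Banana(Fruit): pass

    print Banana.mro()   # [Banana, Fruit, object] -- most specific first

    So, for a Banana instance, the lookup tries visitBanana, then visitFruit, then visitobject, before falling back to defaultVisitor().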

    Caching
    Caching method resolution proved the most difficult task in creating the presented implementation. The critical decision: in what scope to place the method-resolution cache. Placed at a high level (class object, module scope, or similar), performance will increase, but the caching scheme would need to be more advanced to handle circumstances in which multiple visitor sub-classes visit overlapping visitable type trees. Seeking to avoid this complexity, I chose to place the method cache in the visitor object instance.

    def visit(self, visitable):
        clazz = visitable.__class__
        try:
            visit_method = self._method_map_cache[clazz]
        except KeyError:
            visit_method = self._lookupMethod(visitable, clazz)
        return visit_method(visitable)

    def _lookupMethod(self, target, clazz):
        for _type in clazz.mro():
            handler = 'visit%s' % (_type.__name__)
            visit_method = getattr(self, handler, None)
            if visit_method is not None:
                self._method_map_cache[clazz] = visit_method
                return visit_method
        self._method_map_cache[clazz] = self.defaultVisitor
        return self.defaultVisitor
    This implementation's scheme performs sufficiently well for even tight-loop operations (such as AST, DOM, and other large tree visitations) but avoids visitor/visited type-tree collisions. For more information, see the test suite. While not perfect, the implementation strategy above avoids opaque error messages and some of the odder stack-frame insertions that complicate debugging (as experienced with other visitor schemes).

    NoneType
    The dispatching schemes as adopted by this algorithm (and other samples available online)  fall prey to a simple failure mode: a caller passing a None rather than a valid instance (IE: v.visit(None)). Sample code available online almost always omits support for this corner case, despite how common it can be in tree visitations and collection iteration.  This implementation solves the problem of handling None types with a special case prefacing the lookup method:

    class dummy_lookup(object):
        '''Stands in for a class object when the visitable passed in is None.'''
        def mro(self):
            return [type(None)]

    def _lookupMethod(self, target, clazz):
        # 'target' is the visitable instance handed to visit(); None gets the
        # dummy class so the mro() iteration below still has something to walk.
        if target is None:
            clazz = dummy_lookup()

       ...

    The effect of the dummy lookup class is to serve the mro() iterator that follows in the lookup method.

    Configurable Visit method
    The implementation provides an additional feature to avoid name-based collisions with existing visitation logic that an adopter may want to move to the visitor pattern. The implementation's constructor takes a visit_method= keyword parameter that defines what prefix visitor subclass definitions will use to define specific visit methods. Use this feature when defining your visitor as follows:

    class myVisitor(visitor):
        def __init__(self, *args, **kwargs):
            kwargs['visit_method'] = 'handle'
            visitor.__init__(self, *args, **kwargs)

        def handle(self, visitable): return self.visit(visitable)

        def handleType1(self, visitable): pass
        def handleType2(self, visitable): pass

    This mechanism allows developers to hide the visitor pattern's name from those calling it. Electing to call the visit method by another name (PrettyPrint, handle, etc.) can be more intuitive.

    Download
    Download the current code base from its Google Code Repository or use SVN to do a checkout using the URL:

    svn checkout http://code-poetic.googlecode.com/svn/trunk/python/lib code-poetic-read-only


    Prerequisites
    • Python 2.x

    Notes
    N/A.

    Thursday, January 20, 2011

    Creating Dictionaries & Cracking Passwords on Mac OS X

    If you've forgotten your password on a Mac OS X machine, for either a disk image (sparse bundle, sparse image, etc.) or keychain, you know a feeling of hopelessness. If you're going to attempt to break into your own encrypted store, you have two difficulties:
    1. Generating a minimal password dictionary
    2. Adapting a dictionary password test driver to OS X password-checking 
    Test Drivers: Simple Examples Rarely Suffice 
    When, frantically, I realized back in 2008 that I was unable to guess my image password manually, I wasn't satisfied with the image-cracking utilities a Google search returned. The simplified-for-demonstration-purposes test drivers provided in forums didn't handle the kinds of characters my passwords contained (redirects '<', '>' and pipes '|').

    I created two Bash scripts to drive unlock attempts from a created password dictionary:
    • unlock_image (using hdiutil attach)
    • unlock_keychain (using security unlock-keychain)
    Each conforms to shell scripting norms, returning 0 on success (and printing the successful password), returning 1 on failure, and returning 2 on error.  
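    A minimal sketch of the kind of loop such a driver runs might look like the following (dictionary.txt and disk.sparseimage are placeholder names). hdiutil's -stdinpass option reads the passphrase from standard input, which side-steps the shell-quoting problems those redirect and pipe characters cause:

    #!/bin/sh
    # Try each candidate password in dictionary.txt against disk.sparseimage
    while IFS= read -r candidate; do
        if printf '%s' "$candidate" | hdiutil attach -stdinpass disk.sparseimage > /dev/null 2>&1; then
            echo "$candidate"   # success: print the working password
            exit 0
        fi
    done < dictionary.txt
    exit 1                      # dictionary exhausted without a match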

    Optimizing Minimal Dictionary Creation
    Worse, because my passwords typically use a large variety of character sets, I feared coming up with a minimal dictionary would prove too difficult. Back-of-the-napkin estimations on 1) cost per password attempt (time) and 2) dictionary size indicated success would demand some tricks.

    Generating an inclusive dictionary quickly proved impossible. The image I was trying to crack had a twelve (12) to fourteen (14) character password, containing (what I believe to be) some combination of fourteen (14) unique characters. This produced, without compression or clever storage solutions, a multi-hundred-gigabyte dictionary that would have required 48 years to attempt exhaustively using my four-machine set-up.

    Robust rule-based password-dictionary creation utilities exist, but I found these projects a false optimization for cracking one's own password. You know some aspects of your password's character set and structure, but I found that baking too many rules into dictionary creation (producing few-minute or few-hour attempts) always came up empty--my dictionaries were too restrictive.

    As a result, I created a simple dictionary creation utility that allowed its user to specify a few password qualities:
    • User-specified password dictionary character set
    • Minimum password length
    • Maximum password length
    • Are password characters unique (or may they be repeated?)
    • Fixed password preface
    • Fixed password suffix

    With this utility I created dictionaries inclusive-enough to find the passwords I'd forgotten while (I believe) striking a decent balance between complexity of dictionary specification and resulting password dictionary size.

    Download
    To download, build, and use the utilities, please go to its Google Code Repository:

    Google Code - dmg_password_crack

    Prerequisites
    • Mac OS X
    • Maven2
    Notes
    • Implementation requires only modest memory: 384K of RAM at steady-state under default settings
    • Implementation appears I/O bound, not bound by syscalls (malloc/free), memory, or computation
    • Password dictionary creation code does not guard explicitly against bad or malicious input. Even careless values (such as negatives) create unexpected results.
    • Recursion and memory allocation, as implemented, lessen LoC but will not perform as well as a well set-up fixed array and iteration
    • hdiutil attach remains the 'long pole' (critical path, most onerous step, etc.) in utilities
    • Code compiles error-/warning-free on Apple's gcc 4.2.1, i686 and performs well on Valgrind leak check, but fails to conform to OOA/D and style conventions
    • Code passes four included integration-level tests on OS=OS X 10.6.6 arch=x86_64