Showing posts with label Python. Show all posts
Showing posts with label Python. Show all posts

Tuesday, June 25, 2013

Pitfalls of Unicode Usernames and Solutions in Canonical Form and UID

In a blog entrySpotify Labs recently shared their experience with difficulties supporting unicode usernames and differentiating them from their ascii homoglyphs. The entry describes, conceptually, how they isolated the problem's cause and worked around it.

The blog entry is worth a read, it does a great job of outlining the problem, tracing symptoms to root cause, and positing conceptual fixes. Key design and implementation specifics, however, remain uncovered. Readers will continue to struggle to build a workable unicode aware username scheme. This post attempts to tackle those 'last mile' design issues, providing a demonstrative stub implementation and test cases helpful in exploring particulars.

Challenges

Users may want fancy unicode usernames to differentiate themselves. The spotify blog provides an example: "ᴮᴵᴳᴮᴵᴿᴰ". ❤❤ LuvYa❤❤ might be another. Site owners, on the other hand, want it to easily identify unique users and display their usernames in a variety of views. Site owners also want to make sure users can effectively differentiate those they know from those they don't. For instance, user Ohm (Ω) will be indistinguishable from user Omega (Ω). Users want this too--though they may not yet realize it. So, our requirements are:
  • Users may freely select fancy unicode usernames (or other profile elements);
  • The site can easily and consistently identify unique users;
  • Sites can easily display fancy usernames in its many view contexts as intended (instead of escaped);
  • Users can differentiate those users they know from those with which they're not familiar.
Recently, sites appear to be enforcing restrictive username and credential requirements rather than meeting the above challenges. This may be because of the difficulties involved or perhaps because security consultants and their standards/"best-practices" argue that "special characters" are dangerous. Making matters worse for Spotify they appears to be using Python 2.x, a language known for its awkward and poor support for unicode.

I find it deplorable when security professionals impact function or performance because they're too lazy to come up with a better solution; security should aim to ease user workflow and help performance where possible. As a user, I vote as such with my money: I closed my Discover credit account when they locked my account because my username used special characters--suddenly falling afoul of their new [WEAKER] security standards.

Solution

Solving for the above requirements involves two major components: a) consistent escaping and encoding of input from potentially different sources and b) universally unique identifier (UUID) creation.  Both solution elements end up presenting significant challenges in large enterprise systems--especially those already in place without them. To hedge against difficulty, this entry combines several heuristics and design elements:

  • Format data consistently;
  • Decouple UID and freeform text;
  • Encapsulate data and operations within strong types;
  • Establish boundaries across which guarantees are enforced. 

The hope is that each is straightforward enough to easily complete and that when combined in a mutually reinforcing manner, they create a workable solution.  Let's start, in no particular order, with establishing boundaries with guarantees.

Boundaries with Data Property Guarantees

Where  possible define and enforce boundaries across which components, services, or clients can assert key properties about data's form. Some call these boundaries "trust boundaries". Avoid this word because it's unclear what "trust" means. Be specific about what properties are being asserted across the boundary. Always design assertions around a provider asserting a guarantee rather than a consumer. 

Candidate boundaries may be a) data stores, b) web services, c) transaction processors, d) rendering components, and other elements whose behavior or structure are well-specified. From a technology perspective, persistence frameworks, REST APIs, and message buses may represent candidates through which a façade, proxy, or similar design pattern could enforce such a boundary and its guarantees. 

The attractiveness of such boundaries and their guarantees is clear: if we could assert the format and content of data at an interface, we'd not be confused about whether or not its been escaped yet. We'd know its encoding definitively. As with anything so attractive, achieving this goal is commonly impossible. These boundaries are like a Maginot line. Any break and the guarantee fails. So, inject data guarantee functionality into technologies that implement helpful patterns (like the façade). Watch for back-doors, chutes, and stovepipes around them. Meticulously document the format and content consumers can expect from these data sources. Now, onto those formatting guarantees.  

Consistent Data Formatting

A few common problems plague data formatting attempts. Some include:

  • Double escaping;
  • Data destruction in transcoding (e.g. data loss between unicode and UTF-8)
  • Unintended interpretation of formatted data in unexpected contexts:
    • e.g. by nest contexts such as HTML enclosing scripts
    • e.g. by downstream processors such as SQL parsers
In order to correctly and consistently format data, the design should:

  • Prefer idempotent formatting operations;
  • Consistently apply non-idempotent formatting actions in a specific order-of-operations;
  • Where possible, mark data as to current state of format;
  • Explicitly document input format expectations, transforms, and output format.

Idempotent transforms

Ideally, any transform applied to data will be idempotent. In other words, the transform produces the same output regardless of how many times its applied to the input in succession. i.e.
  canonize(canonize(x)) == canonize(x)
Failure to apply an idempotent transform causes the 'double escape' problem described above. It's easy to see how escaping operations are not idempotent.   For example:
>>> s='Ω'>>> s2 = s.encode('string-escape')>>> s'\xe2\x84\xa6'

>>> s2'\\xe2\\x84\\xa6'
Each time the encode() function is applied the result receives an additional level of escaping. This results in the following:
>>> print sΩ>>> print s2\xe2\x84\xa6
Luckily, in Python, the decode() function provides what perhaps some developers intend:
>>> print s.decode('string-escape')Ω>>> print s2.decode('string-escape')Ω
The differences above can confuse even more experienced developers not familiar with Python's string handling and escaping.  In order to address these idiosyncrasies we use python's unicodedata.normalize()(see docs) method.
def canonize(clazz, unwashed):
        # do some stuff ...
        washed = unicodedata.normalize('NFD', unwashed)
        # do other stuff...
        return washed
The above helps us exhibit idempotent behavior.

Order of Operations

Some of the operations we employ will destroy data (truncate() for instance). For this reason and those discussed previously (escaping) formatting operations are not transitive. i.e.
A(B(x)) != B(A(x) AND escape(trunc(x)) != trunc(escape(x))
Solutions, therefore, must pay very close attention to the the order in which operations are applied. In the past, the following order has worked well:
  1. Encode
  2. Substitute
  3. Truncate
  4. Escape
  5. Mark / Decorate
Plenty of reasons exist to change this order. For instance, some systems benefit from applying truncation first to avoid choking on large (malicious) inputs. In these cases, operations may have to be re-applied later because of the effect intermediate operations have (for instance, escaping may add to length after a previous truncation). 

In some cases components need to reverse these operations. This is not always possible where a transform has destroyed data (like truncate(), strip(), and their ilk). When reversing operations, apply them in reverse order.  This works well when combined with object- or API-based boundaries:
class MyObject:
  def __init__(self):
     self.__stuff = Nonedef getStuff(self):
     return self.__stuff.decode('unicode-escape')
 
 
  def setStuff(self, stuff):
     self.__stuff = stuff.encode('unicode-escape')
 
 
  def processStuff(self):
    yum(self.__stuff) 

Encapsulation holds and we manipulate the object escaped within the type's bounds. When passed across the object's boundary (with the getter) we always tear escaping down. This, BTW, is not enough on its own--the object's "yum()" may not operate correctly under circumstances in which double-encoding has occurred. The goal here is consistency. 

Data Marking

Data formatting can not always be accomplished in one atomic operation, from the system's perspective. For instance, controller code may understand how to conduct certain input validation operations whereas only model code can properly encode and escape appropriately for storage. In these instances, components may mark data so that downstream processing understands the current state of format. Common methods for markup include:
  • Use of invalid characters (where possible) (e.g. marked_hex = 0x0126898U in which 'U' might indicate unfiltered)
  • Marks placed beyond range (e.g. $2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu in which '$' separates metadata)
  • Markup languages (XML, JSON)
  • Container types (e.g. class SafeString)
Situations in which untrusted source provide data introduce the opportunity for exploit. For instance, there's no reason the attacker has to match the metadata to the actual format of the underlying data. When operations are then applied unexpected results (collisions, double-encodings, etc.) may result to the attacker's benefit.

Decouple UUID from Freeform Text

The difficulties of getting freeform text formatting and storage correct (and an inability to force other developers' and partners' developers to use the types in which this text has been encapsulated rather than nekked strings has led me--in every case I can remember--to decouple UIDs from the username. Using UIDs as the basis for identification, equality, indexing/hashing, and other operations has other benefits (speed, ease) as well.

There are many ways to create a UID. In python, the easiest way is perhaps uuid.uuid4() (see docs). UUID4, defined by RFC 4122,  is random. In some circumstances, however, developers want distributed parties to be able to compute the uuid where it's not provided. In these circumstances uuid.uuid3()may be appropriate. In the code provided with this blog entry, uuid3 is implemented "manually" for demonstrative purposes.

When encapsulating UID and username within a type, strongly consider overloading pertinent operators (such as __eq__()and __ne__()) so that other developers' conditionals don't surprise them with odd behavior:

 def __eq__(self, other):
        if (isinstance(other, UserName)):
            if (self.UID == other.UID):
                return True
        else:
            return NotImplemented
        return False

Also consider overriding __str__(). To be quite fancy, disable accessor / mutator methods to keep encapsulation and symmetric data formatting behaviors intact. **Note: Python's faux object orientation means that these steps can ultimately be subverted by a  clever developer.

Considerations

Though this entry has gotten more concrete than the good start provided by Spotify Labs' entry, considerations remain.

Homographic Attack

In properly supporting display of fancy unicode characters, the system exposes users to homographic attack. Consider the following:
An Ohm --> 'Ω' := \u2126
Is the same character as another

An Omega --> 'Ω' := \u03A9
By overloading __str__() to return both the text and UID we can provide the programmer some visual cues:

[Ω bf59c742252007241ceddb9029d6baef] ?= [Ω a744dcc2e949c8bfc8bff101f9df38f9] --> False

Note that the uname.UID handles are different. We should instruct view components to propagate such cues to their users. One way to do this, for user names, might be to show the user the visual form (unicode) but maintain a list of previously seen user names. When Omega shows up, having only seen Ohm, the UI might say, "jOHN, [NEW] User Ω wants to connect with you. You've not seen Ω before." This is not fool proof, but can be done cleverly. 

Pickles

Where you're able to build a strong type, I recommend baking pickling to popular formats into that object. You can also strongly associate custom externalization with common APIs (like Java's Externalizable interface). For instance, mindful that JSON has strong feelings about unicode characters, we'd want to make sure that the UName class supported jsonpickle (and pickle for that matter) before we considered it done. 
import jsonpicklem = UName('Ω')m.text_nameu'\u03a9'jsonpickle.encode(m)'{"py/object": "__main__.UName", "text_name": "\\u03a9", "user_number": "0x7", "UID": "a744dcc2e949c8bfc8bff101f9df38f9"}'
As you can see from the above example, failure to do so results in double escaping.

UTF-8 and Source Files

Common guidance on forums (such as ActiveState and Stackoverflow) indicate that converting unicode to UTF-8 solves the challenges presented here. It may, but not always. Consider:
str1 = some_unicode_str.encode('utf8')
This only works in certain circumstances. For instance the following line can be stored UTF-8 (say, in a Python source file):
   u2 = UName('ᴮᴵᴳᴮᴵᴿᴰ')
Whereas if you try to store my aforementioned Omega and Ohm symbols:
 h1 = UName('Ω') #u'\u2126' h2 = UName('Ω') #u'\u03A9'
Both the above come back as \u03A9 when you put them back in unicode format. In essence, UTF-8 encoding the source file destroys data. To help solve this problem, preface your python source with:
# -*- coding: utf-8 -*-from __future__ import unicode_literals
import unicodedata 
In closing, Python 2.x sure does have trouble with unicode eh?

Download

Find demonstration material for this entry here: username on Google code. Check this out using:

svn checkout http://code-poetic.googlecode.com/svn/trunk/python/username username_demo

Tuesday, March 1, 2011

Ensuring Super Class Initialization

Seemingly a very simple concept: how do you guarantee that when someone sub-classes your Python class that your constructor ( __init__() ) runs? The straightforward method contains a trap:
  

class Base(object):  
   def __init__(self):
      print "foo"

class Sub(Base):
   def __init__(self): 
      print "bar"
When the Sub class' initialize method is executed it squashes the Base class'. What we want is for both to be called:

  
class Sub(Base):
   def __init__(self):
      Base.__init__(self)
      print "bar"


Unfortunately, we can't make the person extending our class call the super class. A poor man's factory pattern  alleviates the possibility of subtype implementers forgetting initialization. Consider:

class AbstractClass(object):
    '''Abstract base class template, implementing factory pattern through 
       use of the __new__() initializer. Factory method supports trivial, 
       argumented, & keyword argument constructors of arbitrary length.'''

   __slots__ = ["baseProperty"]
   '''Slots define [template] abstract class attributes. No instance
       __dict__ will be present unless subclasses create it through 
       implicit attribute definition in __init__() '''

   def __new__(cls, *args, **kwargs):
       '''Factory method for base/subtype creation. Simply creates an
       (new-style class) object instance and sets a base property. '''
       instance = object.__new__(cls)

       instance.baseProperty = "Thingee"
       return instance

This base class can be extended trivially, using only three (3) lines of code san-commment, as follows:
class Sub(AbstractClass):
   '''Subtype template implements AbstractClass base type and adds
      its own 'foo' attribute. Note (though poor style, that __slots__
      and __dict__ style attributes may be mixed.'''

   def __init__(self):
       '''Subtype initializer. Sets 'foo' attribute. '''
       self.foo = "bar"


Note that though we didn't call the super-class' constructor, the baseProperty will be initialized:


Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from TestFactory import *
>>> s = Sub()
>>> s.foo
'bar'
>>> s.baseProperty
'Thingee'
>>> 

As its comment indicates, the base class AbstractClass need not use slots, it could just as easily 'implicitly' define attributes by setting them in its new() initializer. For instance:

instance.otherBaseProperty = "Thingee2"

would work fine. Also note that the base class' initializer supports trivial (no-arg) initializers in its subtypes, as well as variable-length arugmented and keyword argument initializers. I recommend always using this form as it doesn't impose syntax in the simplest (trivial constructor) case but allows for the more complex functionality without imposing maintenance.

Monday, January 31, 2011

Pythonic Extrinsic Visitor

Regardless of the language chosen to express it, the Visitor pattern remains one of the more basic behavioral patterns described by the "Gang of Four's" Design Patterns Book. Herein I describe a Pythonic Visitor implementation that overcomes common language-based limitations of C, C++, Java, and C# Visitors and present a usable implementation that goes well beyond available sample/snippets. In particular, the implementation provided addresses two commonly omitted but crucial features "left as an exercise to the reader": correct method resolution/dispatch of types within an inheritance hierarchy and caching method resolution results for performance's sake. More on those and other features later.

Challenges
Implementing a visitor pattern in practice poses challenges:
  1. Applying visitors to existing type hierarchies demands refactoring potentially large sets of visitors'  visited types to include accept() method
  2. Refactoring described by the previous bullet, in-effect, couples the visitor and visited hierarchy through the visitor accept()/ visit() interface. Additions to the visited type tree hierarchy thus commonly demand commensurate changes to the visitor API and potentially its implementation
  3. Behavior of language-specific polymorphic dispatch (in this case double-dispatch) often elude both junior and senior programmers alike. 
Visitors take different forms, actually. Spurred by problems resulting from bullets #1 and #2 above, some design and implement Extrinsic Visitors. Informally, an extrinsic visitor makes decisions about how to 'visit' a visitable object without relying on double-dispatch. Put another way: the visitor need not call a method on the target visitable object itself to 'reclaim' its specific type before selecting the applicable visit() method.

Challenges to Extrinsic Visitation
When Developers choose languages such as Java or C# to implement an extrinsic visitor they must rely on reflection or introspection (respectively) in order to reclaim the visitable object's specific type without invoking a method of that object's instance. During code review, I've observed sloppy implementations revert to a switch/case statement failure mode--again introducing brittle dependence between the visitor implementation and the visited type hierarchy (#2 above). Bullet #3 (above) remains a problem, but challenges represented by bullets one and two are replaced with the following:
 
  1. Sandbox security policy (Applets, etc.) may prevent use of reflection/introspection in some environments
  2. Reflection/introspection can cause a costly speed hit depending on the language, and VM implementation (not nearly as much a problem with modern Java 1.5+)
  3. Depending on visitor design, visitor API may remain rigidly tied to visited hierarchy demanding an explicit concrete method for each visitable type (as in #2 above)
Search for Extrinsic Visitor in Python and Google will return plenty of results. Often, implementations provided express the problems described by the bullets above or leave their solutions as exercises to the reader. Unfortunately, many programmers I encounter simply do not possess the familiarity with Python's object model or method-resolution mechanics (mro) to quickly accomplish the goal. Having fought my own lack of Python knowledge (hard to un-seat understanding of Java), I can sympathize.

Pythonic Extrinsic Visitor: Design
The design of my Pythonic visitor bear the following properties:
  • Extrinsic
    • The visitor and its sub-classes require no invocation of visitable type methods (double-dispatch) in order to reclaim specific target type
    • The visitable type requires no refactoring (IE: no accept() method necessary)
  • The visitor employs Python mro for "C++-style" best-match visit() method dispatch
  • The first method resolution result is cached to avoid hit on subsequent all calls to the visit() method for a particular type
  • Support for NoneType (IE: visitor.visit(None))
Sections below describe each design element above in turn.

Extrinsic
At its heart, the visitor class has only two public methods:
  • def visit(self, visitable)
  • def defaultVisitor(self, visitable)
Along with the class' constructor, these make up the visitor API. As with any extrinsic visitor, developers enjoy the follow the following simplified calling convention:

fruit = grocerer.pick("Banana") 
visitor = MyVisitor()
visitor.visit(fruit)
Snippet - A - Simplest Visitor Test Driver

Presume here that grocer.pick() represents a factory method capable of returning a variety of fruit sub-types based on their string name. The defaultVisitor() method provides developers the ability to specify a 'catch-all' visitation function, capable of handling any typed passed to visit() as its parameter. Developers simply override the  defaultVisitor() method in their subclass. Thus, the following represents a "Hello World" attempt at extending the visitor provided:
class MyVisitor(visitor):
    def defaultVisitor(self, visitable):
         print '%s' %(visitable)
The visitor base class provides a defacto implementation for the defaultVisitor() method so Developers need not override. The defacto implementation throws TypeError when the implementation's method resolution can not resolve a specific handler match.

The key feature of this visitor implementation is that those extending only need to implement those visit (handler) methods they desire explicitly. Developers do this by adding methods to their subclass in the form:
def visit(self, visitable): pass
So,  from the example above, a Developer might choose to do the following:


class MyVisitor(visitor):
    def visitBanana(self, visitable):
         print '%s' %(visitable) 

Because this visitor employs a 'best match' dispatch, the following would have the same effect in our test driver captioned "Snippet A".

class MyVisitor(visitor):
    def visitFruit(self, visitable):
         print '%s' %(visitable)

More on how the implementation accomplishes this in the next section.

Best Match
The best match property is the key to overcoming brittle coupling and tedious visitor implementation update. As designed, the visitor described herein employs a lookupVisitorMethod() method to accomplish this. Roughly, the algorithm employed follows:

def visit(self, visitable):
    clazz = visitable.__class__

    for _type in clazz.mro():
       handler =  'visit%s' %(_type.__name__)
       visit_method = getattr(self, handler, None)
       if not (visit_method is None):
          return visit_method(visitable)
    return visit_method(visitable)

Roughly, this algorithm iterates through the visitable objects' class hierarchy and looks for a method in the visitor object named visit(). Remember, mro() is an ordered list (Specific to most generic).

Caching
Caching method resolution proved the most difficult task in creating the presented implementation. The critical decision: what scope in which to place the method resolution cache. Placed at a high level (class object, module scope, or similar) the performance will increase but the caching scheme need be more advanced to handle circumstances in which multiple visitor sub-classes visit overlapping visitable type trees. Seeking to avoid this complexity, I chose to place the method cache in the visitor object instance.

def visit(self, visitable):     clazz = visitable.__class__                         try:         visit_method = self._method_map_cache[clazz]     except KeyError:         visit_method = self._lookupMethod(visitable, clazz)     return visit_method(visitable)                         
def _lookupMethod(self, target, clazz):    
   for _type in clazz.mro(): 
      handler =  'visit%s' %(_type.__name__) 
      visit_method = getattr(self, handler, None)
      if not (visit_method is None):          self._method_map_cache[clazz] = visit_method          return visit_method    self._method_map_cache[clazz] = self.defaultVisitor    return self.defaultVisitor
This implementation's scheme performs sufficiently well for even tight-loop operations (such as AST, DOM, and other large tree visitations) but avoids visitor/visited type tree collision. For more information, see the test suite. While not perfect, the implementation strategy above avoids opaque error messages and some more odd stack-frame insertions that complicate debugging (as experienced with other visitor schemes).

NoneType
The dispatching schemes as adopted by this algorithm (and other samples available online)  fall prey to a simple failure mode: a caller passing a None rather than a valid instance (IE: v.visit(None)). Sample code available online almost always omits support for this corner case, despite how common it can be in tree visitations and collection iteration.  This implementation solves the problem of handling None types with a special case prefacing the lookup method:

class dummy_lookup(object):
   def mro():
       return [type(None)]
                       
def _lookupMethod(self, target, clazz):
   if clazz is None:
       clazz = dummy_lookup()

   ...

The effect of the dummy lookup class is to serve the mro() iterator that follows in the lookup method.

Configurable Visit method
The implementation provided provides an addl. feature to avoid name-based collisions for existing visitation logic that an adopter may want to move to the visitor pattern. The implementation's constructor takes a visit_method=  keyword parameter that defines what prefix visitor subclass definitions will use to define specific visit methods. Use this feature when defining your visitor as follows:

class myVisitor(visitor):
    def __init__(self, *args, **kwargs):
        kwargs['visit_method'] = 'handle'
        visitor.__init__(*args, **kwargs)

   def handle(self, visitable): return self.visit(visitable)

   def handleType1(self, visitable): pass
   def handleType2(Self, visitable): pass

This mechanism allows developers to hide the visitor pattern's name from those calling it. Electing to call the visit method by another name (PrettyPrint, handle, ) can be more intuitive.

Download
Download the current code base from its Google Code Repository  or use SVN to do a checkout using URL:

svn checkout http://code-poetic.googlecode.com/svn/trunk/python/lib code-poetic-read-only


Prerequisites
  • Python 2.x

Notes
N/A.