Tuesday, June 25, 2013

Pitfalls of Unicode Usernames and Solutions in Canonical Form and UID

In a blog entry, Spotify Labs recently shared their experience with the difficulties of supporting unicode usernames and differentiating them from their ASCII homoglyphs. The entry describes, conceptually, how they isolated the problem's cause and worked around it.

The blog entry is worth a read: it does a great job of outlining the problem, tracing symptoms to root cause, and positing conceptual fixes. Key design and implementation specifics, however, remain uncovered, and readers will continue to struggle to build a workable unicode-aware username scheme. This post attempts to tackle those 'last mile' design issues, providing a demonstrative stub implementation and test cases helpful in exploring particulars.


Users may want fancy unicode usernames to differentiate themselves. The Spotify blog provides an example: "ᴮᴵᴳᴮᴵᴿᴰ". ❤❤LuvYa❤❤ might be another. Site owners, on the other hand, want to easily identify unique users and display their usernames in a variety of views. Site owners also want to make sure users can effectively differentiate those they know from those they don't. For instance, user Ohm (Ω) will be indistinguishable from user Omega (Ω). Users want this too--though they may not yet realize it. So, our requirements are:
  • Users may freely select fancy unicode usernames (or other profile elements);
  • The site can easily and consistently identify unique users;
  • Sites can easily display fancy usernames in its many view contexts as intended (instead of escaped);
  • Users can differentiate those users they know from those with which they're not familiar.
Recently, sites appear to be enforcing restrictive username and credential requirements rather than meeting the above challenges. This may be because of the difficulties involved, or perhaps because security consultants and their standards/"best practices" argue that "special characters" are dangerous. Making matters worse for Spotify, they appear to be using Python 2.x, a language version known for its awkward and poor support for unicode.

I find it deplorable when security professionals impact function or performance because they're too lazy to come up with a better solution; security should aim to ease user workflow and help performance where possible. As a user, I vote with my money: I closed my Discover credit account when they locked it because my username used special characters--suddenly falling afoul of their new [WEAKER] security standards.


Solving for the above requirements involves two major components: a) consistent escaping and encoding of input from potentially different sources and b) universally unique identifier (UUID) creation.  Both solution elements end up presenting significant challenges in large enterprise systems--especially those already in place without them. To hedge against difficulty, this entry combines several heuristics and design elements:

  • Format data consistently;
  • Decouple UID and freeform text;
  • Encapsulate data and operations within strong types;
  • Establish boundaries across which guarantees are enforced. 

The hope is that each is straightforward enough to easily complete and that when combined in a mutually reinforcing manner, they create a workable solution.  Let's start, in no particular order, with establishing boundaries with guarantees.

Boundaries with Data Property Guarantees

Where possible, define and enforce boundaries across which components, services, or clients can assert key properties about data's form. Some call these boundaries "trust boundaries". Avoid this term, because it's unclear what "trust" means; be specific about what properties are being asserted across the boundary. Always design assertions around a provider asserting a guarantee rather than a consumer checking one.

Candidate boundaries may be a) data stores, b) web services, c) transaction processors, d) rendering components, and other elements whose behavior or structure are well-specified. From a technology perspective, persistence frameworks, REST APIs, and message buses may represent candidates through which a façade, proxy, or similar design pattern could enforce such a boundary and its guarantees. 

The attractiveness of such boundaries and their guarantees is clear: if we could assert the format and content of data at an interface, we'd not be confused about whether or not it's been escaped yet. We'd know its encoding definitively. As with anything so attractive, achieving this goal is commonly impossible. These boundaries are like a Maginot Line: any break and the guarantee fails. So, inject data-guarantee functionality into technologies that implement helpful patterns (like the façade). Watch for back-doors, chutes, and stovepipes around them. Meticulously document the format and content consumers can expect from these data sources. Now, onto those formatting guarantees.
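As a rough sketch of such a boundary (Python 3, where str is unicode by default; the StoreFacade name and its tiny API are invented for illustration), a storage façade can enforce its guarantee on the way in, so anything coming back out is known to be in canonical form:

```python
import unicodedata

class StoreFacade:
    """Hypothetical façade guaranteeing every stored value is NFC-normalized.

    Consumers on the far side of this boundary may rely on the guarantee
    instead of re-checking (and possibly re-escaping) the data themselves.
    """
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # Enforce the guarantee at the boundary: normalize on the way in.
        self._rows[key] = unicodedata.normalize('NFC', value)

    def get(self, key):
        # Anything leaving the boundary is guaranteed NFC-normalized.
        return self._rows[key]

store = StoreFacade()
store.put('user', 'O\u0308hm')          # 'O' + combining diaeresis
assert store.get('user') == '\u00d6hm'  # composed O-with-diaeresis comes out
```

The point is that the normalization call lives in exactly one place; a back-door write into `_rows` is precisely the kind of chute that breaks the Maginot line.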

Consistent Data Formatting

A few common problems plague data formatting attempts. Some include:

  • Double escaping;
  • Data destruction in transcoding (e.g. data loss between unicode and UTF-8);
  • Unintended interpretation of formatted data in unexpected contexts:
    • e.g. by nested contexts such as HTML enclosing scripts;
    • e.g. by downstream processors such as SQL parsers.

In order to correctly and consistently format data, the design should:

  • Prefer idempotent formatting operations;
  • Consistently apply non-idempotent formatting actions in a specific order-of-operations;
  • Where possible, mark data as to current state of format;
  • Explicitly document input format expectations, transforms, and output format.

Idempotent transforms

Ideally, any transform applied to data will be idempotent. In other words, the transform produces the same output regardless of how many times it's applied to the input in succession, i.e.
  canonize(canonize(x)) == canonize(x)
Failure to use an idempotent transform causes the 'double escape' problem described above. It's easy to see how escaping operations are not idempotent. For example (Python 2):
>>> s = 'Ω'
>>> s2 = s.encode('string-escape')
>>> s
'\xe2\x84\xa6'
>>> s2
'\\xe2\\x84\\xa6'
Each time the encode() function is applied the result receives an additional level of escaping. This results in the following:
>>> print s
Ω
>>> print s2
\xe2\x84\xa6
Luckily, in Python, the decode() function provides what perhaps some developers intend:
>>> print s.decode('string-escape')
Ω
>>> print s2.decode('string-escape')
Ω
The differences above can confuse even experienced developers not familiar with Python's string handling and escaping. In order to address these idiosyncrasies we use python's unicodedata.normalize() (see docs) method:
def canonize(clazz, unwashed):
    # do some stuff ...
    washed = unicodedata.normalize('NFD', unwashed)
    # do other stuff...
    return washed
The above helps us exhibit idempotent behavior.
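A quick check of that property (Python 3 shown for brevity; the canonize() stub here drops the surrounding "stuff" from the version above) confirms idempotence, and as a bonus shows normalization collapsing the OHM SIGN into its canonical equivalent:

```python
import unicodedata

def canonize(unwashed):
    # NFD decomposition is idempotent: applying it again changes nothing.
    return unicodedata.normalize('NFD', unwashed)

s = '\u2126'                   # OHM SIGN
once = canonize(s)
assert once == '\u03a9'        # canonical equivalent: GREEK CAPITAL OMEGA
assert canonize(once) == once  # canonize(canonize(x)) == canonize(x)
```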

Order of Operations

Some of the operations we employ will destroy data (truncate() for instance). For this reason and those discussed previously (escaping), formatting operations do not commute, i.e.
A(B(x)) != B(A(x)) AND escape(trunc(x)) != trunc(escape(x))
Solutions, therefore, must pay very close attention to the order in which operations are applied. In the past, the following order has worked well:
  1. Encode
  2. Substitute
  3. Truncate
  4. Escape
  5. Mark / Decorate
Plenty of reasons exist to change this order. For instance, some systems benefit from applying truncation first to avoid choking on large (malicious) inputs. In these cases, operations may have to be re-applied later because of the effect intermediate operations have (for instance, escaping may add to length after a previous truncation). 
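That ordering might be sketched as a single pipeline (Python 3; the substitution step and the 20-character limit are arbitrary illustrations, not recommendations). Note the final length check: escaping after truncation can push the value past the limit again, so truncation gets re-applied:

```python
import unicodedata

MAX_LEN = 20  # arbitrary storage limit for illustration

def format_field(raw):
    # 1. Encode: settle on one canonical unicode form first.
    value = unicodedata.normalize('NFD', raw)
    # 2. Substitute: strip characters we refuse to store (illustrative).
    value = value.replace('\x00', '')
    # 3. Truncate: bound the length before heavier processing.
    value = value[:MAX_LEN]
    # 4. Escape: non-idempotent, so apply exactly once, after truncation.
    value = value.encode('unicode-escape').decode('ascii')
    # Escaping can lengthen the value, so truncation must be re-applied.
    return value[:MAX_LEN]

print(format_field('\u2126hm'))  # OHM SIGN folds to '\u03a9', then escapes
```

(The re-truncation at the end can itself cut an escape sequence in half, which is exactly why reordering these steps is a design decision and not a detail.)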

In some cases components need to reverse these operations. This is not always possible where a transform has destroyed data (like truncate(), strip(), and their ilk). When reversing operations, apply them in reverse order.  This works well when combined with object- or API-based boundaries:
class MyObject:
  def __init__(self):
     self.__stuff = None
  def getStuff(self):
     return self.__stuff.decode('unicode-escape')
  def setStuff(self, stuff):
     self.__stuff = stuff.encode('unicode-escape')
  def processStuff(self):
     # ... operate on self.__stuff in its escaped form ...
     pass
Encapsulation holds and we manipulate the object's data, escaped, within the type's bounds. When passed across the object's boundary (with the getter) we always tear escaping down. This, BTW, is not enough on its own--the object's processStuff() may not operate correctly under circumstances in which double-encoding has occurred. The goal here is consistency.

Data Marking

Data formatting cannot always be accomplished in one atomic operation, from the system's perspective. For instance, controller code may understand how to conduct certain input validation operations, whereas only model code can properly encode and escape appropriately for storage. In these instances, components may mark data so that downstream processing understands the current state of format. Common methods for markup include:
  • Use of invalid characters (where possible) (e.g. marked_hex = 0x0126898U in which 'U' might indicate unfiltered)
  • Marks placed beyond range (e.g. $2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu in which '$' separates metadata)
  • Markup languages (XML, JSON)
  • Container types (e.g. class SafeString)
Situations in which untrusted sources provide data introduce the opportunity for exploit. For instance, there's no reason an attacker has to match the metadata to the actual format of the underlying data. When operations are then applied, unexpected results (collisions, double-encodings, etc.) may occur to the attacker's benefit.
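The container-type approach might look like the following sketch (Python 3; the MarkedString class and its single flag are invented for illustration). Because the mark travels with the data, a downstream component can consult it instead of guessing:

```python
class MarkedString:
    """Carries a string plus a flag recording whether it has been escaped."""
    def __init__(self, text, escaped=False):
        self.text = text
        self.escaped = escaped

    def escape(self):
        # Refuse to double-escape: the mark makes the operation idempotent.
        if self.escaped:
            return self
        return MarkedString(
            self.text.encode('unicode-escape').decode('ascii'),
            escaped=True)

m = MarkedString('\u2126')
once = m.escape()
twice = once.escape()
assert once.text == '\\u2126'
assert twice.text == once.text  # the mark prevents a second level of escaping
```

Of course, a mark set by an untrusted party is exactly the exploit opportunity described above; marks should only be honored inside a boundary that guarantees who set them.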

Decouple UUID from Freeform Text

The difficulty of getting freeform text formatting and storage correct (and the inability to force other developers--and partners' developers--to use the types in which this text has been encapsulated rather than naked strings) has led me--in every case I can remember--to decouple the UID from the username. Using UIDs as the basis for identification, equality, indexing/hashing, and other operations has other benefits (speed, ease) as well.

There are many ways to create a UID. In python, the easiest way is perhaps uuid.uuid4() (see docs). UUID4, defined by RFC 4122, is random. In some circumstances, however, developers want distributed parties to be able to compute the UUID where it's not provided. In these circumstances uuid.uuid3() may be appropriate. In the code provided with this blog entry, uuid3 is implemented "manually" for demonstrative purposes.
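The distinction in miniature (standard library uuid module; the DNS namespace and hostname here are arbitrary choices for the example):

```python
import uuid

# Random: two calls essentially never collide, but can't be re-derived later.
assert uuid.uuid4() != uuid.uuid4()

# Name-based (MD5 over a namespace + name, per RFC 4122): deterministic,
# so distributed parties can compute the same UID independently.
a = uuid.uuid3(uuid.NAMESPACE_DNS, 'bigbird.example.com')
b = uuid.uuid3(uuid.NAMESPACE_DNS, 'bigbird.example.com')
assert a == b
```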

When encapsulating UID and username within a type, strongly consider overloading pertinent operators (such as __eq__() and __ne__()) so that other developers' conditionals don't surprise them with odd behavior:

 def __eq__(self, other):
        if isinstance(other, UserName):
            return self.UID == other.UID
        return NotImplemented

Also consider overriding __str__(). To be quite fancy, disable accessor / mutator methods to keep encapsulation and symmetric data formatting behaviors intact. **Note: Python's faux object orientation means that these steps can ultimately be subverted by a clever developer.
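Putting those pieces together, such a type might look like this sketch (Python 3; the field names are invented, and an MD5 hex digest of the raw code points stands in for the uuid3-style derivation in the demo code):

```python
import hashlib

class UName:
    def __init__(self, text_name):
        self.text_name = text_name
        # Derive a stable UID from the *raw* code points, so Ohm and Omega
        # (canonically equivalent but distinct characters) get distinct UIDs.
        self.UID = hashlib.md5(text_name.encode('utf-8')).hexdigest()

    def __eq__(self, other):
        if isinstance(other, UName):
            return self.UID == other.UID
        return NotImplemented

    def __str__(self):
        # Surface both forms so log lines and debuggers show the difference.
        return '[%s %s]' % (self.text_name, self.UID)

ohm, omega = UName('\u2126'), UName('\u03a9')
assert ohm != omega              # visually identical names, different UIDs
assert str(ohm) != str(omega)
```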


Though this entry has gotten more concrete than the good start provided by Spotify Labs' entry, several considerations remain.

Homographic Attack

In properly supporting display of fancy unicode characters, the system exposes users to homographic attack. Consider the following:
An Ohm --> 'Ω' := \u2126
is visually indistinguishable from a different character:

An Omega --> 'Ω' := \u03A9
By overloading __str__() to return both the text and UID we can provide the programmer some visual cues:

[Ω bf59c742252007241ceddb9029d6baef] ?= [Ω a744dcc2e949c8bfc8bff101f9df38f9] --> False

Note that the uname.UID handles are different. We should instruct view components to propagate such cues to their users. One way to do this, for user names, might be to show the user the visual form (unicode) but maintain a list of previously seen user names. When Omega shows up, having only seen Ohm, the UI might say, "John, [NEW] user Ω wants to connect with you. You've not seen Ω before." This is not foolproof, but can be done cleverly.
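One hedged sketch of that cue (Python 3; the seen-names dictionary and check_contact() are invented for illustration): compare canonical forms, and warn when a new raw name collides with a previously seen lookalike. NFD happens to fold the OHM SIGN onto Omega, catching this particular pair; a fuller solution would consult a confusables table such as Unicode TR39's:

```python
import unicodedata

def canonical(name):
    return unicodedata.normalize('NFD', name)

seen = {}  # canonical form -> raw form first seen

def check_contact(name):
    key = canonical(name)
    if key in seen and seen[key] != name:
        # Same canonical form, different raw code points: a lookalike.
        return "warning: %r looks like known user %r" % (name, seen[key])
    seen.setdefault(key, name)
    return "ok"

assert check_contact('\u2126') == 'ok'                # Ohm, seen first
assert check_contact('\u03a9').startswith('warning')  # Omega: Ohm homograph
```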


Where you're able to build a strong type, I recommend baking pickling to popular formats into that object. You can also strongly associate custom externalization with common APIs (like Java's Externalizable interface). For instance, mindful that JSON has strong feelings about unicode characters, we'd want to make sure that the UName class supported jsonpickle (and pickle for that matter) before we considered it done. 
>>> import jsonpickle
>>> m = UName('Ω')
>>> m.text_name
u'\u03a9'
>>> jsonpickle.encode(m)
'{"py/object": "__main__.UName", "text_name": "\\u03a9", "user_number": "0x7", "UID": "a744dcc2e949c8bfc8bff101f9df38f9"}'
As you can see from the above example, failure to do so results in double escaping.

UTF-8 and Source Files

Common guidance on forums (such as ActiveState and Stackoverflow) indicates that converting unicode to UTF-8 solves the challenges presented here. It may, but not always. Consider:
str1 = some_unicode_str.encode('utf8')
This only works in certain circumstances. For instance, the following line can be stored as UTF-8 (say, in a Python source file):
   u2 = UName('ᴮᴵᴳᴮᴵᴿᴰ')
Whereas if you try to store my aforementioned Omega and Ohm symbols:
 h1 = UName('Ω') #u'\u2126'
 h2 = UName('Ω') #u'\u03A9'
Both the above come back as \u03A9 when you put them back in unicode format. In essence, UTF-8 encoding the source file destroys data. To help solve this problem, preface your python source with:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
In closing, Python 2.x sure does have trouble with unicode eh?


Find demonstration material for this entry here: username on Google code. Check this out using:

svn checkout http://code-poetic.googlecode.com/svn/trunk/python/username username_demo