Language Design

Monday, June 22, 2009

The Lambda Calculus

Church’s λ-calculus and Turing’s machine are the two roots of programming as we know it. The grandchildren of the Turing-machine are imperative languages, while their cousins under the λ-calculus are declarative (mainly functional) languages.

The λ-calculus is basically the smallest possible functional programming language. An expression in the calculus is one of three things:

The entirety of the syntax consists of variable names, λ and . for abstractions, and ( and ) for applications. See the Wikipedia article for more.

Zen and the Art of List Processing

Code is data.

When a compiler works its magic, it first turns the source code of a program into a data structure called an abstract syntax tree (AST). (This is an involved process, so some steps are elided, but bear with me.)

The compiler might represent 1 + 2 in the AST as the node + with children 1 and 2. It might represent print "Hello, world." as the node print with the child "Hello, world.". Once it has this data structure constructed, it can decorate it (perhaps with types), rearrange it (perhaps for optimization), and turn it into different forms, eventually the compiled program.

Extending a compiler to do something new with an AST is normally an involved process. GCC (for example) is gigantic, while powerful, and you’ll need to learn quite a bit to modify it in any useful way.

What would be useful is to have the compiler provide an interface to the programmer so they could modify the compiler in their own program. This way, you could extend the compiler to recognize new syntax, or optimize code with new data types. This interface would have to use the language’s basic data types, so it could be available without a library. Lisp takes this to the logical conclusion.

Lisp is the λ-calculus with hooks into the compiler.

From the reader (which turns code into data) to the evaluator (which executes data as a program), all is available to the Lisp programmer. Code can write code purely by constructing data types before the evaluator acts on them. The reader is available for the programmer to extend. This is the Lisp nature.

For this sort of power, it makes sense to write code that looks like the AST. If the tree with + at the root and 1 and 2 as the children is what you want, write (+ 1 2). If you want to create a tree with print at the root and "Hello world." as its child, write (print "Hello, world."). Perhaps an if-statement has three children; the predicate, the true-clause, and the false-clause. You could write this as (if predicate true-clause false-clause).

Will it Blend?

If we were to create an abstract syntax tree from the λ-calculus, how would we construct it?

Well, that was simple. Oh, but this means that (+ 1 2) doesn’t work, because abstractions can only handle one argument and applications match. We’ll make a minor adjustment:

Also, we should probably extend the reader a little so it reads lambda as λ for people who can’t easily type in Greek. There we go, and code is data.

Well, I guess not quite.

Semantics

If code is data, we need a way of storing these things in the language itself. When the reader reads (M x y z), what does it store? It looks like we have two data types:

For convenience, we’ll add one more type: Numbers. The λ-calculus can simulate numbers just fine, using Church encoding, but it’s somewhat difficult to read and slow to compute with, so we’ll just use Arabic numerals.

What about the evaluator? When it sees (λ (x y) M), what should it produce? We need another type, the λ. Then when the evaluator sees ((λ (x y) M) 1 2), it can evaluate the root node, producing a λ, then evaluate 1, producing 1, then evaluate 2, producing 2, and then apply the λ to the two arguments.

Also, if we’re going to make this accessible to the programmer, we should probably give these actors names. The reader can be read, the evaluator eval, and the actor that applies λs to arguments can be apply. Oh, and we need something that can write data back out – write.

RE: Why Subversion does not suck

Thursday, December 11, 2008

(This is a response to Andy Singleton’s ”Why Subversion does not suck” blog post.)

The problem I have with this line of reasoning is that it assumes that DVCS must be more complicated than its centralized sibling.

This is just not true.

Let’s compare and contrast workflow of Subversion and Mercurial (which I know better than Git). We’ll even assume that a repository has already been created for you.

Setting up your local copy of the repository? svn checkout _url_ or hg clone _url_.

Updating? svn update or hg pull.

Committing work? svn commit (unless you’re off-line, then you just stop) or hg commit; hg push (unless you’re off-line, then you wait and batch your commits).

Changes to the filesystem? svn {add,delete,copy,move} == hg {add,remove,copy,rename}.

Want to see what has changed? {svn,hg} {status,diff}.

Ok, let’s get complicated. You want to set up a new repository for your company.

Subversion:

$ svnadmin create /usr/local/svn/newrepos
$ svn import mytree file:///usr/local/svn/newrepos/some/project

This is straight out of the docs, by the way. While you’re doing that, read over Chapter 5. Repository Administration and Chapter 6. Server Configuration to make sure you can keep it working.

The same with Mercurial? Get into the code directory you want to work with, and:

$ hg init && hg addremove

This will put all these files in the repository, so you could instead hg add those you’re interested in, but the former command would be Subversion’s create/import pair. Share with ssh, or put it on a shared directory, or whatever. You can even get the nice http interface by running hg serve (a good way to start, and you can ramp up later if you want).

Oh, and if it’s the graphical interface you want, Tortoise makes the same thing for Mercurial, as well: TortoiseHG (See also the Other Tools page for even more cool GUIs.)

As I said, I’m more familiar with Mercurial, so I’m not sure if all this holds for Git, but from what I’ve heard, it does. What about documentation? The Mercurial book is just as good as the Subversion one, IMHO.

tldr? Mercurial is as simple to learn (if not simpler) than Subversion, and is easier to play around with (hg init > svnadmin create).

Simple Python Scanner

Tuesday, October 14, 2008

"""
Scanner: match text to generate tokens.
Adam Blinkinsop <blinks@acm.org>

First, construct a scanner with the tokens you'd like to match described as
keyword arguments, using Python-syntax regular expressions.
WARNING: Group syntax in these expressions has an undefined effect.

>>> simple = Scan(ID=r'\w+')

You can now use this object to generate tokens by calling it with one or more
strings.

>>> tokens = list(simple('hello'))
>>> len(tokens), tokens[0]
(1, Token("ID -> 'hello'"))
>>> print simple
ID -> \\w+

Characters that don't match will raise an exception with the location of the
error (see http://www.gnu.org/prep/standards/html_node/Errors.html).

>>> tokens = list(simple('hello world'))
Traceback (most recent call last):
  ...
UnrecognizedCharacter: 1.5: couldn't match ' '
>>> len(tokens), tokens[0]
(1, Token("ID -> 'hello'"))

>>> simple = Scan(ID=r'\w+', SPACE=r'\s+')
>>> list(simple('hello world'))
[Token("ID -> 'hello'"), Token("SPACE -> ' '"), Token("ID -> 'world'")]

You can also ignore tokens to keep them from being generated.

>>> simple.ignore('SPACE')
>>> list(simple('hello world'))
[Token("ID -> 'hello'"), Token("ID -> 'world'")]

While this scanner doesn't keep track of lines 
"""
import re


class Error(Exception):
  """The generic error class for this module."""

class UnrecognizedCharacter(Error):
  """Position `pos` in the text doesn't match any tokens."""
  def __init__(self, string, start, stop=None):
    self.value = (string, start, stop)
    self.char = string[start:stop]
    self.span = span_of(string, start, stop)

  def __str__(self):
    return "%s: couldn't match %r" % (self.span, self.char)


class Token(object):
  """A token matched in a string."""
  def __init__(self, m):
    self.value = m.group()
    self.start, self.end = m.span()
    self.span = (self.start, self.end)
    self.pos = m.pos
    self.token = m.lastgroup
    self.string = m.string

  def __repr__(self):
    return 'Token(%r)' % str(self)

  def __str__(self):
    return '%s -> %r' % (self.token, self.value)
  

def span_of(string, start, stop):
  """Return a string representing the position of this slice."""
  def column_of(p):
    line_start = string.rfind('\n', 0, p)
    if line_start == -1:  return p
    else:  return p - line_start
  stline, stcol = string.count('\n', 0, start) + 1, column_of(start)
  loc = '%i.%i' % (stline, stcol)
  if stop > start + 1 and string.count('\n', start, stop) > 0:
    col = column_of(stop)
    return loc + '-%i.%i' % (
        stline + string.count('\n', start, stop), col)
  elif stop > start + 1:
    col = column_of(stop)
    return loc + '-%i' % (col)
  else:
    return loc

class Scan(object):
  """A scanner for a particular set of tokens (defined as keyword args)."""
  def __init__(self, **tokens):
    self.tokens = tokens
    self.__compile()
    self.ignores = set()

  def __call__(self, *args, **opts):
    """Call on a string (or a list of strings) to generate tokens."""
    # Start at the beginning of this text.
    text, pos = ''.join(args), 0
    while pos < len(text):
      # Match the text, looking for the next token.
      m = self.regex.match(text, pos)
      if m is None:
        # No token was found, raise an error.
        raise UnrecognizedCharacter(text, pos, pos + 1)
      elif m.lastgroup in self.ignores:
        # An ignored token was found, continue without yielding.
        pos = m.end()
        continue
      else:
        # Found a token; yield its name and the text it matched.
        yield Token(m)
        pos = m.end()

  def __str__(self):
    return '\n'.join('%s -> %s' % (k, v) for (k, v) in self.tokens.items())

  def __compile(self):
    """Compile the dict of tokens into a regular expression."""
    self.regex = re.compile('|'.join(
      '(?P<%s>%s)' % (k, self.tokens[k]) for k in self.tokens))

  def update(self, **tokens):
    """Update the tokens matched with token=regex pairs."""
    self.tokens.update(tokens)
    self.__compile()
  
  def ignore(self, key):
    """Ignore a particular token when it is found."""
    self.ignores.add(key)

  def unignore(self, key):
    """Remove a token from the ignores list."""
    self.ignores.discard(key)


if __name__ == '__main__':
  import doctest
  doctest.testmod()

Rolling Dice

Monday, July 21, 2008

When you don’t have a set of dice lying around, or when 20d6+2d4 just takes too long to add up:

#!/usr/bin/env python
import logging
import random
import re
import sys

ROLL = re.compile(r'(\d*)d(\d+)')

def roll(dice):
  """Randomly generate the result of a dice roll.
  
  >>> 1 <= roll('d6') <= 6
  True
  >>> 4 <= roll('4d6') <= 24
  True
  >>> 2 <= roll('1d10+1d4') <= 14
  True
  >>> 2+1-1 <= roll('2d8+1d6-1') <= 16+6-1
  True
  """
  logging.debug("roll(%r)", dice)
  def _roll(m):
    x = m.group(1) and int(m.group(1)) or 1
    y = int(m.group(2))
    return '%d' % (x * random.randint(1, y))
  postroll = ROLL.sub(_roll, dice)
  logging.debug("roll(%r) -> %r", dice, postroll)
  
  value = eval(postroll, {}, {})
  logging.debug("eval(%r) -> %r", postroll, value)
  return value

def argz():
  """Define the arguments to this script."""
  from optparse import OptionParser
  parser = OptionParser()
  parser.set_defaults(log_level=logging.WARNING)
  
  parser.add_option('-d', '--debug',
    action='store_const', const=logging.DEBUG, dest='log_level')
  parser.add_option('-v', '--verbose',
    action='store_const', const=logging.INFO, dest='log_level')
  parser.add_option('-q', '--quiet',
    action='store_const', const=logging.ERROR, dest='log_level')
  
  parser.add_option('--test',
    action='store_true', dest='test', default=False,
    help="Run the unit tests.")
  
  parser.add_option('--stat',
    action='store_true', dest='stat', default=False,
    help="Roll 4d6, drop the lowest (for ability scores).")
  
  (options, args) = parser.parse_args()
  
  logging.basicConfig(level=options.log_level)
  
  if options.stat:
    dice = [roll('d6') for x in xrange(4)]
    logging.info("rolled %r for ability", dice)
    dice = sorted(dice)[1:]
    logging.debug("kept %r", dice)
    stat = sum(dice)
    print "%d (%+01d)" % (stat, (stat - 10) / 2)
    sys.exit(0)

  if options.test or len(args) == 0:
    import doctest
    doctest.testmod()
    sys.exit(0)
  
  return (options, args)

if __name__ == '__main__':
  (options, args) = argz()
  for dice in args:
    print roll(dice)

You can even roll stats!

$ for stat in STR DEX CON WIS INT CHA EXT
> do ./roll.py --stat
> done | sort -n | tail -n6
13 (+1)
14 (+2)
15 (+2)
16 (+3)
17 (+3)
17 (+3)

Min-Maxing Mastermind

Monday, September 10, 2007

In Mastermind, the CodeMaker determines a sequence of four pegs that the CodeBreaker must deduce. The codebreaker gets ten guesses, but each guess is followed by a clue from the codemaker, letting the former know how many pegs are the right color, and how many of the right color are in the right position.

The maker gives this information in the form of up to four pegs. A red peg means that the breaker got one color in the right position, but provides no further information. A white peg only means that one color is in the wrong position. No pegs means that no colors match the code.

data Peg = Red | Green | Blue | White | Yellow | Orange
  deriving (Eq, Ord, Show)
data Code = Code [Peg] deriving Show

There are 1296 (6*6*6*6) possible codes, where order does matter.

code_space = [Code [a,b,c,d] | a <- ps, b <- ps, c <- ps, d <- ps]
  where ps = [Red, Green, Blue, White, Yellow, Orange]

In response to a guess, the codemaker places some number of red and white pegs into the board. The red pegs correspond to the number of colors that match position between the breaker’s guess and the maker’s code. The white pegs correspond to colors that are in both codes, but in different positions.

diff (Code x) (Code y) = Respond (reds, whites)
        -- Reds gives the number of slots that match between two codes.
  where reds = length [t | t <- zip x y, fst t == snd t]
        -- Whites gives the number of remaining colors in wrong slots.
        whites = (length $ Main.intersect x y) - reds

Note: computing the number of white pegs to return requires an interesting intersect function. It should preserve duplicates, but only so far as they exist in both lists. The intersect function in the Data.List module is broken for this purpose. intersect [2,2,3] [1,5,2] returns [2,2], but the lists only share a single 2. I had to roll my own, which wasn’t too difficult.

intersect x y = intersect' (sort x) (sort y)
intersect' [] _ = []
intersect' _ [] = []
intersect' (x:xs) (y:ys)
  | (x == y) = x : (intersect' xs ys)
  | (x < y) = intersect' xs (y:ys)
  | (x > y) = intersect' ys (x:xs)

Edit: On Reddit, I was asked the following:

It isn’t entirely your fault as the standard library seems to have a similar problem but shouldn’t intersect be a set operation?

Isn’t the reason that it is an “interesting” intersect function the fact that it isn’t an intersect operation in the traditional mathematical sense of the word at all?

Yes, intersect should be a set operation (that is, removing duplicates), in the normal sense of the word.

Since the Data.List version of intersect isn’t a “set intersection,” and the documentation notes that “if the first list contains duplicates, so will the result,” I assumed it would work in a non-surprising way; the way I ended up implementing it (see below). The intersection of two sets returns only elements existing in both sets. The intersection of two lists, therefore, should return elements existing in both lists.

The way that the library actually implements it is very surprising — intersect as implemented in Data.List isn’t commutative:

intersect [2,2,4] [1,2,3] => [2,2]
intersect [1,2,3] [2,2,4] => [2]

That is why it’s “interesting” to me: the principle of least surprise has no place in that implementation.


There are 13 possible responses to a guess, where order doesn’t matter:

data Response = Respond (Int, Int) deriving Eq

This means that each guess is either correct, or leads to one of 13 different sets of remaining guesses, depending on the response of the maker. The optimal solution is to always guess the code that will lead to the smallest remaining set. That is, if I can guess code A or code B, and A will lead to at most 50 codes, B to at most 75, I should guess code A.

The simplest way to choose a guess, then, is to brute-force determine this information from what I know about the remaining codes in the solution space. This way of brute-forcing is also used to solve most other games, like chess or checkers. The idea is to create a game tree, with all possible moves for the current player followed by all possible moves from that state for the opposing player, and so forth. Choose the move that is guaranteed to put the current player in a good final state, and do it.

In Mastermind, these moves are guesses and responses. A guess is a choice of one code. When no clues are given, this code is chosen from the entire solution space.

space `when` [] = space

Of course, each response given by the codemaker cuts this space down — the only remaining codes must give the same response as the maker gave.

space `when` ((code, response):rest) =
  [c | c <- space, diff c code == response] `when` rest

Since each code partitions the solution space into 13 or so different groups…

space `pivot_on` code =
  [space `when` [(code, response)] | response <- all_responses]
  where all_responses = [Respond (r, w) | r <- [0..4], w <- [0..4]]

…, we want to choose the code with the smallest maximum partition. That is, choose a guess that is guaranteed to reduce the solution space after the maker responds. This is the complicated part. (In comprehension, fortunately not in code.)

choose_from space =
  fst $ minimumBy (comparing snd)
  [(code, maximum $ map (length) (space `pivot_on` code)) | code <- space]

Now, just talk to the user and solve for the code.

solve_with clues = do
  putStrLn ("It looks like there are "
        ++ (show $ length code_space') ++ " possible codes left, "
        ++ "after " ++ (show $ length clues) ++ " clues.  Hmm.")
  putStrLn ("I'll guess " ++ (show best_guess) ++ ".")
  putStrLn "How many colors are in the correct location?"
  reds <- getLine
  if (read reds) == 4 then putStrLn "Woohoo!"
    else do
      putStrLn "How many colors are in incorrect locations?"
      whites <- getLine
      solve_with ((best_guess, Respond (read reds, read whites)):clues)
  where code_space' = code_space `when` clues
        best_guess = choose_from code_space'

Yes, all of this code is legal Haskell. Copied and pasted to a file, it will run just fine, and usually solve codes in two or three guesses (most I’ve gotten to is five).