This is the second post in a tutorial series on applying Parsec in historical linguistics. We’ve begun by providing a more formal description of sound change rule grammars and will end by building a full-fledged sound change applier.
In my last post we established a BNF grammar for files which describe sound change rules:
<file> ::= (<phoneme-class-defn> <EOL>)* (<rule> <EOL>)+
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
<rule> ::= <context> ">" <replacement> ["/" <condition>]
<condition> ::= <context>_<context>
<context> ::= (<phoneme> | <phoneme-class>)+
<phoneme> ::= <lowercase-letter>
<phoneme-class> ::= <uppercase-letter>
Before we begin parsing, let’s set up some basic datatypes which can be used to store the parse results.
import Data.Map (Map)
-- A single phoneme.
type Phoneme = Char
-- Phoneme class storage, mapping from a single character ('V', 'A',
-- 'F', etc.) to a collection of phonemes.
type PhonemeClassMap = Map Char [Phoneme]
-- A string of phonemes used to match a given context.
type Context = [Phoneme]
-- A complete sound change rule.
data Rule = Rule { replacement :: [Phoneme],
beforeContext :: Context,
inContext :: Context,
afterContext :: Context }
instance Show Rule where
show (Rule r b i a) = show i ++ " > " ++ show r ++ " / " ++ show b
++ "_" ++ show aReferencing the BNF grammar, we can use these types to build the returns
for our parsers. Let’s start with the simplest Parsec rules,
anyPhoneme and anyPhonemeClass. Any uppercase character in sound
change rules should be interpreted as a phoneme class reference, and any
lowercase character must be a phoneme.
import Text.Parsec
anyPhoneme :: Parsec String () Phoneme
anyPhoneme = lower
anyPhonemeClass :: Parsec String () Char
anyPhonemeClass = upperAs evidenced by the given type annotations, our parsers (for the moment)
will have a stream type of String, a user state type of (), and a
return type that varies based on their purpose.
Our first lift
We need to next build the parser for phoneme class definitions. As a
first try, we could have our parser return a pair of type (Char,
[Phoneme]), matching with the type of a PhonemeClassMap. Let’s start:
phonemeClassDefinition :: Parsec String () (Char, [Phoneme])
phonemeClassDefinition = (,) (anyPhonemeClass) (many1 anyPhoneme)This doesn’t work! What gives?
Let’s look at the type of (,), a tuple constructor:
(,) :: a -> b -> (a, b)And check the type of anyPhonemeClass and many1 anyPhoneme:
anyPhonemeClass :: ParsecT String () Identity Char
many1 anyPhoneme :: ParsecT String () Identity [Phoneme]These have the right types Char and [Phoneme], except they’re
contained within a ParsecT type.
Good news: ParsecT s u m is a functor! This means that we can “lift”
functions into the context defined by the type. Check the type of fmap
and fmap (,):
fmap :: Functor f => (a -> b) -> f a -> f b
fmap (,) :: Functor f => f a -> f (b -> (a, b))You can check that the type of (,) corresponds with the type of fmap
(,). (In fmap’s type signature, b corresponds to b -> (a, b) from
(,)’s type.)
Let’s provide fmap (,) with that first argument f a, where f is
ParsecT String () Identity and a is Char. (This looks like the
type of anyPhonemeClass!)
fmap (,) anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))Great - just as fmap’s signature described, (,) was lifted into the
ParsecT String () Identity context and a was clarified to be a
Char. We can make our expression look a bit nicer by using an infix
alias for fmap from Control.Applicative, <$>:
(,) <$> anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))Applying within a context
Looking at the types, we’re almost there: we want our final parser to
have a return type of (Char, [Phoneme]) and the current parser has a
return type of b -> (Char, b). How can we supply a type b?
The answer comes from the fact that ParsecT s u m is not only a
functor but an applicative functor. This means that we can apply
functions already within the context (like b -> (Char, b)!) to values
within the context (like anyPhoneme!).
This contextual application is invoked by Control.Applicative’s <*>.
Compare its type with the type of $: the only difference is that <*>
operates within a context f.
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
($) :: (a -> b) -> a -> bLet’s apply the lifted and partially applied function from the last
section to anyPhoneme:
anyPhoneme :: ParsecT String () Identity Phoneme
(,) <$> anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))
(,) <$> anyPhonemeClass <*> anyPhoneme
:: ParsecT String () Identity (Char, Phoneme)Close: our return type is (Char, Phoneme). Let’s apply with a many1
anyPhoneme instead, which will produce a parser that accepts one or
more phonemes.
(,) <$> anyPhonemeClass <*> many1 anyPhoneme
:: ParsecT String () Identity (Char, [Phoneme])Great! Our parser returns the proper type. Let’s write the actual implementation of our phoneme class definition rule before continuing:
import Control.Applicative ((<$>), (<*>))
phonemeClassDefinition :: ParsecT String () Identity (Char, [Phoneme])
phonemeClassDefinition = (,) <$> anyPhonemeClass <*> many1 anyPhonemeWe must do a bit of bookkeeping. In the original BNF, we stated that a phoneme class definition was of the form
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
We need to account for the “useless” colon in this expression. It’s
useless in that it contributes nothing to the parse result. Using the
*> function from Control.Applicative, we can consume a ':'
character and discard its result:
import Control.Applicative ((<$>), (<*>), (*>))
phonemeClassDefinition :: ParsecT String () Identity (Char, [Phoneme])
phonemeClassDefinition :: (,) <$> anyPhonemeClass
<*> (char ':' *> many1 anyPhoneme)Modifying user state
There’s one significant problem left with this parser. True, it eats up strings without a problem:
> parse phonemeClassDefinition "" "V:aeiou"
Right ('V', "aeiou")Our problem is that we need to reference these definitions in another parser, specifically the context parser:
<context> ::= (<phoneme> | <phoneme-class>)+
Since Parsec has no idea of what a phoneme class is, when we build this parser we’ll need to identify exactly what we should look for in test words given that we saw “V” or “A” in a rule. How can we have the phoneme class definitions “carry over?”
It’s simple using Parsec’s built-in “user state” feature. (It shows up
in the u in ParsecT s u m a.) Rather than using () as our user
state type, let’s carry along a PhonemeClassMap as state. Each rule’s
type needs to now be redefined (but the implementation for those not
using the state data need not change):
anyPhoneme :: ParsecT String PhonemeClassMap Identity Phoneme
anyPhonemeClass :: ParsecT String PhonemeClassMap Identity Char
phonemeClassDefinition
:: ParsecT String PhonemeClassMap Identity (Char, [Phoneme])In phonemeClassDefinition we’ll need to use Parsec’s modifyState
function:
modifyState :: Monad m => (u -> u) -> ParsecT s u m ()This type annotation does a great job of helping us understand what
exactly happens within the function. Given some user state modifier
(i.e., a function which takes an old user state of type u and creates
a new one), a new parser is yielded which has a user state of type u
and returns nothing.
Now phonemeClassDefinition will return nothing and instead modify the
parser’s state (i.e., add entries to the phoneme class map).
phonemeClassDefinition :: ParsecT String PhonemeClassMap Identity ()We want to modify this map by inserting an entry whose contents will
be equal . We run into a familiar problem, however, since insert was
not built explicitly for use with the ParsecT s u m context:
insert :: a -> b -> Map a b -> Map a bLet’s lift insert into the ParsecT s u m functor:
(<$>) insert :: (Functor f, Ord a) => f a -> f (a1 -> Map a a1
-> Map a a1)
insert <$> anyPhonemeClass
:: ParsecT String PhonemeClassMap Identity (a -> Map Char a
-> Map Char a)Close, like before: we can now provide fmap insert with a first
argument in the ParsecT s u m context, but the a1 in the type
annotation has no concept of context. Using <*> once more, we can fix
the problem:
insert <$> anyPhonemeClass <*> (char ':' *> many1 anyPhoneme)
:: ParsecT String PhonemeClassMap Identity
(Map Char [Phoneme] -> Map Char [Phoneme])Before continuing, let’s give a name to the parser created in this section.
modifier :: ParsecT String PhonemeClassMap Identity
(Map Char [Phoneme] -> Map Char [Phoneme])
modifier = insert <$> anyPhonemeClass <*> many1 anyPhonemeNotice that Map Char [Phoneme] is equivalent to PhonemeClassMap, or
our parser’s user state u. We just did all this work to lift and apply
insert within a context, but now, upon revisiting modifyState’s
type, we see we’ll need to head in the other direction:
phonemeClassDefinition :: ParsecT String u Identity ()
modifier :: ParsecT String u Identity (u -> u)
modifyState :: Monad m => (u -> u) -> ParsecT s u m ()Binding
If we simplify the types here, the next step should be obvious. (This is pseudo-Haskell.)
u :: PhonemeClassMap
f :: ParsecT String u Identity
a :: u -> u
b :: ()
modifier :: f a
modifyState :: Monad m => a -> m bWe need some function that, with an f a and a -> f b, derive an
f b. This sounds just like >>=, the monadic bind operation!
(>>=) :: Monad m => m a -> (a -> m b) -> m b
modifier >>= modifyState :: ParsecT s PhonemeClassMap m ()That’s it – we’ve found our definition for phonemeClassDefinition!
With some reformatting:
phonemeClassDefinition :: ParsecT String PhonemeClassMap Identity ()
phonemeClassDefinition = modifier >>= modifyState
where modifier = insert <$> anyPhonemeClass
<*> defn
defn = char ':' >> many1 anyPhonemeWe’ve finished with the hardest parser of the set. In the next post, we’ll tackle the remaining parsers, most of which are simple combinations of those constructed today.