Hello Haskellers! We mentioned in the last digest that we'd have just a tiny bit more to say about Parallel Haskell. As promised, here is the completed word of the month on actors and their use in Cloud Haskell. It so happens — what a coincidence! — that Well-Typed's Edsko de Vries has recently published a beta version of the new distributed-process implementation on Hackage. We'd love it if you could give it a try and let us know about any trouble you run into or ways we could improve things. To help push things along a bit, this word of the month will be using the new distributed-process implementation.
Also, have you had a chance to fill out the Parallel Haskell Digest Survey? It's collecting data for another couple of weeks. Anything you can tell us in the survey will inform future efforts in building the Haskell community, so if you've got a couple of minutes before Cloud Haskell Time, head over to
Parallel Haskell Digest Survey
Many thanks!
Word of the month
The word of the month series has given us a chance to survey the arsenal of Haskell parallelism and concurrency constructs:
- some low level foundations (sparks and threads),
- three ways to do parallelism (parallel arrays, strategies, dataflow),
- and some concurrency abstractions (locks, transactions, channels)
The Haskell approach has been to explicitly recognise the vastness of the parallelism/concurrency space, in other words, to provide a multitude of right tools for a multitude of right jobs. Better still, the tools we have are largely interoperable, should we find ourselves with jobs that don't neatly fit into a single category.
The Haskell of 2012 may be in a great place for parallelism and concurrency, but don't think this is the end of the story! What we've seen so far is only a snapshot of the technology as it hurtles through the twenty-tens (How quaint are we, Future Haskeller?). While we can't say what exactly the future will bring, we can look at one of the directions that Haskell might branch into in the coming decade. The series so far has focused on things you might do with a single computer, using parallelism to speed up your software, or using concurrency abstractions to preserve your sanity in the face of non-determinism. But now what if you have more than one computer?
Actors
Our final word of the month is actor. Actors are not specific to distributed programming; they are really more of a low level concurrency abstraction on a par with threads. And they certainly aren't new either. The actor model has been around since the 70s at least, and has been seriously used for distributed programming since the late 80s with Erlang. So what makes an actor an actor? Let's compare them with threads:
| Actor | Thread |
|---|---|
| can create more actors | can create more threads |
| can have private local state | can have private local state |
| has NO shared state (isolated from other actors!) | has limited shared state |
| communicates with other actors via asynchronous message passing | communicates with other threads via shared variables |
The essential difference between actors and threads is the isolation and message passing. There aren't any holes punched into lids here, but you can always shine a message from one jam jar to another, perhaps hoping they send you one of their own. The appeal of actors is thus a kind of simplicity, where avoiding shared state eliminates a class of concurrency bugs by definition, and where each actor can be reasoned about in isolation of its brethren.
This sort of thing may perhaps strike a chord with us functional programmers, and actually, there is quite a bit of actor-related work in Haskell: a handful of packages offering the actor as a concurrency primitive; Martin Sulzmann's multi-headed twist on the model; and Communicating Haskell Processes, which explores an actor-ish cousin known as CSP. Finally, there's Cloud Haskell, which, in explicit homage to Erlang, applies the actor model to distributed programming.
Glimpse of Cloud Haskell
We'll be taking a quick look at Cloud Haskell in this word of the month,
unfortunately with only the most fleeting of glimpses. If squirting
money between bank accounts is the transactional hello world, playing
ping pong must surely be its distributed counterpart. Before working up
to that, we first start with half a hello. The following example creates
three processes — “process” is the Erlang-inspired word for the actor
here — one which receives Ping
messages and just prints them to
screen, one which sends a single Ping
message, and finally one which
fires up the first two processes:
{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Control.Concurrent ( threadDelay )
import Data.Binary
import Data.Typeable

import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport.TCP

-- Serializable (= Binary + Typeable)
data Ping = Ping
  deriving (Typeable)

instance Binary Ping where
  put Ping = putWord8 0
  get      = do { getWord8; return Ping }

server :: ReceivePort Ping -> Process ()
server rPing = do
  Ping <- receiveChan rPing
  liftIO $ putStrLn "Got a ping!"

client :: SendPort Ping -> Process ()
client sPing = sendChan sPing Ping

ignition :: Process ()
ignition = do
  -- start the server
  sPing <- spawnChannelLocal server
  -- start the client
  spawnLocal $ client sPing
  liftIO $ threadDelay 100000 -- wait a while

main :: IO ()
main = do
  Right transport <- createTransport "127.0.0.1" "8080" defaultTCPParameters
  node            <- newLocalNode transport initRemoteTable
  runProcess node ignition
This little package gives us a chance to look at three big pieces of
Cloud Haskell: the Serializable
typeclass, the Process
monad, and
channels.
Serializable
Actors send messages to each other. As programmers, we see the messages
in nice high-level form (eg. Ping
), but somewhere along the way, these
messages are going to have to be encoded to something we can ship around
on a network. Cloud Haskell makes this encoding explicit, but reasonably
convenient at the same time. Things can be messages if they implement
the Serializable
typeclass, which is done indirectly by implementing
Binary
and deriving Typeable
. You won't be starting from scratch, as
implementations are already provided for primitives and some commonly
used data structures.
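
By way of illustration, here is how a slightly richer message type might be made Serializable. The Task type below is hypothetical, not part of the example above; it assumes the same imports and language pragma.

-- A hypothetical richer message, following the same recipe as Ping:
-- write a Binary instance by hand and derive Typeable.
data Task = Task String Int
  deriving (Typeable)

instance Binary Task where
  put (Task name n) = do { put name; put n }
  get               = do { name <- get; n <- get; return (Task name n) }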
Things which don't make sense as messages are deliberately left
unserializable, for example MVar
and TVar
, which are only meaningful
in the context of threads with shared memory. Our Cloud Haskell
program is perfectly free to use these constructs within a process (or
between processes on the same machine; a bit more on that below), just
not to ship them around.
Process
We use “process” to mean “actor” in a similar fashion as Erlang, in
other words nothing nearly so heavy as an operating system process. One
difference from Erlang, however, is that Cloud Haskell allows for both
actor style concurrency and the thread-based approach. The
infrastructure gears you towards using the actor model when talking
across machines, but on the same machine, you could also conveniently do
things the old way. Want to use STM to pass notes between processes?
Fine, just spawn them locally via spawnLocal
and give them a common
TVar
.
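
As a rough sketch of what that might look like (the counter and both little processes here are made up for illustration, and it assumes Control.Concurrent.STM alongside the imports from the example above):

-- Sketch: two locally spawned processes bumping a shared TVar.
sharedCounter :: Process ()
sharedCounter = do
  counter <- liftIO $ newTVarIO (0 :: Int)
  let bump = liftIO $ atomically $ do
        n <- readTVar counter
        writeTVar counter (n + 1)
  spawnLocal bump
  spawnLocal bump
  liftIO $ threadDelay 10000                      -- crude: give them time to run
  total <- liftIO $ atomically $ readTVar counter
  liftIO $ putStrLn ("Counter is now " ++ show total)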
As for the Process
monad, we see again the idea of a special monad for special kinds of
sequencing. Here the idea is that things like sending/receiving messages
or spawning other processes only make
sense for processes, and so you can only do these things in a “process
context”. Process
implements MonadIO
, though, so any input/output
you'd like to do within a process is merely a liftIO
away. Going the
other way, running a process from IO, you would do with the runProcess
function.
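
A minimal sketch of both directions, with a node set up just as in main above (hello and sayHello are illustrative names of our own):

-- IO inside a process via liftIO; a process from IO via runProcess.
hello :: Process ()
hello = liftIO $ putStrLn "hello from inside a process"

sayHello :: LocalNode -> IO ()
sayHello node = runProcess node hello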
Channels
Cloud Haskell provides a notion of channels (somewhat similar to those we introduced in the last word of the month), typed unidirectional pipelines that go from one process to another. Using them is optional (there are simpler ways to bop messages back and forth), but worth trying out for the promise of sending messages only to processes that will understand them. Below is a quick glance at channels in action:
data SendPort a     -- Serializable
data ReceivePort a  -- NOT Serializable

newChan     :: Serializable a => Process (SendPort a, ReceivePort a)
sendChan    :: Serializable a => SendPort a -> a -> Process ()
receiveChan :: Serializable a => ReceivePort a -> Process a
A channel comes with a send and receive port, both of which are
parameterised on the same type variable. Creating a Ping
channel thus
gives a ReceivePort Ping
out of which only Ping
's will ever emerge,
and a SendPort Ping
into which we can only put Ping
's. This looks a
lot more attractive when you work with multiple channels. Replying to
pings with pongs, for example, would require us to create a second
channel with a send and receive port of its own, which means we now have
four ports to juggle! Having the type distinctions makes things a bit
clearer: SendPort Ping vs ReceivePort Ping vs SendPort Pong vs ReceivePort Pong.
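
To make the four-port picture concrete, here is a sketch with a hypothetical Pong message (defined by analogy with Ping) and a hypothetical server that takes one port from each channel:

-- Hypothetical Pong message, by analogy with Ping above.
data Pong = Pong
  deriving (Typeable)

instance Binary Pong where
  put Pong = putWord8 0
  get      = do { getWord8; return Pong }

-- Four distinctly typed ports are in play once a second channel exists;
-- the types stop us from, say, trying to read a Pong off the Ping channel.
pingPongServer :: ReceivePort Ping -> SendPort Pong -> Process ()
pingPongServer rPing sPong = do
  Ping <- receiveChan rPing   -- only Pings can ever come out of here
  sendChan sPong Pong         -- only Pongs can ever go in here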
Finally, it's worth noticing that SendPort
's are themselves
Serializable, meaning that they can be copied and shipped around to
other processes possibly on other computers. This allows a channel to
accept data from more than one place, and also makes for idioms like
including a reply-to SendPort
in your messages. ReceivePort
's on the
other hand are (deliberately) left unserializable, which leaves them tied
to a single computer.
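
A sketch of that reply-to idiom, reusing the hypothetical Pong from the sketch above (again, these definitions are ours rather than part of the example):

-- Illustrative only: a ping that carries the SendPort its reply should come
-- back on.  SendPort is itself Serializable, so the whole message still is.
data PingMsg = PingMsg (SendPort Pong)
  deriving (Typeable)

instance Binary PingMsg where
  put (PingMsg replyTo) = put replyTo
  get                   = do { replyTo <- get; return (PingMsg replyTo) }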
Ping? What happened to Pong?
Our little example was more “hello wo” than “hello world”; we'd only
managed to send a Ping
without even thinking about sending Pong
's
back. Want to try your hand at Cloud Haskell? Here's a great
opportunity!
1. [Easy] Start with a cabal install distributed-process
and make
sure you can run this example. Note that you'll need GHC 7.4.1 or
later for this.
2. [Less easy] Next, add a new Pong
message (as a separate data
type), extending the server to send this message back, and the
client to receive that reply. There are some puzzle pieces to work
through here. How does the server know where to send its replies?
Moreover, how do we keep the server nice and decoupled from the
client? We want it to receive pings from any client, and send a
reply back to the ping'er (and not just some hard-coded client).
Hint: you can solve this without touching ignition or main.
Remember that SendPort is Serializable!
3. [Easy] You now have a single ping/pong interaction. Can you
make the game go back and forth indefinitely (or until the
threadDelay
ends)? Hint: have a look at Control.Monad
; it's not
essential, but it's a bit nicer.
Conclusion
Stepping back from the technology a bit, we have introduced the notion of actors as a concurrency abstraction on a par with threads. While there's nothing that makes them specific to distributed programming, they do seem to fit the problem nicely and have been used to great effect before. Cloud Haskell is one attempt to apply this actor model, taking some of the ideas from Erlang and combining them with Haskell's purity and type system.
You might notice that in a word of the month about distributed
programming, we've kept things on a single machine, alas! Indeed, we
have not been able to do Cloud Haskell justice in this article, but we
have hopefully laid some foundations by introducing some of the basic
layers, Serializable
messages, processes, and channels. To escape from
one-machine-island, we would need to get to grips with two more
concepts, nodes and closures.
Nodes can basically be thought of as separate machines (you could run multiple nodes on the same machine if you wanted to, say for development purposes). This makes for three layers: nodes (machines), which contain processes (actors), which can run any number of threads they want. We saw how processes can communicate by sending each other messages across channels; what we've left out is the crucial detail of what happens when the processes live on different nodes. The good news here is “nothing special”: it's still messages across channels. The bad news is a bit of infrastructural fiddliness in setting up the nodes in the first place, assigning them to roles, and spawning remote processes… for which we need to know about closures.
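
For a small taste before then, here is a sketch of spinning up two nodes within a single program on one machine, say for development. The port numbers are made up, and it otherwise just repeats the setup from main above:

-- Sketch: two nodes in one program, listening on different TCP ports.
-- Each node can then host its own processes.
twoLocalNodes :: IO ()
twoLocalNodes = do
  Right t1 <- createTransport "127.0.0.1" "8081" defaultTCPParameters
  Right t2 <- createTransport "127.0.0.1" "8082" defaultTCPParameters
  node1    <- newLocalNode t1 initRemoteTable
  node2    <- newLocalNode t2 initRemoteTable
  runProcess node1 $ liftIO $ putStrLn "hello from node 1"
  runProcess node2 $ liftIO $ putStrLn "hello from node 2"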
The basic story with closures is that we need to be able to send functions
back and forth in order to do anything really useful with Cloud Haskell,
and to send functions we need to say how they are Serializable
. This
would be easy enough — assume for now that all nodes are running the
same code and just send “run function foo
” style instructions — were
it not for the fact that Haskellers do all sorts of crazy things with
functions all the time (partially applying them, returning them from
other functions…), crazy things that introduce free variables. Expressing
the serializability of function-and-its-free-variables was a source of
furious head-scratching for a while until somebody hit on the old Henry
T. Ford idea: You can have any free variables you want so long as they
are a ByteString.
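
Very roughly, the answer in the Cloud Haskell paper has the shape sketched below. This is a simplified picture of our own rather than the exact definitions in the released package: a Static value stands for something every node can resolve for itself (think of a label naming a known top-level function), and a Closure pairs a static decoder with the encoded free variables of the function being shipped.

import Data.ByteString.Lazy ( ByteString )

-- Sketch only: Static stands for a value all nodes can resolve, e.g. a
-- label into a shared table of known functions.
newtype Static a = Static String

-- A closure is a static "ByteString -> a" decoder plus the serialised
-- environment to apply it to: any free variables you like, so long as
-- they are a ByteString.
data Closure a = Closure (Static (ByteString -> a)) ByteString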
Where to from here? If you're looking for more introductory stuff and
have not already seen it, try Simon Peyton Jones's presentation of Cloud
Haskell to the Scala community
(1h video).
Edsko has been hard at work on the
distributed-process Haddock,
so it's worth checking out when you're ready to roll up your sleeves and
get hacking. It'd be a very good idea to have a look at the
simplelocalnet backend,
which will help you get started with the nitty gritty node management
issues when you start yearning to go distributed. That's the practical
stuff, but don't forget to read the
Cloud Haskell paper
either! The API has some slight differences (for example, ProcessM
has
since been renamed to Process
), but it should be fairly
straightforwardly transferable to the new package. It's likely we'll
need a wider spectrum of documentation to bring more Cloud Haskellers
into the fold (early days, eh?). Hopefully this word of the month will
help you get started, and maybe you'll in turn write a blog post of your own?
Happy Distributed Haskell'ing!