DraftKings Engineering

Entering Actor Model Nirvana with F# and TPL DataFlow

Anton Moldovan
DraftKings Engineering
12 min read · Sep 20, 2022

This article will talk about how we build stateful systems at DraftKings with F# using the Actor Model design pattern based on TPL (Task Parallel Library) Dataflow.

Table of contents

1. Intro
2. What is Actor Model?
3. Actor Model in .NET
4. Actor Model at DraftKings
5. Poor Man’s Actor Model in F#
6. Get Result from Actor (Ask Pattern)
7. Actor’s Event Stream via IObservable
8. Actor and Blocking Async IO
9. TPL Dataflow Benchmarks
10. Actor Model Nirvana (Actor Model + Data Storage = Actor Database)
11. Summary

Intro

Figure: sports offering propagation

Starting in 2017, our company began to actively build and use stateful API/ETL services for more efficient resource utilization under load, better fault tolerance, and availability. Initially, our use cases were mainly around sports offering propagation. The domain of sports offerings can be presented as a change stream that contains updates for many sporting events and, most importantly, frequent odds changes. These changes should be processed and delivered to users with near-real-time (1–3 sec) latency. For example, to process a game score change, we should find the corresponding game, modify it, and then distribute an updated message to all users watching this game. The catch is that many such changes happen in parallel or require concurrency for faster processing. Hence, our stateful services support concurrent CRUD and atomic Read-Modify-Write operations on in-memory state, which requires careful handling of multithreading and thread synchronization.

In the past, the typical way of handling concurrency in a multithreaded environment was to use the basic thread synchronization primitives: locks/mutexes/semaphores. The main problem with this approach is its complexity, which is hard to control and follow. These low-level constructs can produce extremely complex interactions between the various processes, causing contention and opening the door to deadlocks, race conditions, and similar bugs. Another problem is that this approach doesn’t scale well with the number of cores and can seriously impact performance. Developing concurrent applications requires an approach that departs from conventional software development methods: one that emphasizes concurrency and communication between autonomous system components. The complexity of building multithreaded and concurrent systems leads many developers to use the actor model design pattern.

What is Actor Model?

In short, the actor model defines a system of software components called actors that interact with each other by exchanging messages, producing a system in which data flows through components to meet the functional requirements of the system. An actor is the primitive unit of computation that sends or receives a message and does some computation based on it. It also maintains its state. Actors behave much like objects. They react to messages and return execution when they finish processing the current message. You can find a detailed explanation of the actor model on Wikipedia, and I also recommend checking the first chapter of the book “Applied Akka Patterns”.

The main features of the actor model are:

  • Safe state ownership - nobody except the actor can access or modify its state
  • Non-blocking async message passing - message passing normally avoids blocking (locking); the actor can send a message and continue processing without waiting
  • Processing one message at a time - an important guarantee that gives single-threaded execution and spares the developer many complexities of manual thread synchronization
  • Sequential request processing (FIFO queue) - brings an additional level of control and determinism; it simplifies understanding and predicting the execution flow

Actor Model in .NET

In the .NET world, there are at least a few battle-tested frameworks that allow you to build stateful systems using the actor model. The best known are Akka.NET, Orleans, Proto.Actor, and Dapr. All of these systems support distributed mode, as well as a range of other features such as actor supervision, virtual actors, parent-child hierarchies, reentrancy, distributed transactions, dead letter queues, routers, streams, etc. In addition to these frameworks, .NET has several lightweight libraries that lack many of the listed features but still allow the use of the actor model. For example, F# has built-in support via MailboxProcessor, and there are also the toolkits TPL Dataflow and Channels. TPL Dataflow and Channels are not actor model implementations but rather foundations for writing concurrent programs with the ability to use the actor model design pattern.

Actor Model at DraftKings

In our company, we use the actor model in several projects. The choice depends mainly on the application type, but we often use TPL Dataflow and sometimes the Orleans actor framework. We don’t want to drag Orleans into every project that could benefit from the actor model, for several reasons. First, we always try to think carefully about what the framework offers and what the system needs. Second, in most cases we prefer simplicity: frameworks like Orleans or Akka.NET require studying their abstractions, APIs, and many other hidden details, which usually brings the complexity we want to avoid. For example, in our experience only a few actors are needed to model a stateful ETL service, and for such scenarios we do not need a full-fledged actor framework. Instead, we need something more straightforward and lightweight, like TPL Dataflow. Now, you may ask: how do you scale systems based on TPL Dataflow, since this library doesn’t provide a distributed mode? It’s a good question, since we need to scale out to handle a workload of a hundred thousand messages per second. To scale out the stateful services, we use data partitioning, so that one instance of a service handles a limited portion of the data rather than all of it. How do we do the data partitioning? We use Kafka as our data streaming platform, and we rely on its partitioned topics for this.

Poor Man’s Actor Model in F#

So, why F#? We use F# in production, especially for stateful components (stateful ETL or stateful API) that stream and process data. We even built our custom main-memory database in F#. In our experience, F# has shown many advantages over popular languages (C#, Java). Immutability, expressiveness, and a powerful type system make a program easier to control and predict. Modeling application domain logic, application state, and state transitions, or using a finite state machine (FSM) approach, is just a pleasure in F#. Moreover, it has performance parity with C#, so you can use it for high-performance scenarios. And last but not least, it has a well-designed, small, and ergonomic core of main features that flattens the learning curve and simplifies code maintenance.

Now, let’s talk about the TPL Dataflow. The .NET team has been building and maintaining the powerful TPL Dataflow library for a long time. This library is designed for building asynchronous data processing applications. At its core, TPL Dataflow is a set of constructs built on top of the task parallel library. To use it, we need the most basic construct from this library - ActionBlock<Msg>, which provides the actor’s mailbox and single-threaded execution. The most basic example of its use is as follows:
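A minimal sketch of such an actor, assuming the System.Threading.Tasks.Dataflow assembly is referenced. The Msg cases and the Publish and Stop members are illustrative names chosen for this sketch, not a definitive implementation.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

// Hypothetical message type: the operations this actor understands.
type Msg =
    | Add of key: string * value: int
    | Remove of key: string

type MyActor() =
    // Private state: only the message handler below ever touches it.
    let state = Dictionary<string, int>()

    let handle (msg: Msg) =
        match msg with
        | Add (key, value) -> state.[key] <- value
        | Remove key -> state.Remove(key) |> ignore

    // ActionBlock is the mailbox. With default options it runs the handler
    // for one message at a time, in the order the messages were posted.
    let mailbox = ActionBlock<Msg>(Action<Msg>(fun msg -> handle msg))

    /// Enqueues a message and returns immediately.
    member _.Publish(msg: Msg) = mailbox.Post(msg) |> ignore

    /// Stops accepting messages; the task completes once the mailbox is drained.
    member _.Stop() : Task =
        mailbox.Complete()
        mailbox.Completion

// Usage: publish a couple of messages, then drain the mailbox.
let actor = MyActor()
actor.Publish(Add("a", 1))
actor.Publish(Remove "a")
actor.Stop().Wait()
```

The Stop member is only there so a caller can wait for queued messages to be processed; a long-lived service would typically call it on shutdown.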

So far, nothing unusual: we have created an actor that updates its state (Dictionary<K,V>). Now let’s look at how this actor will work in a multithreaded environment:
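As a rough illustration, the sketch below publishes to the actor from many thread-pool threads at once. To keep the block self-contained, the actor is inlined as a closure inside a hypothetical runDemo function, and the key format and message count are arbitrary choices.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

type Msg =
    | Add of key: string * value: int
    | Remove of key: string

let runDemo () =
    // A plain, non-thread-safe dictionary owned by the actor.
    let state = Dictionary<string, int>()

    let handle (msg: Msg) =
        match msg with
        | Add (key, value) -> state.[key] <- value
        | Remove key -> state.Remove(key) |> ignore

    let mailbox = ActionBlock<Msg>(Action<Msg>(fun msg -> handle msg))

    // Many thread-pool threads publish concurrently...
    let publish (i: int) = mailbox.Post(Add(sprintf "key-%d" i, i)) |> ignore
    Parallel.For(0, 10000, Action<int>(fun i -> publish i)) |> ignore

    // ...but the handler still runs one message at a time.
    mailbox.Complete()
    mailbox.Completion.Wait()

    // Safe to read here: the actor has finished processing.
    state.Count

let count = runDemo ()
```

Every key is distinct, so the final count equals the number of published messages even though no lock is taken anywhere.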

Let me remind you that in this example, our actor modifies a Dictionary<K,V>, which is not thread-safe: concurrent operations on it might throw exceptions. So how can we safely modify a regular Dictionary<K,V> under concurrency? We can because the only thing that happens in parallel is publishing messages to the actor’s mailbox (ActionBlock<Msg>); the messages themselves are processed one at a time. This is why using the regular Dictionary<K,V> is absolutely fine. You can say that for such cases there is ConcurrentDictionary<K,V>, and you can do the same without the actor model. You are right, but what if you need to update two dictionaries simultaneously (atomically), guaranteeing that while one thread writes to the shared data, no other thread can read the modification half-complete? You will most likely need the standard, expensive, and slow lock to do this.

In the actor model, your unit of atomicity is an actor. Therefore, you don’t have to worry about lock(s) and their resulting problems (deadlocks, etc.). And as a bonus, you end up with straightforward code. Most importantly, it’s much harder for you to shoot yourself in the foot.
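To make the two-dictionary case concrete, here is a small sketch in which a single message updates two indexes. The Register message, the two dictionaries, and the runDemo function are hypothetical names used only for illustration.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

// Hypothetical message: register a user under both an id and a name.
type Msg = Register of userId: int * name: string

let runDemo () =
    let nameById = Dictionary<int, string>()
    let idByName = Dictionary<string, int>()

    // Both writes happen while handling one message, so no other message
    // can observe one index updated without the other.
    let handle (Register (userId, name)) =
        nameById.[userId] <- name
        idByName.[name] <- userId

    let mailbox = ActionBlock<Msg>(Action<Msg>(fun msg -> handle msg))

    let publish (i: int) = mailbox.Post(Register(i, sprintf "user-%d" i)) |> ignore
    Parallel.For(0, 1000, Action<int>(fun i -> publish i)) |> ignore

    mailbox.Complete()
    mailbox.Completion.Wait()
    nameById.Count, idByName.Count

let byIdCount, byNameCount = runDemo ()
```

With ConcurrentDictionary each write would be safe on its own, but keeping the pair consistent would still need a lock around both writes; here the actor's one-message-at-a-time rule plays that role.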

Get Result from Actor (Ask Pattern)

Up to this point, we’ve seen how to modify an actor’s state, but we haven’t seen how we can read an actor’s state. Reading the actor’s state sounds simple, but the nuance is that we must read it in a thread-safe manner. To do so we should follow the same approach by publishing a message to the actor’s mailbox. It is clear how to send an Add(key, value) or Remove(key) message but how do we express the GetValue(key) operation? The standard approach in such cases is providing a reply channel that the actor will use to respond. Basically, we should define a GetValue(key, reply) message. But, what should we use to represent the reply channel? For this purpose, we can use TaskCompletionSource<Result> which helps us to build an asynchronous Task. The caller will wait for the actor to receive the GetValue(key, reply) message and set a result into this reply channel (reply.SetResult(value)).

This is what it looks like in use.
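A minimal sketch of the ask pattern under these assumptions: the reply carries an int option so that a missing key can be reported, and the member is named GetValueAsync purely for illustration.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

type Msg =
    | Add of key: string * value: int
    | Remove of key: string
    // The reply channel travels inside the message.
    | GetValue of key: string * reply: TaskCompletionSource<int option>

type MyActor() =
    let state = Dictionary<string, int>()

    let handle (msg: Msg) =
        match msg with
        | Add (key, value) -> state.[key] <- value
        | Remove key -> state.Remove(key) |> ignore
        | GetValue (key, reply) ->
            // The read runs inside the actor, so it is serialized with all writes.
            match state.TryGetValue(key) with
            | true, value -> reply.SetResult(Some value)
            | _ -> reply.SetResult(None)

    let mailbox = ActionBlock<Msg>(Action<Msg>(fun msg -> handle msg))

    member _.Publish(msg: Msg) = mailbox.Post(msg) |> ignore

    /// Sends a GetValue message and returns a task the caller can await.
    member _.GetValueAsync(key: string) : Task<int option> =
        // RunContinuationsAsynchronously keeps the caller's continuation
        // from running on the actor's own processing thread.
        let reply = TaskCompletionSource<int option>(TaskCreationOptions.RunContinuationsAsynchronously)
        mailbox.Post(GetValue(key, reply)) |> ignore
        reply.Task

// Usage (blocking on .Result only to keep the demo short).
let actor = MyActor()
actor.Publish(Add("a", 1))
let found = actor.GetValueAsync("a").Result
let missing = actor.GetValueAsync("b").Result
```

Because the mailbox is a FIFO queue, the GetValue message is handled after the preceding Add, so the caller sees the value it just wrote.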

Actor’s Event Stream via IObservable

Another way you can retrieve data from actors is by exposing the actor’s event stream. This method is usually helpful if you have multiple listeners that want to respond to certain actor events. You can imagine the actor representing the room with the thermometer, and you want to handle the event that will be raised when the max threshold is reached. In this case, you may have several listeners that will be interested in handling such events differently and handling them in real-time.

Going back to our example, we will model a change stream that represents all changes that happen in our actor. Basically, after adding a new record or deleting one, our actor will emit the corresponding event (Added or Removed). So let’s try to add these two main events that our actor will produce.
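One possible way to sketch this is shown below. It assumes F#'s built-in Event type as the event source (a Subject from System.Reactive would fill the same role), and the ActorEvent type and Events member are names picked for this sketch.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

type Msg =
    | Add of key: string * value: int
    | Remove of key: string

// Events the actor emits after it has changed its state.
type ActorEvent =
    | Added of key: string * value: int
    | Removed of key: string

type MyActor() =
    let state = Dictionary<string, int>()
    let events = Event<ActorEvent>()

    let handle (msg: Msg) =
        match msg with
        | Add (key, value) ->
            state.[key] <- value
            events.Trigger(Added(key, value))
        | Remove key ->
            // Only announce a removal that actually happened.
            if state.Remove(key) then events.Trigger(Removed key)

    let mailbox = ActionBlock<Msg>(Action<Msg>(fun msg -> handle msg))

    member _.Publish(msg: Msg) = mailbox.Post(msg) |> ignore

    /// The change stream; IEvent<'T> is also an IObservable<'T>.
    member _.Events = events.Publish :> IObservable<ActorEvent>

    member _.Stop() : Task =
        mailbox.Complete()
        mailbox.Completion

// Usage: subscribe first, then publish.
let actor = MyActor()
let received = ResizeArray<ActorEvent>()
let subscription = Observable.subscribe (fun e -> received.Add e) actor.Events

actor.Publish(Add("a", 1))
actor.Publish(Remove "a")
actor.Publish(Remove "never-added")   // produces no event
actor.Stop().Wait()
subscription.Dispose()
```

In this sketch subscribers are invoked synchronously while the actor handles the message, so a slow listener delays the actor; listeners that do heavy work would want to hand the event off to their own queue.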

Actor and Blocking Async IO

You may need to perform Blocking Async IO inside an actor for many use cases. A typical scenario would be to save an actor’s state to a database or handle an HTTP request. TPL Dataflow has everything you need to cover such cases using the familiar async-await. The ActionBlock<Msg> that we use to implement the actor supports a constructor that accepts the async message handler:

ActionBlock(receive: Msg -> Task)

which in turn allows us to process the message inside the actor asynchronously while maintaining single-threaded execution.

Let’s go back to our example. Imagine that we have a function saveActorState with the following signature:

saveActorState: Dictionary<string,int> -> Task

that can save the state of our actor. The save operation is asynchronous, but from the actor’s point of view it is blocking: until it finishes, the actor will not process any other message.
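A sketch of how this could look, assuming F# 6 or later for the task expression. The Save message, the simulated 50 ms delay, and the savedSnapshots list are stand-ins for real storage, added only so the behavior can be observed.

```fsharp
open System
open System.Collections.Generic
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

type Msg =
    | Add of key: string * value: int
    | Save

// Records the dictionary size before and after each simulated save.
let savedSnapshots = ResizeArray<int * int>()

// Hypothetical stand-in for a real database call.
let saveActorState (state: Dictionary<string, int>) : Task =
    let work =
        task {
            let before = state.Count
            do! Task.Delay(50)   // simulated I/O round trip
            savedSnapshots.Add((before, state.Count))
        }
    work :> Task

type MyActor() =
    let state = Dictionary<string, int>()

    // The handler returns a Task; the block waits for it to finish
    // before it starts handling the next message.
    let handle (msg: Msg) : Task =
        match msg with
        | Add (key, value) ->
            state.[key] <- value
            Task.CompletedTask
        | Save -> saveActorState state

    let mailbox = ActionBlock<Msg>(Func<Msg, Task>(fun msg -> handle msg))

    member _.Publish(msg: Msg) = mailbox.Post(msg) |> ignore

    member _.Stop() : Task =
        mailbox.Complete()
        mailbox.Completion

// Usage: the second Add is queued while the first Save is still running.
let actor = MyActor()
actor.Publish(Add("a", 1))
actor.Publish(Save)
actor.Publish(Add("b", 2))
actor.Publish(Save)
actor.Stop().Wait()
```

Each snapshot has the same size before and after the delay, which shows that the Add queued behind a Save was not applied until that save had finished.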

TPL Dataflow Benchmarks

For benchmarking, let’s use the actor from our examples (MyActor) that receives messages from different threads and adds elements to the Dictionary<K,V>. The main idea of this benchmark is to evaluate the throughput of TPL Dataflow in multi-threaded mode with high contention. Also, for comparison, we will look at how good the ConcurrentDictionary<K,V> is under high contention.
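The shape of such a comparison can be sketched with a plain Stopwatch harness, as below. This is not a rigorous benchmark setup: the operation count, the 100-key range used to create contention, and the function names are all assumptions, and its timings are not comparable with the figures reported in this section.

```fsharp
open System
open System.Collections.Concurrent
open System.Collections.Generic
open System.Diagnostics
open System.Threading.Tasks
open System.Threading.Tasks.Dataflow

let operations = 100000   // arbitrary size for the sketch
let keyRange = 100        // few keys, so writers collide often

// Variant 1: an actor that owns a plain Dictionary.
let runActor () =
    let state = Dictionary<int, int>()
    let mailbox = ActionBlock<int>(Action<int>(fun i -> state.[i % keyRange] <- i))
    Parallel.For(0, operations, Action<int>(fun i -> mailbox.Post(i) |> ignore)) |> ignore
    mailbox.Complete()
    mailbox.Completion.Wait()
    state.Count

// Variant 2: threads writing straight into a ConcurrentDictionary.
let runConcurrentDictionary () =
    let state = ConcurrentDictionary<int, int>()
    Parallel.For(0, operations, Action<int>(fun i -> state.[i % keyRange] <- i)) |> ignore
    state.Count

// Times one run and prints the elapsed wall-clock time.
let time (name: string) (run: unit -> int) =
    let stopwatch = Stopwatch.StartNew()
    let count = run ()
    stopwatch.Stop()
    printfn "%s: %d keys, %d ms" name count stopwatch.ElapsedMilliseconds
    count

let actorKeys = time "ActionBlock + Dictionary" runActor
let concurrentKeys = time "ConcurrentDictionary" runConcurrentDictionary
```

For numbers worth quoting, a dedicated benchmarking harness with warm-up runs and allocation tracking is the better tool; this sketch only shows what is being compared.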

Here is the result when we bomb from multiple threads with high contention.

Here is the result when we bomb from a single thread.

Ok, let’s analyze the received results. The representative columns for us are:

  • mean - shows an average time value of the executed operation
  • allocated - shows all memory allocations during the benchmark
  • gen (0,1,2) - shows the number of GC collections per 1000 operations for that generation.

So from our benchmark, you can see that for 1 million operations, TPL Dataflow (mean=126.18 ms) performed slightly faster than ConcurrentDictionary (mean=150.13 ms). It also allocated less memory: TPL Dataflow (allocated=76 MB) vs. ConcurrentDictionary (allocated=85 MB). These results show that TPL Dataflow provides solid throughput in multi-threaded mode with high contention.

You can find the source code of this benchmark at the following link.

Actor Model Nirvana

To finish, I would like to mention my favorite research papers about the actor model as used in industry.

  1. Orleans: Distributed Virtual Actors for Programmability and Scalability
    It’s an interesting paper where the “Virtual Actor model” was first introduced. Initially, many people were skeptical about this kind of extension of the actor model, but over time the model was adopted and grew in popularity. From my personal experience, the virtual actor model vs. the regular one is like C# vs. C++. Even though Orleans promotes virtual actors as the default approach for modeling, it also supports many features that give you flexible, manual control over your actors.
  2. An architecture and implementation of the actor model of concurrency
    It’s a small paper that describes the initial architectural approach for implementing actors in the Swift programming language. It also includes a short comparison between different actor implementations and their integration into programming languages (Erlang, Kotlin, Pony). After reading this paper, I recommend looking at the final version of the actor model integration in Swift. It seems to me that the Orleans actor framework, combined with the async/await + Task<T> constructs, heavily inspired the authors of Swift.
  3. Deterministic Actors
    This paper describes the problem of non-deterministic behavior in the actor model and provides a solution to it. The problem boils down to the actor’s default way of communication, which has the following function signature: (publish: Message → unit). This signature emphasizes the “fire and forget” communication style, with non-deterministic behavior as a consequence.

Actor Model + Data Storage = Actor Database

For some users, using the actor model leads to the replacement of components that speed up data access, such as caching layers. However, after playing with actors more, they usually conclude that actors should be treated as sources of truth. They can rely on the ACID guarantee that caching layers can’t provide. All such ideas and approaches lead to the phrase: Actor Model + Data Storage = Actor Database. I prefer to stop here, but some researchers (including Microsoft’s research department) do not think so. They started developing extensions that bring more DB-related features to the actor model.

  1. Actor-Oriented Database Systems
  2. Indexing in an Actor-Oriented Database
  3. Transactions for Distributed Actors in the Cloud
  4. Hybrid Deterministic and Nondeterministic Execution of Transactions in Actor Systems
  5. Geo-Distribution of Actor-Based Services
  6. Reactors: A Case for Predictable, Virtualized Actor Database Systems
  7. Anna: A KVS For Any Scale
    It’s a mind-blowing paper about building a scalable KVS by using the actor model + CRDT for strong eventual consistency. Anna is a partitioned, multi-mastered system that achieves high performance and elasticity via wait-free execution and coordination-free consistency. If you ever think about modeling a KVS using actors, you will quickly discover that having an actor per key gives a pretty powerful consistency guarantee. On the other hand, a single actor seriously limits throughput, which is a big problem, especially when you have many users (a one-to-many relation) wanting to work with a single actor. In other words, the actor, as a unit of scaling, does not itself scale. The authors of this publication propose to replicate the actor (clone the actor state onto several nodes), which allows the actor itself to scale out. In this way, multiple nodes can work in parallel with their own copy of the actor state. To keep these copies in sync, the authors propose multi-master replication with a coordination-free approach based on CRDT. Spoiler :) With such an architecture, Anna can give you only causal consistency but cannot provide strong consistency at the key level. From my perspective, it’s a limiting factor for many production use cases.
  8. Autoscaling Tiered Cloud Storage in Anna

Summary

That’s pretty much it. We covered how to use the actor model design pattern with TPL Dataflow, how to implement the “ask pattern,” and how to add an actor’s event stream. All of these are basic yet powerful abstractions, and you can benefit from using them in your daily work. You can access all the source code that I used in this blog post by the following link.

As a final thought, let’s discuss the pros and cons of using an actor model. Generally speaking, I would say that one of the main advantages that the actor model provides is a foundation for building concurrent and distributed systems based on high-level primitives. Hence, it shifts your modeling into well-defined and explicit processes that are easier to control (if you cook them right), reason about, and scale. The notion of scale is a big deal for modern systems that nowadays can’t be underestimated. The actor model scales well for single-node systems as well as for multiple node systems. A good candidate for applying the actor model will be any business application or distributed messaging system that requires concurrency.

All this is impressive, but are there reasons for not using the actor model? There are other techniques for building concurrency: coroutines, Software Transactional Memory (STM), lock-free primitives, locks/mutexes/semaphores. Still, they usually operate at a lower level than the actor model. The actor model will probably not be the best candidate for building concurrent data structures or low-level algorithms that can work faster with lock-free primitives. However, nowadays we are less and less likely to encounter challenges where we need to develop concurrent data structures from scratch for our business applications.

As a rule of thumb, if you don’t want a headache from debugging and fixing concurrency bugs or catching deadlocks, then the actor model can be a good solution for you. Of course, it will not magically solve all your issues, but it will help you prevent many of them.
