AlterNats: High-Performance .NET PubSub Client and How to Implement Optimized Socket Programming in .NET 6

Jun 21, 2022

Last month, I released a new .NET PubSub library for NATS, a cloud-native, open-source, high-performance messaging system.

GitHub - Cysharp/AlterNats: An alternative high performance NATS client for .NET.


AlterNats is more than three times faster than the existing official client and five times faster than PubSub on StackExchange.Redis, and all the usual PubSub methods are zero-allocation.

PubSub is easy to do in C# with Redis Pub/Sub. There is a proven library, StackExchange.Redis, and AWS, Azure, and GCP all offer managed Redis services, so it is easy to adopt. Still, I doubted that using it blindly is the right approach.

Redis is primarily a KVS, and PubSub is more of a bonus feature. That raises several concerns:

Lack of monitoring for PubSub
Clustering support for PubSub
Unbalanced pricing of managed services (PubSub does not need much memory)
Performance

Since NATS is specialized for PubSub, it has a rich feature set for that purpose, and its performance looks excellent. The only drawback is that there is no managed service, but if you use NATS purely for PubSub, you do not have to think about persistence, so I think it is one of the easiest pieces of middleware to operate. (NATS itself can support guaranteed at-least-once and exactly-once delivery with a feature called NATS JetStream, but that requires storage.)

However, the official client, nats.net, doesn’t look so good: it doesn’t support async/await and its API is archaic.

The reason is stated clearly in the ReadMe: the client follows the same code base as the Go client for maintainability. As a result, many parts are not C#-like, and because high-performance code is written very differently in Go and C#, it is unlikely to be optimal for performance either.

So we decided it would be better to build our own client dedicated entirely to C#. Compared to the official client, it does not support every feature (it supports neither JetStream nor TLS, which is expected to be essential for operating Leaf Nodes), but it is specialized for NATS Core PubSub so that it can achieve the highest speed. For PubSub use, nothing should be missing.

Intro to AlterNats

AlterNats’s API fully adopts async/await and keeps a C#-native style.

NatsOptions and ConnectOptions are immutable records, so they can be customized with C#’s with expression.

NATS also provides a standard request-reply protocol for receiving results. It can be useful in some cases as a simple RPC between servers.
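To illustrate what an immutable options record with a with expression looks like, here is a minimal sketch. `SampleOptions` and its members are hypothetical stand-ins invented for this example; they are not the actual definition of NatsOptions or ConnectOptions.

```csharp
using System;

// Start from shared defaults and derive a customized copy; the original is never mutated.
var defaults = SampleOptions.Default;
var custom = defaults with { Url = "nats://example.internal:4222", Echo = false };

Console.WriteLine(defaults.Url); // nats://localhost:4222
Console.WriteLine(custom.Url);   // nats://example.internal:4222

// Hypothetical options type for illustration only (assumed members, not the real NatsOptions).
public sealed record SampleOptions(string Url, bool Echo, TimeSpan ConnectTimeout)
{
    public static readonly SampleOptions Default =
        new("nats://localhost:4222", true, TimeSpan.FromSeconds(2));
}
```

Because the record is immutable, a default instance can be shared safely across connections while each caller derives its own variation.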

High-Performance Socket Programming

Use Better Socket API

The class that handles network processing at the lowest level in C# is Socket. For asynchronous, high-performance processing, you traditionally had to use callbacks while carefully reusing SocketAsyncEventArgs.

However, there are now easy-to-use async/await methods that do not require handling the complicated SocketAsyncEventArgs yourself. There are many Async overloads, though, and you need to choose the right ones.

The easy way to tell is to pick the overloads that return ValueTask.

The APIs that return ValueTask use AwaitableSocketAsyncEventArgs internally as the backing source of the ValueTask. The instance is reused (it is returned to the socket once awaited), which gives efficient asynchronous processing without Task allocation. This is a great improvement over using SocketAsyncEventArgs directly, which is very difficult, and I highly recommend it.

Also note that the synchronous APIs can take Span, but the asynchronous APIs can only take Memory (because the state has to live on the heap). This is not specific to Socket programming; it applies to asynchronous APIs in general. If the overall design is not well organized, the inability to use Span can become a barrier, so make sure the whole pipeline can work with Memory.
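As a rough sketch of the overloads in question, the following sends and receives over a loopback connection using the `ReadOnlyMemory<byte>` / `Memory<byte>` overloads of `Socket.SendAsync` and `Socket.ReceiveAsync`, which return `ValueTask<int>`. The loopback setup is only there to make the example self-contained; it is not how AlterNats connects.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

// Loopback listener on an OS-assigned port so the sketch runs without external services.
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen(1);

using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
await client.ConnectAsync(listener.LocalEndPoint!);
using var server = await listener.AcceptAsync();

// ReadOnlyMemory<byte> + CancellationToken overload: returns ValueTask<int>.
ReadOnlyMemory<byte> ping = Encoding.ASCII.GetBytes("PING\r\n");
int sent = await client.SendAsync(ping, SocketFlags.None, CancellationToken.None);

// Memory<byte> + CancellationToken overload: returns ValueTask<int>.
// Note: a single receive may return fewer bytes than were sent; real code must loop.
Memory<byte> buffer = new byte[64];
int received = await server.ReceiveAsync(buffer, SocketFlags.None, CancellationToken.None);

Console.WriteLine($"sent {sent} bytes, received {received} bytes");
Console.WriteLine(Encoding.ASCII.GetString(buffer.Span.Slice(0, received)).TrimEnd());
```

The overloads that take `ArraySegment<byte>` without a CancellationToken return `Task<int>` instead, which is the allocation the ValueTask-based overloads avoid.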

Binary code determination for Text Protocols

The NATS protocol is a text protocol, similar to Redis and others, so it can easily be parsed with string processing. Implementing it with StreamReader is simple, since all you have to do is call ReadLine. However, what flows over the network is (UTF8) binary data, and converting it to strings adds overhead, so if performance matters it must be processed as binary data.

NATS determines the type of an incoming message from its leading string (INFO, MSG, PING, +OK, -ERR, etc.). It would be easy to split on whitespace and write if (msg == "INFO") and so on, but that overhead is not acceptable for performance reasons.

Since INFO is [73, 78, 70, 79], it is not a bad idea to use Slice(0, 4).SequenceEqual to check for it. ReadOnlySpan<byte>’s SequenceEqual is heavily optimized (for long data it uses SIMD) and is different from LINQ’s SequenceEqual.

But let’s be greedier: every protocol identifier sent by the server is at most 4 characters long. In other words, it can easily be converted to an Int! So AlterNats’ message dispatching code reads the first 4 bytes as an Int and compares it against precomputed constants.

I don’t think it is possible to judge faster than this, so we can say it is theoretically the fastest. 3-character instructions are always followed immediately by a space or line break, so the constants for those instructions include that character.
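The span-based comparison can be sketched as follows. The sample line is made up for illustration; only the `MemoryExtensions.SequenceEqual` call on `ReadOnlySpan<byte>` is the point.

```csharp
using System;
using System.Text;

// A made-up INFO line as it might arrive from the network (raw bytes, not a string).
ReadOnlySpan<byte> line = Encoding.ASCII.GetBytes("INFO {\"server_id\":\"demo\"}\r\n");

// ASCII bytes of "INFO".
byte[] info = { 73, 78, 70, 79 };

// Span-based SequenceEqual (System.MemoryExtensions), not LINQ's Enumerable.SequenceEqual.
bool isInfo = line.Length >= 4 && line.Slice(0, 4).SequenceEqual(info);

Console.WriteLine(isInfo); // True
```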
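The idea of comparing the first four bytes as a single integer can be sketched like this. The type and member names (`ServerOp`, `ServerOpCodes`, `Classify`) are chosen for this example, and the constants assume a little-endian machine; they are simply the four ASCII bytes of each identifier read as an `int`.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

Console.WriteLine(ServerOpCodes.Classify(Encoding.ASCII.GetBytes("PING\r\n")));        // Ping
Console.WriteLine(ServerOpCodes.Classify(Encoding.ASCII.GetBytes("+OK\r\n")));         // Ok
Console.WriteLine(ServerOpCodes.Classify(Encoding.ASCII.GetBytes("MSG foo 1 5\r\n"))); // Msg
Console.WriteLine(ServerOpCodes.Classify(Encoding.ASCII.GetBytes("HELLO\r\n")));       // Unknown

enum ServerOp { Unknown, Info, Msg, Ping, Pong, Ok, Error }

static class ServerOpCodes
{
    // Four ASCII bytes interpreted as a little-endian int.
    const int Info = 1330007625;  // "INFO"
    const int Msg = 541545293;    // "MSG " (3 characters plus the following space)
    const int Ping = 1196312912;  // "PING"
    const int Pong = 1196314448;  // "PONG"
    const int Ok = 223039275;     // "+OK\r" (3 characters plus the following CR)
    const int Error = 1381123373; // "-ERR"

    public static ServerOp Classify(ReadOnlySpan<byte> line)
    {
        if (line.Length < 4) return ServerOp.Unknown;

        // One 4-byte read and one integer comparison per candidate; no string is created.
        return MemoryMarshal.Read<int>(line) switch
        {
            Info => ServerOp.Info,
            Msg => ServerOp.Msg,
            Ping => ServerOp.Ping,
            Pong => ServerOp.Pong,
            Ok => ServerOp.Ok,
            Error => ServerOp.Error,
            _ => ServerOp.Unknown,
        };
    }
}
```

On a big-endian platform the constants would differ, so portable code would need to compute them from the bytes or check `BitConverter.IsLittleEndian`.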

Automatic Pipelining

All writes and reads in the NATS protocol are pipelined (batched). This is easiest to explain with pipelining in Redis: if you send three messages one at a time and wait for a response after each, the repeated round-trips of sending and receiving become a bottleneck.

When sending, AlterNats pipelines automatically: using System.Threading.Channels, messages are first put into a queue, and a write loop takes out everything available and writes it as one batch. While a network send is in progress, new messages accumulate in the queue; when the send completes, the loop batches those accumulated messages into the next send. This write loop is what achieves the fastest write throughput.

The benefit is not only in round-trip time (in NATS the Publish and Subscribe sides are independent, so there is no waiting for a response anyway); it is also highly effective at reducing the number of consecutive system calls.
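A minimal sketch of such a write loop, assuming string commands and recording batch sizes in a list instead of writing to a real socket. `WriteLoopAsync` and the `sends` list are names invented for this example, not AlterNats internals.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>(new UnboundedChannelOptions
{
    SingleReader = true,  // only the write loop reads
    SingleWriter = false, // any caller may publish
});

// Queue three commands before the loop starts, so a single drain sees all of them.
for (var i = 0; i < 3; i++)
{
    channel.Writer.TryWrite($"PUB foo 1\r\n{i}\r\n");
}
channel.Writer.Complete();

var sends = new List<int>(); // stand-in for the socket: bytes written per "send"
await WriteLoopAsync(channel.Reader, sends);

Console.WriteLine($"{sends.Count} send(s), {sends[0]} bytes"); // 1 send(s), 42 bytes

static async Task WriteLoopAsync(ChannelReader<string> reader, List<int> sends)
{
    var buffer = new ArrayBufferWriter<byte>();
    while (await reader.WaitToReadAsync())
    {
        // Drain everything that queued up while the previous send was in flight.
        while (reader.TryRead(out var command))
        {
            Encoding.ASCII.GetBytes(command.AsSpan(), buffer);
        }

        // A real loop would await socket.SendAsync(buffer.WrittenMemory, ...) here.
        sends.Add(buffer.WrittenCount);
        buffer.Clear();
    }
}
```

Three queued commands end up in one simulated send, which is the effect that reduces the number of system calls.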

.NET’s fastest logger, ZLogger, takes the same approach.

Packing Many Functions into a Single Object

To implement a PublishAsync method on top of this, the data has to be put into a write message object that is queued in the channel, so it is held on the heap. An asynchronous method that waits until the write completes also needs a promise.

To implement such an API efficiently, let’s pack all the functions into the single message object (internally named Command) that must be allocated anyway.

This object (AsyncPublishCommand<T>) itself has the role (ICommand) to hold T data and write it as binary data to the Socket.

In addition, by being an IValueTaskSource, this object itself becomes a ValueTask.
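As a sketch of how one object can hold the data and also back a ValueTask, the following uses `ManualResetValueTaskSourceCore<T>` from the BCL. `SamplePublishCommand<T>` and `MarkWritten` are hypothetical names for this example; this is a simplified illustration, not the actual AsyncPublishCommand<T>.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

var command = new SamplePublishCommand<string>("hello");
ValueTask pending = command.AsValueTask(); // no Task is allocated; the command backs the ValueTask

Console.WriteLine(pending.IsCompleted); // False
command.MarkWritten();                  // what a write loop would call once the bytes are sent
await pending;
Console.WriteLine("completed");

// Hypothetical, simplified command: holds the value and acts as the promise.
sealed class SamplePublishCommand<T> : IValueTaskSource
{
    // Mutable struct: must not be marked readonly.
    ManualResetValueTaskSourceCore<bool> core;

    public T Value { get; }

    public SamplePublishCommand(T value) => Value = value;

    public ValueTask AsValueTask() => new ValueTask(this, core.Version);

    public void MarkWritten() => core.SetResult(true);

    void IValueTaskSource.GetResult(short token) => core.GetResult(token);

    ValueTaskSourceStatus IValueTaskSource.GetStatus(short token) => core.GetStatus(token);

    void IValueTaskSource.OnCompleted(Action<object?> continuation, object? state, short token,
        ValueTaskSourceOnCompletedFlags flags) => core.OnCompleted(continuation, state, token, flags);
}
```

When such an object is pooled, `core.Reset()` would be called before reuse so that a stale ValueTask cannot observe the next operation.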

Then, the continuation that runs after await must be dispatched to the ThreadPool so that it does not block the write loop. The traditional ThreadPool.QueueUserWorkItem(callback) causes an extra allocation, because it internally creates a ThreadPoolWorkItem and puts it into the queue. By implementing IThreadPoolWorkItem, available since .NET Core 3.0, you can eliminate that internal ThreadPoolWorkItem allocation.

Finally, only one object now needs to be allocated, and pooling that object makes the path zero-allocation. An object pool is easy to implement with ConcurrentQueue<T> or similar, but making the object itself a node in a stack avoids allocating a backing array. Such a stack can also be implemented lock-free and optimized for this kind of caching use.
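A small sketch of queuing an object that implements `IThreadPoolWorkItem` via `ThreadPool.UnsafeQueueUserWorkItem`. `CallbackWorkItem` is a name made up for this example. In this demo the lambda and the wrapper are themselves allocated; in the design described above, the command object would implement the interface directly so that nothing extra is created.

```csharp
using System;
using System.Threading;

using var done = new ManualResetEventSlim();
var ranOnPool = false;

var item = new CallbackWorkItem(() =>
{
    ranOnPool = Thread.CurrentThread.IsThreadPoolThread;
    done.Set();
});

// The object itself is queued; the pool does not need to wrap a delegate in another work item.
ThreadPool.UnsafeQueueUserWorkItem(item, preferLocal: false);
done.Wait();

Console.WriteLine(ranOnPool); // True

// Hypothetical work item for illustration.
sealed class CallbackWorkItem : IThreadPoolWorkItem
{
    readonly Action callback;

    public CallbackWorkItem(Action callback) => this.callback = callback;

    // Called by the thread pool when the item is dequeued.
    public void Execute() => callback();
}
```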
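One way to build a pool where each pooled object is itself a stack node is sketched below. `IPoolNode<T>`, `NodePool<T>`, and `SampleCommand` are names invented for this example. Instead of a full lock-free stack, this sketch guards the stack with a non-blocking gate: if another thread holds the gate, the caller simply gives up and allocates (or drops) an object, which is acceptable for a cache.

```csharp
#nullable enable
using System;
using System.Threading;

var pool = new NodePool<SampleCommand>();
var first = new SampleCommand();

pool.TryPush(first);
Console.WriteLine(pool.TryPop(out var reused) && ReferenceEquals(first, reused)); // True
Console.WriteLine(pool.TryPop(out _)); // False (pool is empty, caller would allocate)

interface IPoolNode<T> where T : class
{
    T? Next { get; set; }
}

sealed class SampleCommand : IPoolNode<SampleCommand>
{
    // The link lives inside the pooled object, so the pool needs no backing array.
    public SampleCommand? Next { get; set; }
}

sealed class NodePool<T> where T : class, IPoolNode<T>
{
    int gate; // 0 = free, 1 = held
    T? root;

    public bool TryPop(out T? result)
    {
        if (Interlocked.CompareExchange(ref gate, 1, 0) == 0)
        {
            var node = root;
            if (node is not null)
            {
                root = node.Next;
                node.Next = null;
                Volatile.Write(ref gate, 0);
                result = node;
                return true;
            }
            Volatile.Write(ref gate, 0);
        }

        // Empty or contended: never wait, just report a miss.
        result = null;
        return false;
    }

    public bool TryPush(T node)
    {
        if (Interlocked.CompareExchange(ref gate, 1, 0) == 0)
        {
            node.Next = root;
            root = node;
            Volatile.Write(ref gate, 0);
            return true;
        }
        return false; // contended: let the GC collect the object
    }
}
```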

Zero-copy Architecture

The data to publish or subscribe is usually C# types serialized to JSON, MessagePack, and so on. In that case byte[] is inevitably exchanged. For example, the content of a RedisValue in StackExchange.Redis is actually a byte[], and whether sending or receiving, a byte[] has to be created and held.

To avoid this, it is common to cheat by renting byte[] from ArrayPool and returning it afterwards to achieve zero allocation, but this still incurs the cost of copying. Zero-allocation is the goal, of course, but let’s work toward zero-copy, too!

AlterNats’ serializer interface requires IBufferWriter<byte> for Write and ReadOnlySequence<byte> for Read.

System.Text.Json (through Utf8JsonWriter) and MessagePack for C# can both serialize directly into an IBufferWriter<byte>. The serializer accesses and writes to the buffer provided for writing to the Socket via IBufferWriter<byte>, which eliminates the copying of byte[] between the Socket and the serializer.
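As a sketch of serializing straight into a buffer writer, the following uses System.Text.Json with `Utf8JsonWriter` over an `ArrayBufferWriter<byte>`. The `ArrayBufferWriter<byte>` stands in for the socket's outgoing buffer, and `Person` is a sample type for this example; neither is taken from AlterNats.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

// Stand-in for the outgoing socket buffer; any IBufferWriter<byte> works here.
var sendBuffer = new ArrayBufferWriter<byte>();

// Utf8JsonWriter writes into the IBufferWriter<byte> directly, with no intermediate byte[].
using (var writer = new Utf8JsonWriter(sendBuffer))
{
    JsonSerializer.Serialize(writer, new Person("alice", 30));
    writer.Flush();
}

var json = Encoding.UTF8.GetString(sendBuffer.WrittenSpan);
Console.WriteLine(json); // {"Name":"alice","Age":30}

record Person(string Name, int Age);
```

MessagePack for C# offers the same shape through `MessagePackSerializer.Serialize(IBufferWriter<byte>, T)`.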

On the read side, the class ReadOnlySequence<byte> is required, since the data received from the Socket is often fragmented.

A common pattern is to handle what is read with the PipeReader of System.IO.Pipelines, a library designed to make high-performance I/O easier. However, AlterNats does not use Pipelines; it uses its own reading mechanism together with ReadOnlySequence<byte>.

System.Text.Json (through Utf8JsonWriter and Utf8JsonReader) and MessagePack for C# can serialize into an IBufferWriter<byte> and deserialize from a ReadOnlySequence<byte>. In other words, modern serializers must support IBufferWriter<byte> and ReadOnlySequence<byte>.

NATS is very interesting middleware, and I hope you will give it a try.
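On the read side, a fragmented payload can be deserialized without first copying it into one contiguous array. The sketch below builds a two-segment `ReadOnlySequence<byte>` by hand to simulate data that arrived in two socket reads; `Segment` and `Person` are helper types written for this example.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

// Simulate one JSON payload that arrived split across two receive buffers.
var first = new Segment(Encoding.UTF8.GetBytes("{\"Name\":\"al"));
var last = first.Append(Encoding.UTF8.GetBytes("ice\",\"Age\":30}"));
var payload = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

Console.WriteLine(payload.IsSingleSegment); // False

// Utf8JsonReader reads across segment boundaries; the fragments are not merged beforehand.
var reader = new Utf8JsonReader(payload);
var person = JsonSerializer.Deserialize<Person>(ref reader);

Console.WriteLine(person); // Person { Name = alice, Age = 30 }

record Person(string Name, int Age);

// Minimal linked segment for building a multi-segment sequence.
sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(ReadOnlyMemory<byte> memory) => Memory = memory;

    public Segment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new Segment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```

MessagePack for C# accepts the same input through `MessagePackSerializer.Deserialize<T>(in ReadOnlySequence<byte>)`.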

Also, please see the ReadMe on GitHub, which describes an architecture for a metaverse that we are devising using AlterNats and other libraries.

MagicOnion is a networking framework based on gRPC and LogicLooper is a framework for server-side game loops, both provided by my company (Cysharp).