iksemel rusted

I had written some Rust code before, but it was for command-line tools, Advent of Code solutions, and similar small-scale programs. I was looking for a suitable project to delve deeper into Rust and decided to rewrite an old C project of mine in Rust!

Iksemel was a fast and easy-to-use XML/XMPP library which I developed over two decades ago. It was written in highly portable C and used in all sorts of places, including: an instant messenger on the Amiga home computer; an embedded sensor with 64 kilobytes of memory that posted measurements to users on Google Talk; the Scientific and Technological Research Council of Turkey's Linux distribution's Pisi Package Manager via its Python bindings; and a bunch of Symbian devices.

Did you know that Google Talk used to speak the open XMPP protocol before it was enshittified?

I haven't worked much on it recently since neither the XML format nor the XMPP instant messaging protocol is widely used anymore. But it was perfect for the task. The document tree was managed with self-referential and intrusive data structures with custom memory management, a challenge to the borrow checker! There were lots of low-level tricks for performance. It was also a complete package with tests, documentation, tools, and Python bindings, hence many areas to explore.

You can find the shiny new Iksemel Rust code here and the preliminary Python bindings here.

The rewrite took longer than I expected. The official Rust documentation is decent for someone coming from high-level languages, and perhaps even for those with no programming background, but it is not enough for C folks. The necessary information is spread between blog posts, issue comments, source code, and the collective wisdom of the community. This motivated me to write this post.

Despite the learning phase, the overall development process was much nicer and faster than the C version! I remember constant crashes and hard-to-debug bugs. The Rust version was immediately crash-free despite the use of unsafe, most of the code worked like a charm on the first try, and debugging in rare cases was a breeze. This was a result of the combination of the strong type system, memory safety features, and the comprehensive test tooling of Rust.

I haven't had a chance to profile and optimize the actual parsing code, or take advantage of SIMD and other modern hardware yet, but the performance looks excellent already. All benchmarks are lies, so take this with a grain of salt, but it is faster than libxml2 and even the read-only roxmltree crate on several giant Wikipedia dumps I have tested with.

xmlThrDefParserInputBufferCreateFilenameDefault is an actual function name of libxml2.

Of the Borrow Checker

"Don't fight with the borrow checker!" is sound advice. Unfortunately there are some common and necessary programming patterns in C which initially seemed impossible to implement in Rust. I have found a four-step process to avoid fights:

0. Using values.

Solving the problem with clone()to_string(), etc., is usually the compiler's suggestion and general beginner advice. If the performance loss caused by this were generally acceptable, programs wouldn't be written in C in the first place, so this is not a solution C programmers would settle for. However, even in Iksemel, there were a few places where it did the job perfectly.

1. Disentangling the data.

In most cases, it is possible to get rid of circular references by refactoring, such as splitting a God object into parts with orthogonal responsibilities. This generally makes the program more testable, composable, and understandable as well, so it is a very good thing to do even if the final solution involves the next steps.

2. Using dedicated Rust facilities.

If the borrow checker lifetime model isn't working, the reference counting model can be used instead with Rc or Arc. If shared/interior mutability is desired, RefCell or Mutex exists. Pin can help with self-references. There are other goodies in the standard library and crates. They are well-implemented and their performance costs are negligible everywhere except under the most demanding hot loops. The majority of programs have no reason to look further beyond this step.

3. Using unsafe Rust.

If the language facilities are not enough, and performance is really important, this is the ultimate solution. There is a stigma attached to it, mostly because it is often used to bypass the borrow checker quickly without good justification, but it is perfectly fine to use when you have honestly worked through all the above steps and have a business requirement for the performance. After all, the standard library has a ton of unsafe code. You really must apply the same level of engineering discipline as the standard library, though. I will give a thorough checklist in a moment.

A good way to look at this is to understand that the safe Rust rejects all unsound programs but also rejects some sound programs because the borrow checker has both theoretical and practical limitations. The unsafe Rust on the other hand allows all sound programs, but also allows some unsound programs.

Another helpful approach is thinking unsafe as human-assisted type system.

Of Callbacks and Events

One of the entanglement points was the implementation of the parser hooks. Iksemel is built in layers, with a so-called SAX (serial access) parser which is more or less a tokenizer at the bottom, and the document and XMPP stream parsers on top of each other. When the SAX parser produces an element, it calls the registered hook with the element information. The API is exactly what you would expect from C:

typedef int (iksTagHook)(void *user_data, char *name, char **atts, int type);
typedef int (iksCDataHook)(void *user_data, char *data, size_t len);
iksparser *iks_sax_new (void *user_data, iksTagHook *tagHook, iksCDataHook *cdataHook);
int iks_parse (iksparser *prs, const char *data, size_t len, int finish);

The parser calls these registered hooks with the provided user_data to allow upper level parsers to maintain their state. Now, there are ways to implement these sorts of callbacks with Fn/FnMut/FnOnce and generics, but I realized how limiting this kind of API is going to be while playing with some examples, and implemented this instead:

pub fn parse_bytes<'a>(
        &'a mut self,
        bytes: &'a [u8],
    ) -> Result<Option<(SaxElement<'a>, usize)>, ParseError>

The return type looks a bit complicated but compiles into a very compact structure thanks to niche optimizations. In plain language, it either returns an error, a result of a pair of an encountered element and how many bytes of the buffer are processed, or a result of None saying that all bytes are processed without producing a complete element yet.

Note that the returned element is a reference and has the same lifetime as both the parser and the input buffer. This allows the parser to return a reference from either the input data or from the internal parser buffers if data needs to be modified, thus avoiding memory copies while ensuring that the caller cannot free either source before dealing with the returned reference.

Since the callback mess is gone, infinite layers of higher-level modules can be built on top without angering the borrow checker. In fact, the document tree parser mimics exactly the same API:

pub fn parse_bytes(&mut self, bytes: &[u8]) -> Result<(), ParseError> {}

pub fn into_document(mut self) -> Result<Document, ParseError> {}

All the incoming bytes are passed to the SAX parser internally, returned elements are inserted into the tree, and once everything is parsed, the final document could be retrieved.

Rust's ability to efficiently return complex types makes it easy to implement this kind of APIs, which are highly suitable for Sans-IO.

Of Lending Iterators

Not everything is perfect yet. One downside to this API is that the caller must handle the bookkeeping for the processed bytes and repeatedly call the function. I wanted to implement the standard Iterator trait for this so it could be used with a for loop; unfortunately, the trait does not support returning an item with a lifetime tied to the iterator itself.

This requires a pattern known as a "lending iterator," which is possible in Rust but incompatible with the standard Iterator trait. Here is the implementation:

impl<'a> SaxElements<'a> {
    pub fn new(parser: &'a mut SaxParser, bytes: &'a [u8]) -> Self { ... }

    pub fn next(&mut self) -> Option<Result<SaxElement<'_>, ParseError>> { ... }
}

It cannot be used with a for loop, but it is still very easy to use with while let:

let mut elements = parser.elements(b"<doc>example</doc>");
while let Some(result) = elements.next() {
     println!("Element parsed: {:?}", result?);
}

Of Self-Referential Structs

Iksemel stores the XML data inside a custom memory arena containing two bump-allocator areas: one for the element structs and one for the character data. Container structs are allocated within the arena chunks, and the element structs have internal pointers forming the tree structure, which makes navigation fast and easy. Reducing the allocation count and packing similar objects together like this has a huge performance impact, but it cannot be done with Safe Rust at the moment.

There are some crates like Bumpalo which implement bump allocation for the general case, but this wasn't enough for Iksemel. There are also a couple of crates for implementing self-referential structs, but I found they obscured what was going on due to heavy macro use, and they weren't flexible enough either.

Of Unsafe

Unsafe code can introduce undefined behavior. We don't want to go back to the misery of C by doing that, so I have collected an absolute minimum list of rules to ensure success.

0. Before even doing anything, the following resources must be internalized:

Pointer provenance:

1. Unsafe code must be contained within a safe API.

This is how Safe Rust is possible in the first place: by encapsulating the dangerous parts in an impossible-to-misuse abstraction. This API has two jobs: keeping the internal invariants of the unsafe code true, and not leaking anything to the outside which breaks the invariants of Safe Rust.

As an example, the XML document tree is represented as the struct Document, and any editing and navigation operations return a struct Cursor to move around the document and perform operations. The Document contains all the data, and the Cursor is essentially a reference to a certain node of the document tree.

let doc = Document::fromStr("<a><b>lala</b></a>");
let cursor: Cursor = doc.find_tag("b").first_child();
assert_eq!(cursor.cdata(), "lala");

These few lines contain a bunch of invariants to uphold. The biggest internal invariant is that the backing memory area is not released back to the system while the cursor is pointing to it. Thankfully, Rust already solves this problem with lifetimes:

pub struct Cursor<'a> {
    node: *mut Node,
    arena: &'a Arena,
}

impl Document {
    pub fn find_tag<'a>(&'a self, name: &str) -> Cursor<'a> {
        // ...
    }
}

Earlier versions of this code had PhantomData<Document> to link the lifetimes without an actual data member. Now it has a reference to the Arena member of the Document for reasons explained later in this post. Either way, this API ensures that the returned Cursor cannot stay alive longer than the Document. When the Document is dropped and the memory is freed, all the pointers into it are already gone.

On the other hand, a C compiler would happily compile this without an error:
iks *document, *cursor;
int e;
document = iks_tree("<a><b>lala</b></a>", &e);
if (IKS_OK == e) {
    cursor = iks_find(document, "b");
    iks_delete(document);
    printf("crash here = %s", iks_name(cursor));
}

Another internal invariant is that no modifications via other cursors can invalidate an existing Cursor. This is ensured by the design which leaves the modified or deleted nodes intact in place and just hides them by changing internal links. These links can only be followed by one Cursor at a time, which is further ensured by not declaring cursors as Sync.

External invariants are a little bit more complicated. Thankfully, there is enough guidance in the Rust Reference and the Rustonomicon book. You might have noticed that Cursor::cdata returns a string reference to the internal buffer. It uses the &str type to avoid copying. Therefore, it follows &str invariant rules by: a) attaching its lifetime to the Cursor to prevent early free, b) ensuring that the returned pointer is never null, and c) ensuring all the strings inside the document character data memory are valid UTF-8 strings at all times. 

2. There is Unsafe and there is YOLO.

Iksemel does use raw pointers, but they are all typed. When there is a need to calculate the size and alignment of an object to make space, the Layout module from the standard library is used instead of making assumptions. No conversion between a pointer and an integer is done. Pointer arithmetic to find space in the arena uses *const::add_byte and never jumps from one allocation area to another. std::mem::transmute is never used. The same memory area is never mapped to different structs.

The main theme between these avoided operations is that they challenge the compiler's model of data representation in memory. While there are rare cases in the standard library where they are needed, they are considerably riskier than regular unsafe use.

3. Unsafe code must have negative tests.

It is always a good thing to check that things which shouldn't happen are indeed not happening. This is quite easy with Cargo tests and doctests for runtime invariants. Testing that the compiler rejects API misuse turned out to be a bit harder. The best way I found to do this is to use rustdoc tests with the compile_fail flag like this:

/// Cursor clone cannot outlive the Document:
/// ```compile_fail
/// let c2: Cursor;
/// {
///     let doc = Document::from_str("<a><b/></a>")?;
///     let c1 = doc.root();
///     c2 = c1.clone();
/// }
/// println!("{}", c2);

Unfortunately, you cannot specify the expected error, so occasionally you have to run cargo test -- --no-capture to check that they are not failing for other reasons. I haven't tried it yet, but this could be a good candidate to automate via snapshot testing.

This check would have been completely unnecessary in Safe Rust because the borrow checker wouldn't let you return anything from c1 which references its own data without proper lifetime association. But since the compiler cannot check the intention behind the raw pointers, a simple typo in the method signature in unsafe Rust can detach two lifetimes from each other with disastrous results.

4. All unsafe code must be tested under Miri

Miri is a sort of interpreter which runs your Rust code and checks for any unsound operations at runtime. Note that the only way it can be as effective as the Safe Rust compiler is to exercise all codepaths. That means you must have a comprehensive test suite.

It is quite slow; Iksemel's test suite takes a second for the CI run, but takes 18 minutes with Miri. Still, the release action of the Iksemel crate is gated on a successful Miri run, and I run Miri whenever I make changes to the unsafe code as well.

5. Tests must be tested, preferably with cargo mutants

Miri is useless unless you execute all codepaths under inspection. There are different ways to ensure test coverage, but I found that mutation testing works better than others. Classic code coverage gives more importance to a hundred-line print banner function than to a single complicated conditional expression. More on that in the next chapter.

6. Unsafe code must have examples and documentation.

Using the library you wrote is super simple. Trying to explain how to use it to someone else, on the other hand, makes you realize even more corner cases, safety issues, and all sorts of awkwardnesses in your API.

7. Code must be idiomatic and heavily linted.

I wrote the Iksemel C code in GNU C style, which wasn't a great choice in hindsight, but it was still a lot better than not having a style at all. I'm happy that Rust syntax is a lot less flexible and tools like cargo fmt apply a highly opinionated style to everything. It helps when reading other people's code, using examples, and making unexpected things stand out amongst the surrounding lines. The last bit is especially important here since unsafe code depends on human review.

Rust also has a ton of lints to check problematic patterns and non-idiomatic code. Some of these are not always applicable, but it is generally easier to enable all and give permission with the expect attribute when you have a good reason to ignore them.

Iksemel enables all of them except the pedantic and restriction categories, and then cherry-picks some of those as well:

#![deny(clippy::suspicious)]
#![deny(clippy::complexity)]
#![deny(clippy::perf)]
#![deny(clippy::style)]
#![deny(clippy::cargo)]
#![deny(clippy::items_after_statements)]
#![deny(clippy::missing_panics_doc)]
#![deny(clippy::uninlined_format_args)]
#![deny(clippy::unnecessary_semicolon)]
#![deny(clippy::unreadable_literal)]
#![deny(clippy::allow_attributes_without_reason)]
#![deny(clippy::panic)]
#![deny(clippy::partial_pub_fields)]
#![deny(clippy::redundant_test_prefix)]

The allow_attributes_without_reason lint is particularly useful. It forces you to explain the rationale whenever you have to ignore lints like this:

#[expect(
    clippy::inherent_to_string_shadow_display,
    reason = "prereserving exact capacity makes this method significantly faster"
)]
fn to_string(&self) -> String { ... }

Of Mutation Testing

The following snippet is simplified from the unsafe code which handles the removal and insertion of an attribute using a classic linked list approach:

// in the attribute removal code
if (*tag).last_attribute == attr {
    (*tag).last_attribute = (*attr).previous;
}

// in the attribute insertion code
if !(*tag).last_attribute.is_null() {
    (*(*tag).last_attribute).next = attribute;
    (*attribute).previous = (*tag).last_attribute;
}
(*tag).last_attribute = attribute;

I was surprised when cargo-mutants showed a miss for the following mutation:

if (*tag).last_attribute != attr {

To get full coverage for the four branches in these two if statements, it is sufficient to test adding three attributes, removing the middle one, and then removing the last one. I had more tests than that, but the crucial scenario of removing the last element and then adding it back was forgotten!

Note that the bug exposed by this mutation depends on the order of operations, and mutant testing was able to find the testing gap quite easily.

Of Interior Mutability

I changed the Cursor API several times as I played with it while implementing higher-level parsers and command-line tools. The first iteration was like this:

#[repr(transparent)]
struct Cursor<'document> {
    node: *mut Node;
    _phantom: PhantomData<Document>;
}

This implementation was good enough for the navigation methods. But the editing methods required access to the backing Arena structure, which was inside the Document. The C version didn't have the Cursor and Document separation; every node contained an arena pointer and the same iks* type referred to both a complete document and individual nodes, but that approach wasted a pointer per node. So, to avoid going back, I tried the following:

let doc = Document::from_str("</test>");
doc.root().insert_cdata(&doc, "hello")?.append_cdata(&doc, " world!")?;

A worthy effort, but futile! The problem was the cursor lifetime we connected to the document. The borrow checker sees that and does not allow the document to be reborrowed by the same owner. Running away from the fight but keeping the lifetime safety, I changed it to:

pub struct Cursor<'a> {
    node: *mut Node,
    arena: &'a Arena,
}

It felt like the structure size increased at first, but in practice, the compiler can easily analyze and optimize out the arena reference copying.

At this point, astute readers have already noticed that there is something missing with the arena reference, or with the Cursor itself, for that matter. Both are shared references! So how can they allocate memory or edit the document?

It is quite useful in practice to have two or more cursors pointing to different elements while traversing and editing XML documents. This doesn't play well with the borrow checker's "only one mutable reference" rule and requires "interior mutability", which is usually implemented with a RefCell or related standard library facilities.

Raw pointers don't have this limitation, so all I had to do was not expose the mutability difference to the outside via the methods of Arena and Cursor.

Of Concurrency

The Document is Send but not Sync, which means it can be sent across threads but can only be accessed inside one thread at a time. If there is a need to access it from multiple threads simultaneously, it is easy to put it into an Arc<Mutex<Document>>, but that does not extend to the cursors. It is generally necessary to remember the last position in scenarios where processing happens in multiple steps. An XMPP server thread might append the newly received elements into a document's last position for the currently handled connection, for example.

Another case is releasing a Cursor or a Document into Python. Lifetimes and marker traits for thread safety do not exist in Python. This wasn't a big issue in the C version because Python was relatively safer than C anyway, and the free-threaded interpreter did not exist.

Unfortunately, implementing a multi-threaded cursor requires access to the internals of the Document and unsafe use, so it has to be provided by the Iksemel crate to keep it safe and contained.

The solution is a separate SyncCursor which you can create by consuming a Document. This is necessary to guarantee that no references to the document are alive, and no further thread-unsafe references can be created afterward. It uses Arc and Mutex internally for thread-safe access. The overhead is negligible and well worth the safety in multi-threaded operations or for Python.

Of Strings

Rust strings are amazing! You could argue that they can be implemented in C, and in fact, Iksemel stored the length of every string to avoid recalculating it repeatedly. However, it also had to null-terminate them because the user would eventually need to pass a char* pointer to functions expecting a null terminator.

char *iks_stack_strcat (ikstack *s, char *old, size_t old_len, const char *src, size_t src_len);

This C API concatenates the two strings and inserts them into the arena (called a "stack" there); if the first string is already in the arena and there is space after it, it is simply extended. All good and fast, but there are so many ways to misuse this, such as passing the same arena string twice. These can be handled with some extra checks, but not at compile time.

pub fn concat_str<'a>(&'a self, old_s: &str, s: &str) -> Result<&'a str, NoMemory>

This Rust API, on the other hand, can work with any &str the caller provides since extending the string does not need to remove a null terminator, and all inserted strings remain immutable even while being extended.

I was a bit worried initially about the Display trait. If I used that to convert the document tree into XML text, would it be fast enough? My concern was about the formatter object being passed to the trait, and the extra function call indirections going through that instead of just using String::push. To my amazement, the generated assembly code in release build mode and speed tests matched the direct implementation exactly! The Rust compiler basically removed all the indirections.

One detail that required intervention was avoiding extra allocations for the final string object:

fn to_string(&self) -> String {
        let mut buf = String::with_capacity(self.str_size());

Walking over the tree once to calculate the final string size via the str_size method and reserving that capacity before serializing gives a measurable speed boost even for smallish documents, and a very large boost for big ones.

It looks like there is some internal mechanism in the formatting code to provide a size hint, but it is not exposed to implementors of the Display trait.

Of the Standard Library

Speaking of the standard library, let's talk about the elephant in the room. The Rust standard library is a bit too small, a sentiment shared by many people I've spoken with at community events.

As an example, both C and Python have a getpass function to read a password from the user, but in Rust, you need to depend on a third-party crate which isn't well-maintained. This was used in the included iksjab utility, which lets you send XMPP messages and perform other tasks from the command line.

The name resolver was a bit awkward too, requiring Iksemel to inject the default XMPP port into the address provided by the user before doing the lookup, instead of allowing it to be specified or omitted like literally every other resolver API. I was able to avoid another dependency with:

pub(super) fn need_port(host: &str) -> bool {
    let column_pos = host.rfind(':');
    let bracket_pos = host.find(']');
    match (column_pos, bracket_pos) {
        (None, None) | (None, Some(_)) => true,
        (Some(_), None) => false,
        (Some(column), Some(bracket)) => column < bracket,
    }
}

fn resolve_host_with_default_port(
    host: &str,
    default_port: u16,
) -> std::io::Result<std::vec::IntoIter<SocketAddr>> {
    if need_port(host) {
        (host, default_port).to_socket_addrs()
    } else {
        host.to_socket_addrs()
    }
}

But it was a lot of code just to call a simple lookup function, and I got it wrong on the first try (did you notice the rfind?).

Thankfully, Iksemel was mostly self-sufficient, so this is more of an issue when you are coming from a language like Python. When you need something, it is extremely easy to add it via Cargo, but while there are dozens of crates for every problem, their quality is all over the place, and evaluating them is difficult.

Of Python Bindings

This turned out to be the easiest part. Literally:

$ maturin new
$ maturin generate-ci github > .github/workflows/CI.yml

I then added #[pyclass]#[pymethods], and similar decorators around my functions, and boom! A working set of Python bindings.

Maturin (build system) and Pyo3 (bindings) have thought of everything, and they are well-documented.

Of Modules

The official documentation suggests using a modern file structure like this:

sax.rs
sax/parser.rs
sax/tests.rs
document.rs
document/synccursor.rs
document/tests.rs

This is probably the only decision I don't agree with Rust amongst thousands. Thankfully, the original structure is still supported:

sax/mod.rs
sax/parser.rs
sax/tests.rs
document/mod.rs
document/synccursor.rs
document/tests.rs

This mimics the __init__.py or index.ts usage from other modern languages and makes manual navigation much easier. Many crates also seem to follow this convention.

Of the Aftermath

* If Python is "batteries included", then Rust is "power tools included". The unit tests, documentation generator, documentation tests, formatting and linting, dependency management, build system, runtime validation, analyzers for IDEs, instrumentation, benchmarking, and seamless Python binding generation is either part of the language or just one command away in the ecosystem.

Can we count this a productivity win for Rust, if C programmers don't bother with any of those though?


* Rust's type system is not new if you have used OCaml, Haskell, or a similar functional language before. The difference is that you get it with zero overhead. Having type safety while beating C in speed is glorious!

* Writing Rust code is like having a constructive conversation with a highly knowledgeable but strict intellectual. The conversation is always focused on trade-offs or edge cases. You can choose the easy solution if it is good enough for your requirements, such as using more memory with a clone, or just panicking when an unexpected scenario happens, but it is always an informed decision, and never hidden behind opaque abstractions.

* Unsafe Rust is not well-introduced to newcomers in the main sources and is a little bit unergonomic. The good news is that it is improving fast, which could be the reason why documentation is lagging behind a bit. The need for it is also decreasing as the language and the ecosystem evolve, but C and C++ programmers will still want it. The other good news is that if you follow the guidelines, you can still achieve a level of safety not seen in C or C++.

* It is probably unfair to compare my original development experience of Iksemel with this Rust port on technical merits alone, as there is a two-decade gap in time and experience between them. I had a ton of fun back then. Rust made me feel the same joy of learning and experimenting today.

Comments

Popular posts from this blog

Solar Performance in Massachusetts

Boza