iksemel rusted
I had written some Rust code before, but it was for command-line tools, Advent of Code solutions, and similar small-scale programs. I was looking for a suitable project to delve deeper into Rust and decided to rewrite an old C project of mine in Rust!
Iksemel was a fast and easy-to-use XML/XMPP library which I developed over two decades ago. It was written in highly portable C and used in all sorts of places, including: an instant messenger on the Amiga home computer; an embedded sensor with 64 kilobytes of memory that posted measurements to users on Google Talk; the Scientific and Technological Research Council of Turkey's Linux distribution's Pisi Package Manager via its Python bindings; and a bunch of Symbian devices.
Did you know that Google Talk used to speak the open XMPP protocol before it was enshittified?
I haven't worked much on it recently since neither the XML format nor the XMPP instant messaging protocol is widely used anymore. But it was perfect for the task. The document tree was managed with self-referential and intrusive data structures with custom memory management, a challenge to the borrow checker! There were lots of low-level tricks for performance. It was also a complete package with tests, documentation, tools, and Python bindings, hence many areas to explore.
You can find the shiny new Iksemel Rust code here and the preliminary Python bindings here.
The rewrite took longer than I expected. The official Rust documentation is decent for someone coming from high-level languages, and perhaps even for those with no programming background, but it is not enough for C folks. The necessary information is spread between blog posts, issue comments, source code, and the collective wisdom of the community. This motivated me to write this post.
Despite the learning phase, the overall development process was much nicer and faster than the C version! I remember constant crashes and hard-to-debug bugs. The Rust version was immediately crash-free despite the use of unsafe, most of the code worked like a charm on the first try, and debugging in rare cases was a breeze. This was a result of the combination of the strong type system, memory safety features, and the comprehensive test tooling of Rust.
I haven't had a chance to profile and optimize the actual parsing code, or take advantage of SIMD and other modern hardware yet, but the performance looks excellent already. All benchmarks are lies, so take this with a grain of salt, but it is faster than libxml2 and even the read-only roxmltree crate on several giant Wikipedia dumps I have tested with.
xmlThrDefParserInputBufferCreateFilenameDefault is an actual function name of libxml2.
Of the Borrow Checker
clone(), to_string(), etc., is usually the compiler's suggestion and general beginner advice. If the performance loss caused by this were generally acceptable, programs wouldn't be written in C in the first place, so this is not a solution C programmers would settle for. However, even in Iksemel, there were a few places where it did the job perfectly.Rc or Arc. If shared/interior mutability is desired, RefCell or Mutex exists. Pin can help with self-references. There are other goodies in the standard library and crates. They are well-implemented and their performance costs are negligible everywhere except under the most demanding hot loops. The majority of programs have no reason to look further beyond this step.Of Callbacks and Events
typedef int (iksTagHook)(void *user_data, char *name, char **atts, int type); typedef int (iksCDataHook)(void *user_data, char *data, size_t len); iksparser *iks_sax_new (void *user_data, iksTagHook *tagHook, iksCDataHook *cdataHook); int iks_parse (iksparser *prs, const char *data, size_t len, int finish);
user_data to allow upper level parsers to maintain their state. Now, there are ways to implement these sorts of callbacks with Fn/FnMut/FnOnce and generics, but I realized how limiting this kind of API is going to be while playing with some examples, and implemented this instead:pub fn parse_bytes<'a>( &'a mut self, bytes: &'a [u8], ) -> Result<Option<(SaxElement<'a>, usize)>, ParseError>
None saying that all bytes are processed without producing a complete element yet.pub fn parse_bytes(&mut self, bytes: &[u8]) -> Result<(), ParseError> {} pub fn into_document(mut self) -> Result<Document, ParseError> {}
Rust's ability to efficiently return complex types makes it easy to implement this kind of APIs, which are highly suitable for Sans-IO.
Of Lending Iterators
Iterator trait for this so it could be used with a for loop; unfortunately, the trait does not support returning an item with a lifetime tied to the iterator itself.Iterator trait. Here is the implementation:impl<'a> SaxElements<'a> { pub fn new(parser: &'a mut SaxParser, bytes: &'a [u8]) -> Self { ... } pub fn next(&mut self) -> Option<Result<SaxElement<'_>, ParseError>> { ... } }
for loop, but it is still very easy to use with while let:let mut elements = parser.elements(b"<doc>example</doc>"); while let Some(result) = elements.next() { println!("Element parsed: {:?}", result?); }
Of Self-Referential Structs
Iksemel stores the XML data inside a custom memory arena containing two bump-allocator areas: one for the element structs and one for the character data. Container structs are allocated within the arena chunks, and the element structs have internal pointers forming the tree structure, which makes navigation fast and easy. Reducing the allocation count and packing similar objects together like this has a huge performance impact, but it cannot be done with Safe Rust at the moment.
There are some crates like Bumpalo which implement bump allocation for the general case, but this wasn't enough for Iksemel. There are also a couple of crates for implementing self-referential structs, but I found they obscured what was going on due to heavy macro use, and they weren't flexible enough either.
Of Unsafe
This is how Safe Rust is possible in the first place: by encapsulating the dangerous parts in an impossible-to-misuse abstraction. This API has two jobs: keeping the internal invariants of the unsafe code true, and not leaking anything to the outside which breaks the invariants of Safe Rust.
As an example, the XML document tree is represented as the struct Document, and any editing and navigation operations return a struct Cursor to move around the document and perform operations. The Document contains all the data, and the Cursor is essentially a reference to a certain node of the document tree.
let doc = Document::fromStr("<a><b>lala</b></a>"); let cursor: Cursor = doc.find_tag("b").first_child(); assert_eq!(cursor.cdata(), "lala");
cursor is pointing to it. Thankfully, Rust already solves this problem with lifetimes:pub struct Cursor<'a> { node: *mut Node, arena: &'a Arena, }
impl Document { pub fn find_tag<'a>(&'a self, name: &str) -> Cursor<'a> { // ... } }
PhantomData<Document> to link the lifetimes without an actual data member. Now it has a reference to the Arena member of the Document for reasons explained later in this post. Either way, this API ensures that the returned Cursor cannot stay alive longer than the Document. When the Document is dropped and the memory is freed, all the pointers into it are already gone.
iks *document, *cursor;
int e; document = iks_tree("<a><b>lala</b></a>", &e); if (IKS_OK == e) { cursor = iks_find(document, "b"); iks_delete(document); printf("crash here = %s", iks_name(cursor)); }
Another internal invariant is that no modifications via other cursors can invalidate an existing Cursor. This is ensured by the design which leaves the modified or deleted nodes intact in place and just hides them by changing internal links. These links can only be followed by one Cursor at a time, which is further ensured by not declaring cursors as Sync.
External invariants are a little bit more complicated. Thankfully, there is enough guidance in the Rust Reference and the Rustonomicon book. You might have noticed that Cursor::cdata returns a string reference to the internal buffer. It uses the &str type to avoid copying. Therefore, it follows &str invariant rules by: a) attaching its lifetime to the Cursor to prevent early free, b) ensuring that the returned pointer is never null, and c) ensuring all the strings inside the document character data memory are valid UTF-8 strings at all times.
Iksemel does use raw pointers, but they are all typed. When there is a need to calculate the size and alignment of an object to make space, the Layout module from the standard library is used instead of making assumptions. No conversion between a pointer and an integer is done. Pointer arithmetic to find space in the arena uses *const::add_byte and never jumps from one allocation area to another. std::mem::transmute is never used. The same memory area is never mapped to different structs.
The main theme between these avoided operations is that they challenge the compiler's model of data representation in memory. While there are rare cases in the standard library where they are needed, they are considerably riskier than regular unsafe use.
compile_fail flag like this:/// Cursor clone cannot outlive the Document: /// ```compile_fail /// let c2: Cursor; /// { /// let doc = Document::from_str("<a><b/></a>")?; /// let c1 = doc.root(); /// c2 = c1.clone(); /// } /// println!("{}", c2);
Unfortunately, you cannot specify the expected error, so occasionally you have to run cargo test -- --no-capture to check that they are not failing for other reasons. I haven't tried it yet, but this could be a good candidate to automate via snapshot testing.
This check would have been completely unnecessary in Safe Rust because the borrow checker wouldn't let you return anything from c1 which references its own data without proper lifetime association. But since the compiler cannot check the intention behind the raw pointers, a simple typo in the method signature in unsafe Rust can detach two lifetimes from each other with disastrous results.
It is quite slow; Iksemel's test suite takes a second for the CI run, but takes 18 minutes with Miri. Still, the release action of the Iksemel crate is gated on a successful Miri run, and I run Miri whenever I make changes to the unsafe code as well.
cargo fmt apply a highly opinionated style to everything. It helps when reading other people's code, using examples, and making unexpected things stand out amongst the surrounding lines. The last bit is especially important here since unsafe code depends on human review.Rust also has a ton of lints to check problematic patterns and non-idiomatic code. Some of these are not always applicable, but it is generally easier to enable all and give permission with the expect attribute when you have a good reason to ignore them.
Iksemel enables all of them except the pedantic and restriction categories, and then cherry-picks some of those as well:
#![deny(clippy::suspicious)] #![deny(clippy::complexity)] #![deny(clippy::perf)] #![deny(clippy::style)] #![deny(clippy::cargo)] #![deny(clippy::items_after_statements)] #![deny(clippy::missing_panics_doc)] #![deny(clippy::uninlined_format_args)] #![deny(clippy::unnecessary_semicolon)] #![deny(clippy::unreadable_literal)] #![deny(clippy::allow_attributes_without_reason)] #![deny(clippy::panic)] #![deny(clippy::partial_pub_fields)] #![deny(clippy::redundant_test_prefix)]
allow_attributes_without_reason lint is particularly useful. It forces you to explain the rationale whenever you have to ignore lints like this:#[expect( clippy::inherent_to_string_shadow_display, reason = "prereserving exact capacity makes this method significantly faster" )] fn to_string(&self) -> String { ... }
Of Mutation Testing
// in the attribute removal code if (*tag).last_attribute == attr { (*tag).last_attribute = (*attr).previous; } // in the attribute insertion code if !(*tag).last_attribute.is_null() { (*(*tag).last_attribute).next = attribute; (*attribute).previous = (*tag).last_attribute; } (*tag).last_attribute = attribute;
I was surprised when cargo-mutants showed a miss for the following mutation:
if (*tag).last_attribute != attr {
To get full coverage for the four branches in these two if statements, it is sufficient to test adding three attributes, removing the middle one, and then removing the last one. I had more tests than that, but the crucial scenario of removing the last element and then adding it back was forgotten!
Note that the bug exposed by this mutation depends on the order of operations, and mutant testing was able to find the testing gap quite easily.
Of Interior Mutability
Cursor API several times as I played with it while implementing higher-level parsers and command-line tools. The first iteration was like this:#[repr(transparent)] struct Cursor<'document> { node: *mut Node; _phantom: PhantomData<Document>; }
Arena structure, which was inside the Document. The C version didn't have the Cursor and Document separation; every node contained an arena pointer and the same iks* type referred to both a complete document and individual nodes, but that approach wasted a pointer per node. So, to avoid going back, I tried the following:let doc = Document::from_str("</test>"); doc.root().insert_cdata(&doc, "hello")?.append_cdata(&doc, " world!")?;
A worthy effort, but futile! The problem was the cursor lifetime we connected to the document. The borrow checker sees that and does not allow the document to be reborrowed by the same owner. Running away from the fight but keeping the lifetime safety, I changed it to:
pub struct Cursor<'a> { node: *mut Node, arena: &'a Arena, }
It felt like the structure size increased at first, but in practice, the compiler can easily analyze and optimize out the arena reference copying.
At this point, astute readers have already noticed that there is something missing with the arena reference, or with the Cursor itself, for that matter. Both are shared references! So how can they allocate memory or edit the document?
It is quite useful in practice to have two or more cursors pointing to different elements while traversing and editing XML documents. This doesn't play well with the borrow checker's "only one mutable reference" rule and requires "interior mutability", which is usually implemented with a RefCell or related standard library facilities.
Raw pointers don't have this limitation, so all I had to do was not expose the mutability difference to the outside via the methods of Arena and Cursor.
Of Concurrency
The Document is Send but not Sync, which means it can be sent across threads but can only be accessed inside one thread at a time. If there is a need to access it from multiple threads simultaneously, it is easy to put it into an Arc<Mutex<Document>>, but that does not extend to the cursors. It is generally necessary to remember the last position in scenarios where processing happens in multiple steps. An XMPP server thread might append the newly received elements into a document's last position for the currently handled connection, for example.
Another case is releasing a Cursor or a Document into Python. Lifetimes and marker traits for thread safety do not exist in Python. This wasn't a big issue in the C version because Python was relatively safer than C anyway, and the free-threaded interpreter did not exist.
Unfortunately, implementing a multi-threaded cursor requires access to the internals of the Document and unsafe use, so it has to be provided by the Iksemel crate to keep it safe and contained.
The solution is a separate SyncCursor which you can create by consuming a Document. This is necessary to guarantee that no references to the document are alive, and no further thread-unsafe references can be created afterward. It uses Arc and Mutex internally for thread-safe access. The overhead is negligible and well worth the safety in multi-threaded operations or for Python.
Of Strings
Rust strings are amazing! You could argue that they can be implemented in C, and in fact, Iksemel stored the length of every string to avoid recalculating it repeatedly. However, it also had to null-terminate them because the user would eventually need to pass a char* pointer to functions expecting a null terminator.
char *iks_stack_strcat (ikstack *s, char *old, size_t old_len, const char *src, size_t src_len);
pub fn concat_str<'a>(&'a self, old_s: &str, s: &str) -> Result<&'a str, NoMemory>
&str the caller provides since extending the string does not need to remove a null terminator, and all inserted strings remain immutable even while being extended.I was a bit worried initially about the Display trait. If I used that to convert the document tree into XML text, would it be fast enough? My concern was about the formatter object being passed to the trait, and the extra function call indirections going through that instead of just using String::push. To my amazement, the generated assembly code in release build mode and speed tests matched the direct implementation exactly! The Rust compiler basically removed all the indirections.
One detail that required intervention was avoiding extra allocations for the final string object:
fn to_string(&self) -> String { let mut buf = String::with_capacity(self.str_size());
str_size method and reserving that capacity before serializing gives a measurable speed boost even for smallish documents, and a very large boost for big ones.Display trait.Of the Standard Library
Speaking of the standard library, let's talk about the elephant in the room. The Rust standard library is a bit too small, a sentiment shared by many people I've spoken with at community events.
As an example, both C and Python have a getpass function to read a password from the user, but in Rust, you need to depend on a third-party crate which isn't well-maintained. This was used in the included iksjab utility, which lets you send XMPP messages and perform other tasks from the command line.
The name resolver was a bit awkward too, requiring Iksemel to inject the default XMPP port into the address provided by the user before doing the lookup, instead of allowing it to be specified or omitted like literally every other resolver API. I was able to avoid another dependency with:
pub(super) fn need_port(host: &str) -> bool { let column_pos = host.rfind(':'); let bracket_pos = host.find(']'); match (column_pos, bracket_pos) { (None, None) | (None, Some(_)) => true, (Some(_), None) => false, (Some(column), Some(bracket)) => column < bracket, } } fn resolve_host_with_default_port( host: &str, default_port: u16, ) -> std::io::Result<std::vec::IntoIter<SocketAddr>> { if need_port(host) { (host, default_port).to_socket_addrs() } else { host.to_socket_addrs() } }
rfind?).Thankfully, Iksemel was mostly self-sufficient, so this is more of an issue when you are coming from a language like Python. When you need something, it is extremely easy to add it via Cargo, but while there are dozens of crates for every problem, their quality is all over the place, and evaluating them is difficult.
Of Python Bindings
$ maturin new $ maturin generate-ci github > .github/workflows/CI.yml
#[pyclass], #[pymethods], and similar decorators around my functions, and boom! A working set of Python bindings.Of Modules
sax.rs sax/parser.rs sax/tests.rs document.rs document/synccursor.rs document/tests.rs
This is probably the only decision I don't agree with Rust amongst thousands. Thankfully, the original structure is still supported:
sax/mod.rs sax/parser.rs sax/tests.rs document/mod.rs document/synccursor.rs document/tests.rs
This mimics the __init__.py or index.ts usage from other modern languages and makes manual navigation much easier. Many crates also seem to follow this convention.
Of the Aftermath
Can we count this a productivity win for Rust, if C programmers don't bother with any of those though?
Comments
Post a Comment