xml-rs, an XML library for Rust
===============================

[CI status](https://github.com/kornelski/xml-rs/actions/workflows/main.yml)
[![crates.io][crates-io-img]](https://lib.rs/crates/xml-rs)
[![docs][docs-img]](https://docs.rs/xml-rs/)

[Documentation](https://docs.rs/xml-rs/)

[crates-io-img]: https://img.shields.io/crates/v/xml-rs.svg
[docs-img]: https://img.shields.io/badge/docs-latest%20release-6495ed.svg

xml-rs is an XML library for the [Rust](https://www.rust-lang.org/) programming language.
It supports reading and writing of XML documents in a streaming fashion (without DOM).

### Features

* XML spec conformance better than other pure-Rust libraries.

* Easy to use API based on `Iterator`s and regular `String`s without tricky lifetimes.

* Support for UTF-16, UTF-8, ISO-8859-1, and ASCII encodings.

* Written entirely in the safe Rust subset. Designed to safely handle untrusted input.


The API is heavily inspired by the Java Streaming API for XML ([StAX][stax]). It contains a pull parser much like the StAX event reader. It provides an iterator API, so you can leverage Rust's existing iterator library features.

[stax]: https://en.wikipedia.org/wiki/StAX

It also provides a streaming document writer much like the StAX event writer.
This writer consumes its own set of events, but reader events can be converted to
writer events easily, so it is possible to write XML transformation chains in a fairly
clean manner.

This parser is mostly full-featured; however, there are limitations:
* Legacy code pages and non-Unicode encodings are not supported;
* DTD validation is not supported (but entities defined in the internal subset are supported);
* attribute value normalization is not performed, and end-of-line characters are not normalized either.

Other than that, the parser tries to be mostly XML-1.1-compliant.

The writer is also mostly full-featured, with the following limitations:
* no support for encodings other than UTF-8;
* no support for emitting `<!DOCTYPE>` declarations;
* more validations of input are needed, for example, checking that namespace prefixes are bound
  or comments are well-formed.

Building and using
------------------

xml-rs uses [Cargo](https://crates.io), so add it with `cargo add xml` or modify `Cargo.toml`:

```toml
[dependencies]
xml = "0.8.16"
```

The package exposes a single crate called `xml`.

Reading XML documents
---------------------

[`xml::reader::EventReader`][EventReader] requires a [`Read`][stdread] instance to read from. It can be a `File` wrapped in `BufReader`, a `Vec<u8>` wrapped in `Cursor`, or a `&[u8]` slice.

[EventReader]: https://docs.rs/xml-rs/latest/xml/reader/struct.EventReader.html
[stdread]: https://doc.rust-lang.org/stable/std/io/trait.Read.html

`EventReader` implements the `IntoIterator` trait, so you can use it in a `for` loop directly:

```rust,no_run
use std::fs::File;
use std::io::BufReader;

use xml::reader::{EventReader, XmlEvent};

fn main() -> std::io::Result<()> {
    let file = File::open("file.xml")?;
    let file = BufReader::new(file); // Buffering is important for performance

    let parser = EventReader::new(file);
    let mut depth = 0;
    for e in parser {
        match e {
            Ok(XmlEvent::StartElement { name, .. }) => {
                println!("{:spaces$}+{name}", "", spaces = depth * 2);
                depth += 1;
            }
            Ok(XmlEvent::EndElement { name }) => {
                depth -= 1;
                println!("{:spaces$}-{name}", "", spaces = depth * 2);
            }
            Err(e) => {
                eprintln!("Error: {e}");
                break;
            }
            // There's more: https://docs.rs/xml-rs/latest/xml/reader/enum.XmlEvent.html
            _ => {}
        }
    }

    Ok(())
}
```

Document parsing can end normally or with an error. Regardless of the exact cause, the parsing
process will be stopped, and the iterator will terminate normally.

You can also have finer control over when to pull the next event from the parser using its own
`next()` method:

```rust,ignore
match parser.next() {
    ...
}
```

Upon reaching the end of the document or an error, the parser will remember the last event and will
always return it from subsequent `next()` calls. If the iterator is used, it will yield the
error or end-of-document event once and will produce `None` afterwards.

It is also possible to tweak the parsing process a little using the [`xml::reader::ParserConfig`][ParserConfig] structure.
See its documentation for more information and examples.

[ParserConfig]: https://docs.rs/xml-rs/latest/xml/reader/struct.ParserConfig.html

You can find a more extensive example of using `EventReader` in `src/analyze.rs`, a small
program that shows various statistics about a specified XML document (it is built with
`cargo build` and can be run after that). It can also be used to check XML documents for
well-formedness: if a document is not well-formed, this program will exit with an error.


## Parsing untrusted inputs

The parser is written in the safe Rust subset, so by Rust's guarantees the worst that it can do is to cause a panic.
You can use `ParserConfig` to set limits on the maximum lengths of names, attributes, text, entities, etc.
You should also set a maximum document size via `io::Read`'s [`take(max)`](https://doc.rust-lang.org/stable/std/io/trait.Read.html#method.take) method.

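The size limit can be sketched with the standard library alone. In the example below, `limit_input` is a hypothetical helper and the 16-byte limit is an arbitrary value chosen for demonstration; in real use, the returned reader would be passed to `EventReader::new` in place of the raw source.

```rust
use std::io::{self, Read};

/// Hypothetical helper: wraps a reader so that at most `max_bytes` can be read from it.
fn limit_input<R: Read>(source: R, max_bytes: u64) -> io::Take<R> {
    source.take(max_bytes)
}

fn main() -> io::Result<()> {
    let untrusted: &[u8] = b"<root>some very long document</root>";
    let mut limited = limit_input(untrusted, 16);

    // Reading stops once the limit is reached, even though more input exists.
    let mut seen = Vec::new();
    limited.read_to_end(&mut seen)?;
    println!("read {} of {} bytes", seen.len(), untrusted.len());
    Ok(())
}
```
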
Writing XML documents
---------------------

xml-rs also provides a streaming writer much like the StAX event writer. With it you can write an
XML document to any `Write` implementor.

```rust,no_run
use std::io;
use xml::writer::{EmitterConfig, XmlEvent};

/// A simple demo syntax where "+foo" makes `<foo>`, "-foo" makes `</foo>`
fn make_event_from_line(line: &str) -> XmlEvent {
    let line = line.trim();
    if let Some(name) = line.strip_prefix("+") {
        XmlEvent::start_element(name).into()
    } else if line.starts_with("-") {
        XmlEvent::end_element().into()
    } else {
        XmlEvent::characters(line).into()
    }
}

fn main() -> io::Result<()> {
    let input = io::stdin();
    let output = io::stdout();
    let mut writer = EmitterConfig::new()
        .perform_indent(true)
        .create_writer(output);

    let mut line = String::new();
    loop {
        line.clear();
        let bytes_read = input.read_line(&mut line)?;
        if bytes_read == 0 {
            break; // EOF
        }

        let event = make_event_from_line(&line);
        if let Err(e) = writer.write(event) {
            panic!("Write error: {e}")
        }
    }
    Ok(())
}
```

The code example above also demonstrates how to create a writer out of its configuration.
A similar approach also works with `EventReader`.

The library provides an XML event building DSL which helps to construct complex events,
e.g. ones having namespace definitions. Some examples:

```rust,ignore
// <a:hello a:param="value" xmlns:a="urn:some:document">
XmlEvent::start_element("a:hello").attr("a:param", "value").ns("a", "urn:some:document")

// <hello b:config="name" xmlns="urn:default:uri">
XmlEvent::start_element("hello").attr("b:config", "name").default_ns("urn:default:uri")

// <![CDATA[some unescaped text]]>
XmlEvent::cdata("some unescaped text")
```

Of course, one can create `XmlEvent` enum variants directly instead of using the builder DSL.
There are more examples in the [`xml::writer::XmlEvent`][XmlEvent] documentation.

[XmlEvent]: https://docs.rs/xml-rs/latest/xml/reader/enum.XmlEvent.html

The writer has multiple configuration options; see the [`EmitterConfig`][EmitterConfig] documentation for more
information.

[EmitterConfig]: https://docs.rs/xml-rs/latest/xml/writer/struct.EmitterConfig.html

Bug reports
------------

Please report issues at: <https://github.com/kornelski/xml-rs/issues>.
