| /*! |
| A tutorial for handling CSV data in Rust. |
| |
| This tutorial will cover basic CSV reading and writing, automatic |
| (de)serialization with Serde, CSV transformations and performance. |
| |
| This tutorial is targeted at beginner Rust programmers. Experienced Rust |
| programmers may find this tutorial to be too verbose, but skimming may be |
| useful. There is also a |
| [cookbook](../cookbook/index.html) |
| of examples for those that prefer more information density. |
| |
| For an introduction to Rust, please see the |
| [official book](https://doc.rust-lang.org/book/second-edition/). |
| If you haven't written any Rust code yet but have written code in another |
| language, then this tutorial might be accessible to you without needing to read |
| the book first. |
| |
| # Table of contents |
| |
| 1. [Setup](#setup) |
| 1. [Basic error handling](#basic-error-handling) |
| * [Switch to recoverable errors](#switch-to-recoverable-errors) |
| 1. [Reading CSV](#reading-csv) |
| * [Reading headers](#reading-headers) |
| * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records) |
| * [Reading with Serde](#reading-with-serde) |
| * [Handling invalid data with Serde](#handling-invalid-data-with-serde) |
| 1. [Writing CSV](#writing-csv) |
| * [Writing tab separated values](#writing-tab-separated-values) |
| * [Writing with Serde](#writing-with-serde) |
| 1. [Pipelining](#pipelining) |
| * [Filter by search](#filter-by-search) |
| * [Filter by population count](#filter-by-population-count) |
| 1. [Performance](#performance) |
| * [Amortizing allocations](#amortizing-allocations) |
| * [Serde and zero allocation](#serde-and-zero-allocation) |
| * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library) |
| 1. [Closing thoughts](#closing-thoughts) |
| |
| # Setup |
| |
In this section, we'll get you set up with a simple program that reads CSV data
| and prints a "debug" version of each record. This assumes that you have the |
| [Rust toolchain installed](https://www.rust-lang.org/install.html), |
| which includes both Rust and Cargo. |
| |
| We'll start by creating a new Cargo project: |
| |
| ```text |
| $ cargo new --bin csvtutor |
| $ cd csvtutor |
| ``` |
| |
| Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add |
| `csv = "1.1"` to your `[dependencies]` section. At this point, your |
| `Cargo.toml` should look something like this: |
| |
| ```text |
| [package] |
| name = "csvtutor" |
| version = "0.1.0" |
| authors = ["Your Name"] |
| |
| [dependencies] |
| csv = "1.1" |
| ``` |
| |
| Next, let's build your project. Since you added the `csv` crate as a |
| dependency, Cargo will automatically download it and compile it for you. To |
| build your project, use Cargo: |
| |
| ```text |
| $ cargo build |
| ``` |
| |
| This will produce a new binary, `csvtutor`, in your `target/debug` directory. |
| It won't do much at this point, but you can run it: |
| |
| ```text |
| $ ./target/debug/csvtutor |
| Hello, world! |
| ``` |
| |
| Let's make our program do something useful. Our program will read CSV data on |
| stdin and print debug output for each record on stdout. To write this program, |
| open `src/main.rs` in your favorite text editor and replace its contents with |
| this: |
| |
| ```no_run |
| //tutorial-setup-01.rs |
| // Import the standard library's I/O module so we can read from stdin. |
| use std::io; |
| |
| // The `main` function is where your program starts executing. |
| fn main() { |
| // Create a CSV parser that reads data from stdin. |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| // Loop over each record. |
| for result in rdr.records() { |
| // An error may occur, so abort the program in an unfriendly way. |
| // We will make this more friendly later! |
| let record = result.expect("a CSV record"); |
| // Print a debug version of the record. |
| println!("{:?}", record); |
| } |
| } |
| ``` |
| |
| Don't worry too much about what this code means; we'll dissect it in the next |
| section. For now, try rebuilding your project: |
| |
| ```text |
| $ cargo build |
| ``` |
| |
| Assuming that succeeds, let's try running our program. But first, we will need |
| some CSV data to play with! For that, we will use a random selection of 100 |
| US cities, along with their population size and geographical coordinates. (We |
| will use this same CSV data throughout the entire tutorial.) To get the data, |
download it from GitHub:
| |
| ```text |
| $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv' |
| ``` |
| |
| And now finally, run your program on `uspop.csv`: |
| |
| ```text |
| $ ./target/debug/csvtutor < uspop.csv |
| StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"]) |
| StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"]) |
| StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"]) |
| # ... and much more |
| ``` |
| |
| # Basic error handling |
| |
| Since reading CSV data can result in errors, error handling is pervasive |
| throughout the examples in this tutorial. Therefore, we're going to spend a |
| little bit of time going over basic error handling, and in particular, fix |
| our previous example to show errors in a more friendly way. **If you're already |
| comfortable with things like `Result` and `try!`/`?` in Rust, then you can |
| safely skip this section.** |
| |
| Note that |
| [The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/) |
| contains an |
| [introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html). |
| For a deeper dive, see |
| [my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/). |
| The blog post is especially important if you plan on building Rust libraries. |
| |
| With that out of the way, error handling in Rust comes in two different forms: |
| unrecoverable errors and recoverable errors. |
| |
| Unrecoverable errors generally correspond to things like bugs in your program, |
| which might occur when an invariant or contract is broken. At that point, the |
| state of your program is unpredictable, and there's typically little recourse |
| other than *panicking*. In Rust, a panic is similar to simply aborting your |
| program, but it will unwind the stack and clean up resources before your |
| program exits. |
| |
| On the other hand, recoverable errors generally correspond to predictable |
| errors. A non-existent file or invalid CSV data are examples of recoverable |
| errors. In Rust, recoverable errors are handled via `Result`. A `Result` |
| represents the state of a computation that has either succeeded or failed. |
| It is defined like so: |
| |
| ``` |
| enum Result<T, E> { |
| Ok(T), |
| Err(E), |
| } |
| ``` |
| |
| That is, a `Result` either contains a value of type `T` when the computation |
| succeeds, or it contains a value of type `E` when the computation fails. |
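For example, here's a hypothetical function (not part of our CSV program, just
for illustration) that either succeeds or fails depending on its input, along
with a `match` that inspects the `Result` it returns:

```
// A hypothetical computation that can fail: halving only works on even
// numbers.
fn halve(n: i32) -> Result<i32, String> {
    if n % 2 == 0 {
        Ok(n / 2)
    } else {
        // The computation failed, so describe why in the error value.
        Err(format!("{} is not even", n))
    }
}

fn main() {
    // Inspect the `Result` with a `match` expression.
    match halve(10) {
        Ok(n) => println!("half is {}", n),
        Err(err) => println!("error: {}", err),
    }
}
```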
| |
| The relationship between unrecoverable errors and recoverable errors is |
| important. In particular, it is **strongly discouraged** to treat recoverable |
| errors as if they were unrecoverable. For example, panicking when a file could |
| not be found, or if some CSV data is invalid, is considered bad practice. |
| Instead, predictable errors should be handled using Rust's `Result` type. |
| |
With our newfound knowledge, let's re-examine our previous example and dissect
| its error handling. |
| |
| ```no_run |
| //tutorial-error-01.rs |
| use std::io; |
| |
| fn main() { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.records() { |
| let record = result.expect("a CSV record"); |
| println!("{:?}", record); |
| } |
| } |
| ``` |
| |
There are two places where an error can occur in this program. The first is if
there is a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
| this tutorial, although robust command line applications should probably try |
| to handle it (e.g., when a broken pipe occurs). The former however is worth |
| looking into in more detail. For example, if a user of this program provides |
| invalid CSV data, then the program will panic: |
| |
| ```text |
| $ cat invalid |
| header1,header2 |
| foo,bar |
| quux,baz,foobar |
| $ ./target/debug/csvtutor < invalid |
| StringRecord(["foo", "bar"]) |
| thread 'main' panicked at 'a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 })', src/main.rs:13:29 |
| note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace |
| ``` |
| |
| What happened here? First and foremost, we should talk about why the CSV data |
| is invalid. The CSV data consists of three records: a header and two data |
| records. The header and first data record have two fields, but the second |
| data record has three fields. By default, the csv crate will treat inconsistent |
| record lengths as an error. |
| (This behavior can be toggled using the |
| [`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible) |
| config knob.) This explains why the first data record is printed in this |
| example, since it has the same number of fields as the header record. That is, |
| we don't actually hit an error until we parse the second data record. |
| |
| (Note that the CSV reader automatically interprets the first record as a |
| header. This can be toggled with the |
| [`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers) |
| config knob.) |
| |
| So what actually causes the panic to happen in our program? That would be the |
| first line in our loop: |
| |
| ```ignore |
| for result in rdr.records() { |
| let record = result.expect("a CSV record"); // this panics |
| println!("{:?}", record); |
| } |
| ``` |
| |
| The key thing to understand here is that `rdr.records()` returns an iterator |
| that yields `Result` values. That is, instead of yielding records, it yields |
| a `Result` that contains either a record or an error. The `expect` method, |
| which is defined on `Result`, *unwraps* the success value inside the `Result`. |
| Since the `Result` might contain an error instead, `expect` will *panic* when |
| it does contain an error. |
| |
| It might help to look at the implementation of `expect`: |
| |
| ```ignore |
| use std::fmt; |
| |
| // This says, "for all types T and E, where E can be turned into a human |
| // readable debug message, define the `expect` method." |
| impl<T, E: fmt::Debug> Result<T, E> { |
| fn expect(self, msg: &str) -> T { |
| match self { |
| Ok(t) => t, |
| Err(e) => panic!("{}: {:?}", msg, e), |
| } |
| } |
| } |
| ``` |
| |
| Since this causes a panic if the CSV data is invalid, and invalid CSV data is |
| a perfectly predictable error, we've turned what should be a *recoverable* |
| error into an *unrecoverable* error. We did this because it is expedient to |
| use unrecoverable errors. Since this is bad practice, we will endeavor to avoid |
| unrecoverable errors throughout the rest of the tutorial. |
| |
| ## Switch to recoverable errors |
| |
We'll convert our unrecoverable error to a recoverable error in three steps.
First,
| let's get rid of the panic and print an error message manually: |
| |
| ```no_run |
| //tutorial-error-02.rs |
| use std::{io, process}; |
| |
| fn main() { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.records() { |
| // Examine our Result. |
| // If there was no problem, print the record. |
| // Otherwise, print the error message and quit the program. |
| match result { |
| Ok(record) => println!("{:?}", record), |
| Err(err) => { |
| println!("error reading CSV from <stdin>: {}", err); |
| process::exit(1); |
| } |
| } |
| } |
| } |
| ``` |
| |
| If we run our program again, we'll still see an error message, but it is no |
| longer a panic message: |
| |
| ```text |
| $ cat invalid |
| header1,header2 |
| foo,bar |
| quux,baz,foobar |
| $ ./target/debug/csvtutor < invalid |
| StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] } |
| error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields |
| ``` |
| |
| The second step for moving to recoverable errors is to put our CSV record loop |
| into a separate function. This function then has the option of *returning* an |
| error, which our `main` function can then inspect and decide what to do with. |
| |
| ```no_run |
| //tutorial-error-03.rs |
| use std::{error::Error, io, process}; |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.records() { |
| // Examine our Result. |
| // If there was no problem, print the record. |
| // Otherwise, convert our error to a Box<dyn Error> and return it. |
| match result { |
| Err(err) => return Err(From::from(err)), |
| Ok(record) => { |
| println!("{:?}", record); |
| } |
| } |
| } |
| Ok(()) |
| } |
| ``` |
| |
| Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In |
| simple terms, this says that `run` either returns nothing when successful, or |
| if an error occurred, it returns a `Box<dyn Error>`, which stands for "any kind of |
error." A `Box<dyn Error>` is hard to inspect if we care about the specific error
that occurred. But for our purposes, all we need to do is gracefully print an
| error message and exit the program. |
| |
| The third and final step is to replace our explicit `match` expression with a |
| special Rust language feature: the question mark. |
| |
| ```no_run |
| //tutorial-error-04.rs |
| use std::{error::Error, io, process}; |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.records() { |
| // This is effectively the same code as our `match` in the |
| // previous example. In other words, `?` is syntactic sugar. |
| let record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| ``` |
| |
This last step shows how we can use the `?` operator to automatically forward
errors to our caller without having to do explicit case analysis with `match`
ourselves. We will use `?` heavily throughout this tutorial, and it's
| important to note that it can **only be used in functions that return |
| `Result`.** |
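To make that restriction concrete, here's a small standalone sketch (the
`parse_port` helper is hypothetical, invented just for this illustration):

```
use std::num::ParseIntError;

// `?` is allowed here because this function returns a `Result`. On failure,
// the error from `parse` is forwarded to the caller.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    let port: u16 = s.parse()?;
    Ok(port)
}

fn main() {
    // This `main` returns `()`, so `?` can't be used in it. Instead, we
    // inspect the `Result` explicitly.
    match parse_port("8080") {
        Ok(port) => println!("port: {}", port),
        Err(err) => println!("invalid port: {}", err),
    }
}
```

(Since Rust 2018, `main` itself may also return a `Result`, which makes `?`
usable there as well.)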
| |
| We'll end this section with a word of caution: using `Box<dyn Error>` as our error |
| type is the minimally acceptable thing we can do here. Namely, while it allows |
| our program to gracefully handle errors, it makes it hard for callers to |
| inspect the specific error condition that occurred. However, since this is a |
| tutorial on writing command line programs that do CSV parsing, we will consider |
| ourselves satisfied. If you'd like to know more, or are interested in writing |
| a library that handles CSV data, then you should check out my |
| [blog post on error handling](http://blog.burntsushi.net/rust-error-handling/). |
| |
| With all that said, if all you're doing is writing a one-off program to do |
| CSV transformations, then using methods like `expect` and panicking when an |
| error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial |
| will endeavor to show idiomatic code. |
| |
| # Reading CSV |
| |
Now that we've got you set up and covered basic error handling, it's time to do
| what we came here to do: handle CSV data. We've already seen how to read |
| CSV data from `stdin`, but this section will cover how to read CSV data from |
files and how to configure our CSV reader to read data formatted with different
| delimiters and quoting strategies. |
| |
| First up, let's adapt the example we've been working with to accept a file |
| path argument instead of stdin. |
| |
| ```no_run |
| //tutorial-read-01.rs |
| use std::{ |
| env, |
| error::Error, |
| ffi::OsString, |
| fs::File, |
| process, |
| }; |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let file_path = get_first_arg()?; |
| let file = File::open(file_path)?; |
| let mut rdr = csv::Reader::from_reader(file); |
| for result in rdr.records() { |
| let record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| |
| /// Returns the first positional argument sent to this process. If there are no |
| /// positional arguments, then this returns an error. |
| fn get_first_arg() -> Result<OsString, Box<dyn Error>> { |
| match env::args_os().nth(1) { |
| None => Err(From::from("expected 1 argument, but got none")), |
| Some(file_path) => Ok(file_path), |
| } |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| If you replace the contents of your `src/main.rs` file with the above code, |
| then you should be able to rebuild your project and try it out: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor uspop.csv |
| StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"]) |
| StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"]) |
| StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"]) |
| # ... and much more |
| ``` |
| |
| This example contains two new pieces of code: |
| |
| 1. Code for querying the positional arguments of your program. We put this code |
| into its own function called `get_first_arg`. Our program expects a file |
| path in the first position (which is indexed at `1`; the argument at index |
| `0` is the executable name), so if one doesn't exist, then `get_first_arg` |
| returns an error. |
| 2. Code for opening a file. In `run`, we open a file using `File::open`. If |
| there was a problem opening the file, we forward the error to the caller of |
| `run` (which is `main` in this program). Note that we do *not* wrap the |
| `File` in a buffer. The CSV reader does buffering internally, so there's |
| no need for the caller to do it. |
| |
| Now is a good time to introduce an alternate CSV reader constructor, which |
| makes it slightly more convenient to open CSV data from a file. That is, |
| instead of: |
| |
| ```ignore |
| let file_path = get_first_arg()?; |
| let file = File::open(file_path)?; |
| let mut rdr = csv::Reader::from_reader(file); |
| ``` |
| |
| you can use: |
| |
| ```ignore |
| let file_path = get_first_arg()?; |
| let mut rdr = csv::Reader::from_path(file_path)?; |
| ``` |
| |
| `csv::Reader::from_path` will open the file for you and return an error if |
| the file could not be opened. |
| |
| ## Reading headers |
| |
| If you had a chance to look at the data inside `uspop.csv`, you would notice |
| that there is a header record that looks like this: |
| |
| ```text |
| City,State,Population,Latitude,Longitude |
| ``` |
| |
| Now, if you look back at the output of the commands you've run so far, you'll |
| notice that the header record is never printed. Why is that? By default, the |
| CSV reader will interpret the first record in CSV data as a header, which |
| is typically distinct from the actual data in the records that follow. |
| Therefore, the header record is always skipped whenever you try to read or |
| iterate over the records in CSV data. |
| |
| The CSV reader does not try to be smart about the header record and does |
| **not** employ any heuristics for automatically detecting whether the first |
| record is a header or not. Instead, if you don't want to treat the first record |
| as a header, you'll need to tell the CSV reader that there are no headers. |
| |
| To configure a CSV reader to do this, we'll need to use a |
| [`ReaderBuilder`](../struct.ReaderBuilder.html) |
| to build a CSV reader with our desired configuration. Here's an example that |
| does just that. (Note that we've moved back to reading from `stdin`, since it |
| produces terser examples.) |
| |
| ```no_run |
| //tutorial-read-headers-01.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::ReaderBuilder::new() |
| .has_headers(false) |
| .from_reader(io::stdin()); |
| for result in rdr.records() { |
| let record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| If you compile and run this program with our `uspop.csv` data, then you'll see |
| that the header record is now printed: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop.csv |
| StringRecord(["City", "State", "Population", "Latitude", "Longitude"]) |
| StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"]) |
| StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"]) |
| StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"]) |
| ``` |
| |
| If you ever need to access the header record directly, then you can use the |
[`Reader::headers`](../struct.Reader.html#method.headers)
| method like so: |
| |
| ```no_run |
| //tutorial-read-headers-02.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| { |
| // We nest this call in its own scope because of lifetimes. |
| let headers = rdr.headers()?; |
| println!("{:?}", headers); |
| } |
| for result in rdr.records() { |
| let record = result?; |
| println!("{:?}", record); |
| } |
| // We can ask for the headers at any time. There's no need to nest this |
| // call in its own scope because we never try to borrow the reader again. |
| let headers = rdr.headers()?; |
| println!("{:?}", headers); |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| One interesting thing to note in this example is that we put the call to |
| `rdr.headers()` in its own scope. We do this because `rdr.headers()` returns |
| a *borrow* of the reader's internal header state. The nested scope in this |
| code allows the borrow to end before we try to iterate over the records. If |
| we didn't nest the call to `rdr.headers()` in its own scope, then the code |
| wouldn't compile because we cannot borrow the reader's headers at the same time |
| that we try to borrow the reader to iterate over its records. |
| |
| Another way of solving this problem is to *clone* the header record: |
| |
| ```ignore |
| let headers = rdr.headers()?.clone(); |
| ``` |
| |
| This converts it from a borrow of the CSV reader to a new owned value. This |
| makes the code a bit easier to read, but at the cost of copying the header |
| record into a new allocation. |
| |
| ## Delimiters, quotes and variable length records |
| |
| In this section we'll temporarily depart from our `uspop.csv` data set and |
| show how to read some CSV data that is a little less clean. This CSV data |
| uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has |
| records of varying length. Here's the data, which contains a list of WWE |
| wrestlers and the year they started, if it's known: |
| |
| ```text |
| $ cat strange.csv |
| "\"Hacksaw\" Jim Duggan";1987 |
| "Bret \"Hit Man\" Hart";1984 |
| # We're not sure when Rafael started, so omit the year. |
| Rafael Halperin |
| "\"Big Cat\" Ernie Ladd";1964 |
| "\"Macho Man\" Randy Savage";1985 |
| "Jake \"The Snake\" Roberts";1986 |
| ``` |
| |
| To read this CSV data, we'll want to do the following: |
| |
| 1. Disable headers, since this data has none. |
| 2. Change the delimiter from `,` to `;`. |
| 3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`). |
| 4. Permit flexible length records, since some omit the year. |
| 5. Ignore lines beginning with a `#`. |
| |
| All of this (and more!) can be configured with a |
| [`ReaderBuilder`](../struct.ReaderBuilder.html), |
| as seen in the following example: |
| |
| ```no_run |
| //tutorial-read-delimiter-01.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::ReaderBuilder::new() |
| .has_headers(false) |
| .delimiter(b';') |
| .double_quote(false) |
| .escape(Some(b'\\')) |
| .flexible(true) |
| .comment(Some(b'#')) |
| .from_reader(io::stdin()); |
| for result in rdr.records() { |
| let record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Now re-compile your project and try running the program on `strange.csv`: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < strange.csv |
| StringRecord(["\"Hacksaw\" Jim Duggan", "1987"]) |
| StringRecord(["Bret \"Hit Man\" Hart", "1984"]) |
| StringRecord(["Rafael Halperin"]) |
| StringRecord(["\"Big Cat\" Ernie Ladd", "1964"]) |
| StringRecord(["\"Macho Man\" Randy Savage", "1985"]) |
| StringRecord(["Jake \"The Snake\" Roberts", "1986"]) |
| ``` |
| |
| You should feel encouraged to play around with the settings. Some interesting |
| things you might try: |
| |
| 1. If you remove the `escape` setting, notice that no CSV errors are reported. |
| Instead, records are still parsed. This is a feature of the CSV parser. Even |
| though it gets the data slightly wrong, it still provides a parse that you |
| might be able to work with. This is a useful property given the messiness |
| of real world CSV data. |
| 2. If you remove the `delimiter` setting, parsing still succeeds, although |
| every record has exactly one field. |
| 3. If you remove the `flexible` setting, the reader will print the first two |
| records (since they both have the same number of fields), but will return a |
| parse error on the third record, since it has only one field. |
| |
| This covers most of the things you might want to configure on your CSV reader, |
| although there are a few other knobs. For example, you can change the record |
| terminator from a new line to any other character. (By default, the terminator |
| is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record |
| terminators.) For more details, see the documentation and examples for each of |
| the methods on |
| [`ReaderBuilder`](../struct.ReaderBuilder.html). |
| |
| ## Reading with Serde |
| |
| One of the most convenient features of this crate is its support for |
| [Serde](https://serde.rs/). |
| Serde is a framework for automatically serializing and deserializing data into |
| Rust types. In simpler terms, that means instead of iterating over records |
| as an array of string fields, we can iterate over records of a specific type |
| of our choosing. |
| |
| For example, let's take a look at some data from our `uspop.csv` file: |
| |
| ```text |
| City,State,Population,Latitude,Longitude |
| Davidsons Landing,AK,,65.2419444,-165.2716667 |
| Kenai,AK,7610,60.5544444,-151.2583333 |
| ``` |
| |
| While some of these fields make sense as strings (`City`, `State`), other |
| fields look more like numbers. For example, `Population` looks like it contains |
| integers while `Latitude` and `Longitude` appear to contain decimals. If we |
wanted to convert these fields to their "proper" types, then we'd need to do
| a lot of manual work. This next example shows how. |
| |
| ```no_run |
| //tutorial-read-serde-01.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.records() { |
| let record = result?; |
| |
| let city = &record[0]; |
| let state = &record[1]; |
| // Some records are missing population counts, so if we can't |
| // parse a number, treat the population count as missing instead |
| // of returning an error. |
| let pop: Option<u64> = record[2].parse().ok(); |
| // Lucky us! Latitudes and longitudes are available for every record. |
| // Therefore, if one couldn't be parsed, return an error. |
| let latitude: f64 = record[3].parse()?; |
| let longitude: f64 = record[4].parse()?; |
| |
| println!( |
| "city: {:?}, state: {:?}, \ |
| pop: {:?}, latitude: {:?}, longitude: {:?}", |
| city, state, pop, latitude, longitude); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| The problem here is that we need to parse each individual field manually, which |
| can be labor intensive and repetitive. Serde, however, makes this process |
| automatic. For example, we can ask to deserialize every record into a tuple |
| type: `(String, String, Option<u64>, f64, f64)`. |
| |
| ```no_run |
| //tutorial-read-serde-02.rs |
| # use std::{error::Error, io, process}; |
| # |
| // This introduces a type alias so that we can conveniently reference our |
| // record type. |
| type Record = (String, String, Option<u64>, f64, f64); |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| // Instead of creating an iterator with the `records` method, we create |
| // an iterator with the `deserialize` method. |
| for result in rdr.deserialize() { |
| // We must tell Serde what type we want to deserialize into. |
| let record: Record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Running this code should show similar output as previous examples: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop.csv |
| ("Davidsons Landing", "AK", None, 65.2419444, -165.2716667) |
| ("Kenai", "AK", Some(7610), 60.5544444, -151.2583333) |
| ("Oakman", "AL", None, 33.7133333, -87.3886111) |
| # ... and much more |
| ``` |
| |
| One of the downsides of using Serde this way is that the type you use must |
| match the order of fields as they appear in each record. This can be a pain |
| if your CSV data has a header record, since you might tend to think about each |
| field as a value of a particular named field rather than as a numbered field. |
One way to access fields by name is to deserialize each record into a map type
like
| [`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html) |
| or |
| [`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html). |
| The next example shows how, and in particular, notice that the only thing that |
| changed from the last example is the definition of the `Record` type alias and |
| a new `use` statement that imports `HashMap` from the standard library: |
| |
| ```no_run |
| //tutorial-read-serde-03.rs |
| use std::collections::HashMap; |
| # use std::{error::Error, io, process}; |
| |
| // This introduces a type alias so that we can conveniently reference our |
| // record type. |
| type Record = HashMap<String, String>; |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.deserialize() { |
| let record: Record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Running this program shows similar results as before, but each record is |
| printed as a map: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop.csv |
| {"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"} |
| {"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"} |
| {"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"} |
| ``` |
| |
| This method works especially well if you need to read CSV data with header |
| records, but whose exact structure isn't known until your program runs. |
| However, in our case, we know the structure of the data in `uspop.csv`. In |
| particular, with the `HashMap` approach, we've lost the specific types we had |
| for each field in the previous example when we deserialized each record into a |
| `(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields |
| by their corresponding header name *and* assign each field its own unique |
| type? The answer is yes, but we'll need to bring in Serde's `derive` feature |
| first. You can do that by adding this to the `[dependencies]` section of your |
| `Cargo.toml` file: |
| |
| ```text |
| serde = { version = "1", features = ["derive"] } |
| ``` |
| |
With this dependency added to our project, we can now define our own custom
struct that represents our record. We then ask Serde to automatically write
the glue code required to populate our struct from a CSV record. The next
example shows how. Don't miss the new Serde import!
| |
| ```no_run |
| //tutorial-read-serde-04.rs |
| # #![allow(dead_code)] |
| # use std::{error::Error, io, process}; |
| |
| // This lets us write `#[derive(Deserialize)]`. |
| use serde::Deserialize; |
| |
// Deriving `Debug` isn't required (and has nothing to do with Serde), but
// it's a good habit to do it for all your types.
| // |
| // Notice that the field names in this struct are NOT in the same order as |
| // the fields in the CSV data! |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record { |
| latitude: f64, |
| longitude: f64, |
| population: Option<u64>, |
| city: String, |
| state: String, |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.deserialize() { |
| let record: Record = result?; |
| println!("{:?}", record); |
| // Try this if you don't like each record smushed on one line: |
| // println!("{:#?}", record); |
| } |
| Ok(()) |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| Compile and run this program to see similar output as before: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop.csv |
| Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| ``` |
| |
| Once again, we didn't need to change our `run` function at all: we're still |
| iterating over records using the `deserialize` iterator that we started with |
| in the beginning of this section. The only thing that changed in this example |
| was the definition of the `Record` type and a new `use` statement. Our `Record` |
| type is now a custom struct that we defined instead of a type alias, and as a |
result, Serde doesn't know how to deserialize it by default. However, Serde
provides a *derive macro* (a procedural macro) that reads your struct
definition at compile time and generates code to deserialize a CSV record
into a `Record` value. To see what happens if you leave out the automatic
derive, change `#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`.
| |
| One other thing worth mentioning in this example is the use of |
| `#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your |
| struct's field names to the header names in the CSV data. If you recall, our |
| header record is: |
| |
| ```text |
| City,State,Population,Latitude,Longitude |
| ``` |
| |
| Notice that each name is capitalized, but the fields in our struct are not. The |
| `#[serde(rename_all = "PascalCase")]` directive fixes that by interpreting each |
| field in `PascalCase`, where the first letter of the field is capitalized. If |
we hadn't told Serde about the name remapping, then the program would quit
with an error:
| |
| ```text |
| $ ./target/debug/csvtutor < uspop.csv |
| CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude` |
| ``` |
| |
| We could have fixed this through other means. For example, we could have used |
| capital letters in our field names: |
| |
| ```ignore |
| #[derive(Debug, Deserialize)] |
| struct Record { |
| Latitude: f64, |
| Longitude: f64, |
| Population: Option<u64>, |
| City: String, |
| State: String, |
| } |
| ``` |
| |
| However, this violates Rust naming style. (In fact, the Rust compiler |
| will even warn you that the names do not follow convention!) |
| |
| Another way to fix this is to ask Serde to rename each field individually. This |
| is useful when there is no consistent name mapping from fields to header names: |
| |
| ```ignore |
| #[derive(Debug, Deserialize)] |
| struct Record { |
| #[serde(rename = "Latitude")] |
| latitude: f64, |
| #[serde(rename = "Longitude")] |
| longitude: f64, |
| #[serde(rename = "Population")] |
| population: Option<u64>, |
| #[serde(rename = "City")] |
| city: String, |
| #[serde(rename = "State")] |
| state: String, |
| } |
| ``` |
| |
| To read more about renaming fields and about other Serde directives, please |
| consult the |
| [Serde documentation on attributes](https://serde.rs/attributes.html). |
| |
| ## Handling invalid data with Serde |
| |
| In this section we will see a brief example of how to deal with data that isn't |
| clean. To do this exercise, we'll work with a slightly tweaked version of the |
| US population data we've been using throughout this tutorial. This version of |
| the data is slightly messier than what we've been using. You can get it like |
| so: |
| |
| ```text |
| $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv' |
| ``` |
| |
| Let's start by running our program from the previous section: |
| |
| ```no_run |
| //tutorial-read-serde-invalid-01.rs |
| # #![allow(dead_code)] |
| # use std::{error::Error, io, process}; |
| # |
| # use serde::Deserialize; |
| # |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record { |
| latitude: f64, |
| longitude: f64, |
| population: Option<u64>, |
| city: String, |
| state: String, |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.deserialize() { |
| let record: Record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Compile and run it on our messier data: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop-null.csv |
| Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| # ... more records |
| CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string |
| ``` |
| |
| Oops! What happened? The program printed several records, but stopped when it |
| tripped over a deserialization problem. The error message says that it found |
| an invalid digit in the field at index `2` (which is the `Population` field) |
| on line 43. What does line 43 look like? |
| |
| ```text |
| $ head -n 43 uspop-null.csv | tail -n1 |
| Flint Springs,KY,NULL,37.3433333,-86.7136111 |
| ``` |
| |
| Ah! The third field (index `2`) is supposed to either be empty or contain a |
| population count. However, in this data, it seems that `NULL` sometimes appears |
| as a value, presumably to indicate that there is no count available. |
| |
| The problem with our current program is that it fails to read this record |
| because it doesn't know how to deserialize a `NULL` string into an |
`Option<u64>`. That is, an `Option<u64>` corresponds to either an empty field
| or an integer. |
| |
| To fix this, we tell Serde to convert any deserialization errors on this field |
| to a `None` value, as shown in this next example: |
| |
| ```no_run |
| //tutorial-read-serde-invalid-02.rs |
| # #![allow(dead_code)] |
| # use std::{error::Error, io, process}; |
| # |
| # use serde::Deserialize; |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record { |
| latitude: f64, |
| longitude: f64, |
| #[serde(deserialize_with = "csv::invalid_option")] |
| population: Option<u64>, |
| city: String, |
| state: String, |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| for result in rdr.deserialize() { |
| let record: Record = result?; |
| println!("{:?}", record); |
| } |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| If you compile and run this example, then it should run to completion just |
| like the other examples: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor < uspop-null.csv |
| Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| # ... and more |
| ``` |
| |
| The only change in this example was adding this attribute to the `population` |
| field in our `Record` type: |
| |
| ```ignore |
| #[serde(deserialize_with = "csv::invalid_option")] |
| ``` |
| |
| The |
| [`invalid_option`](../fn.invalid_option.html) |
| function is a generic helper function that does one very simple thing: when |
| applied to `Option` fields, it will convert any deserialization error into a |
| `None` value. This is useful when you need to work with messy CSV data. |
| |
| # Writing CSV |
| |
| In this section we'll show a few examples that write CSV data. Writing CSV data |
tends to be a bit more straightforward than reading CSV data, since you get to
| control the output format. |
| |
| Let's start with the most basic example: writing a few CSV records to `stdout`. |
| |
| ```no_run |
| //tutorial-write-01.rs |
| use std::{error::Error, io, process}; |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| // Since we're writing records manually, we must explicitly write our |
| // header record. A header record is written the same way that other |
| // records are written. |
| wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| |
| // A CSV writer maintains an internal buffer, so it's important |
| // to flush the buffer when you're done. |
| wtr.flush()?; |
| Ok(()) |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| Compiling and running this example results in CSV data being printed: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor |
| City,State,Population,Latitude,Longitude |
| Davidsons Landing,AK,,65.2419444,-165.2716667 |
| Kenai,AK,7610,60.5544444,-151.2583333 |
| Oakman,AL,,33.7133333,-87.3886111 |
| ``` |
| |
| Before moving on, it's worth taking a closer look at the `write_record` |
| method. In this example, it looks rather simple, but if you're new to Rust then |
| its type signature might look a little daunting: |
| |
| ```ignore |
| pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()> |
| where I: IntoIterator<Item=T>, T: AsRef<[u8]> |
| { |
| // implementation elided |
| } |
| ``` |
| |
| To understand the type signature, we can break it down piece by piece. |
| |
| 1. The method takes two parameters: `self` and `record`. |
| 2. `self` is a special parameter that corresponds to the `Writer` itself. |
| 3. `record` is the CSV record we'd like to write. Its type is `I`, which is |
| a generic type. |
| 4. In the method's `where` clause, the `I` type is constrained by the |
| `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the |
| `IntoIterator` trait. If you look at the documentation of the |
| [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html), |
| then we can see that it describes types that can build iterators. In this |
| case, we want an iterator that yields *another* generic type `T`, where |
| `T` is the type of each field we want to write. |
| 5. `T` also appears in the method's `where` clause, but its constraint is the |
| `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero cost |
| conversions between types in Rust. In this case, the `[u8]` in `AsRef<[u8]>` |
| means that we want to be able to *borrow* a slice of bytes from `T`. |
| The CSV writer will take these bytes and write them as a single field. |
| The `AsRef<[u8]>` bound is useful because types like `String`, `&str`, |
| `Vec<u8>` and `&[u8]` all satisfy it. |
| 6. Finally, the method returns a `csv::Result<()>`, which is short-hand for |
| `Result<(), csv::Error>`. That means `write_record` either returns nothing |
| on success or returns a `csv::Error` on failure. |
| |
| Now, let's apply our new found understanding of the type signature of |
| `write_record`. If you recall, in our previous example, we used it like so: |
| |
| ```ignore |
| wtr.write_record(&["field 1", "field 2", "etc"])?; |
| ``` |
| |
| So how do the types match up? Well, the type of each of our fields in this |
| code is `&'static str` (which is the type of a string literal in Rust). Since |
| we put them in a slice literal, the type of our parameter is |
| `&'static [&'static str]`, or more succinctly written as `&[&str]` without the |
| lifetime annotations. Since slices satisfy the `IntoIterator` bound and |
| strings satisfy the `AsRef<[u8]>` bound, this ends up being a legal call. |
| |
| Here are a few more examples of ways you can call `write_record`: |
| |
| ```no_run |
| # use csv; |
| # let mut wtr = csv::Writer::from_writer(vec![]); |
| // A slice of byte strings. |
| wtr.write_record(&[b"a", b"b", b"c"]); |
| // A vector. |
| wtr.write_record(vec!["a", "b", "c"]); |
| // A string record. |
| wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"])); |
| // A byte record. |
| wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"])); |
| ``` |
| |
| Finally, the example above can be easily adapted to write to a file instead |
| of `stdout`: |
| |
| ```no_run |
| //tutorial-write-02.rs |
| use std::{ |
| env, |
| error::Error, |
| ffi::OsString, |
| process, |
| }; |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let file_path = get_first_arg()?; |
| let mut wtr = csv::Writer::from_path(file_path)?; |
| |
| wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| |
| wtr.flush()?; |
| Ok(()) |
| } |
| |
| /// Returns the first positional argument sent to this process. If there are no |
| /// positional arguments, then this returns an error. |
| fn get_first_arg() -> Result<OsString, Box<dyn Error>> { |
| match env::args_os().nth(1) { |
| None => Err(From::from("expected 1 argument, but got none")), |
| Some(file_path) => Ok(file_path), |
| } |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| ## Writing tab separated values |
| |
| In the previous section, we saw how to write some simple CSV data to `stdout` |
| that looked like this: |
| |
| ```text |
| City,State,Population,Latitude,Longitude |
| Davidsons Landing,AK,,65.2419444,-165.2716667 |
| Kenai,AK,7610,60.5544444,-151.2583333 |
| Oakman,AL,,33.7133333,-87.3886111 |
| ``` |
| |
| You might wonder to yourself: what's the point of using a CSV writer if the |
| data is so simple? Well, the benefit of a CSV writer is that it can handle all |
types of data without sacrificing its integrity. That is, it knows
| when to quote fields that contain special CSV characters (like commas or new |
| lines) or escape literal quotes that appear in your data. The CSV writer can |
| also be easily configured to use different delimiters or quoting strategies. |
| |
In this section, we'll take a look at how to tweak some of the settings
| on a CSV writer. In particular, we'll write TSV ("tab separated values") |
| instead of CSV, and we'll ask the CSV writer to quote all non-numeric fields. |
| Here's an example: |
| |
| ```no_run |
| //tutorial-write-delimiter-01.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut wtr = csv::WriterBuilder::new() |
| .delimiter(b'\t') |
| .quote_style(csv::QuoteStyle::NonNumeric) |
| .from_writer(io::stdout()); |
| |
| wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| |
| wtr.flush()?; |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Compiling and running this example gives: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor |
| "City" "State" "Population" "Latitude" "Longitude" |
| "Davidsons Landing" "AK" "" 65.2419444 -165.2716667 |
| "Kenai" "AK" 7610 60.5544444 -151.2583333 |
| "Oakman" "AL" "" 33.7133333 -87.3886111 |
| ``` |
| |
| In this example, we used a new type |
| [`QuoteStyle`](../enum.QuoteStyle.html). |
| The `QuoteStyle` type represents the different quoting strategies available |
| to you. The default is to add quotes to fields only when necessary. This |
| probably works for most use cases, but you can also ask for quotes to always |
| be put around fields, to never be put around fields or to always be put around |
| non-numeric fields. |
| |
| ## Writing with Serde |
| |
| Just like the CSV reader supports automatic deserialization into Rust types |
| with Serde, the CSV writer supports automatic serialization from Rust types |
| into CSV records using Serde. In this section, we'll learn how to use it. |
| |
| As with reading, let's start by seeing how we can serialize a Rust tuple. |
| |
| ```no_run |
| //tutorial-write-serde-01.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| |
| // We still need to write headers manually. |
| wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| |
| // But now we can write records by providing a normal Rust value. |
| // |
| // Note that the odd `None::<u64>` syntax is required because `None` on |
| // its own doesn't have a concrete type, but Serde needs a concrete type |
| // in order to serialize it. That is, `None` has type `Option<T>` but |
| // `None::<u64>` has type `Option<u64>`. |
| wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?; |
| wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?; |
| wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?; |
| |
| wtr.flush()?; |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Compiling and running this program gives the expected output: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor |
| City,State,Population,Latitude,Longitude |
| Davidsons Landing,AK,,65.2419444,-165.2716667 |
| Kenai,AK,7610,60.5544444,-151.2583333 |
| Oakman,AL,,33.7133333,-87.3886111 |
| ``` |
| |
| The key thing to note in the above example is the use of `serialize` instead |
| of `write_record` to write our data. In particular, `write_record` is used |
| when writing a simple record that contains string-like data only. On the other |
| hand, `serialize` is used when your data consists of more complex values like |
integers, floats or optional values. Of course, you could always convert the
| complex values to strings and then use `write_record`, but Serde can do it for |
| you automatically. |
| |
| As with reading, we can also serialize custom structs as CSV records. As a |
| bonus, the fields in a struct will automatically be written as a header |
| record! |
| |
| To write custom structs as CSV records, we'll need to make use of Serde's |
| automatic `derive` feature again. As in the |
| [previous section on reading with Serde](#reading-with-serde), |
| we'll need to add a couple crates to our `[dependencies]` section in our |
| `Cargo.toml` (if they aren't already there): |
| |
| ```text |
| serde = { version = "1", features = ["derive"] } |
| ``` |
| |
| And we'll also need to add a new `use` statement to our code, for Serde, as |
| shown in the example: |
| |
| ```no_run |
| //tutorial-write-serde-02.rs |
| use std::{error::Error, io, process}; |
| |
| use serde::Serialize; |
| |
| // Note that structs can derive both Serialize and Deserialize! |
| #[derive(Debug, Serialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record<'a> { |
| city: &'a str, |
| state: &'a str, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| |
| wtr.serialize(Record { |
| city: "Davidsons Landing", |
| state: "AK", |
| population: None, |
| latitude: 65.2419444, |
| longitude: -165.2716667, |
| })?; |
| wtr.serialize(Record { |
| city: "Kenai", |
| state: "AK", |
| population: Some(7610), |
| latitude: 60.5544444, |
| longitude: -151.2583333, |
| })?; |
| wtr.serialize(Record { |
| city: "Oakman", |
| state: "AL", |
| population: None, |
| latitude: 33.7133333, |
| longitude: -87.3886111, |
| })?; |
| |
| wtr.flush()?; |
| Ok(()) |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| Compiling and running this example has the same output as last time, even |
| though we didn't explicitly write a header record: |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor |
| City,State,Population,Latitude,Longitude |
| Davidsons Landing,AK,,65.2419444,-165.2716667 |
| Kenai,AK,7610,60.5544444,-151.2583333 |
| Oakman,AL,,33.7133333,-87.3886111 |
| ``` |
| |
| In this case, the `serialize` method noticed that we were writing a struct |
| with field names. When this happens, `serialize` will automatically write a |
| header record (only if no other records have been written) that consists of |
| the fields in the struct in the order in which they are defined. Note that |
| this behavior can be disabled with the |
| [`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers) |
| method. |
| |
| It's also worth pointing out the use of a *lifetime parameter* in our `Record` |
| struct: |
| |
| ```ignore |
| struct Record<'a> { |
| city: &'a str, |
| state: &'a str, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| ``` |
| |
| The `'a` lifetime parameter corresponds to the lifetime of the `city` and |
| `state` string slices. This says that the `Record` struct contains *borrowed* |
| data. We could have written our struct without borrowing any data, and |
| therefore, without any lifetime parameters: |
| |
| ```ignore |
| struct Record { |
| city: String, |
| state: String, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| ``` |
| |
| However, since we had to replace our borrowed `&str` types with owned `String` |
| types, we're now forced to allocate a new `String` value for both of `city` |
| and `state` for every record that we write. There's no intrinsic problem with |
| doing that, but it might be a bit wasteful. |
| |
| For more examples and more details on the rules for serialization, please see |
| the |
| [`Writer::serialize`](../struct.Writer.html#method.serialize) |
| method. |
| |
| # Pipelining |
| |
| In this section, we're going to cover a few examples that demonstrate programs |
| that take CSV data as input, and produce possibly transformed or filtered CSV |
| data as output. This shows how to write a complete program that efficiently |
| reads and writes CSV data. Rust is well positioned to perform this task, since |
| you'll get great performance with the convenience of a high level CSV library. |
| |
| ## Filter by search |
| |
| The first example of CSV pipelining we'll look at is a simple filter. It takes |
| as input some CSV data on stdin and a single string query as its only |
| positional argument, and it will produce as output CSV data that only contains |
| rows with a field that matches the query. |
| |
| ```no_run |
| //tutorial-pipeline-search-01.rs |
| use std::{env, error::Error, io, process}; |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| // Get the query from the positional arguments. |
| // If one doesn't exist, return an error. |
| let query = match env::args().nth(1) { |
| None => return Err(From::from("expected 1 argument, but got none")), |
| Some(query) => query, |
| }; |
| |
| // Build CSV readers and writers to stdin and stdout, respectively. |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| |
| // Before reading our data records, we should write the header record. |
| wtr.write_record(rdr.headers()?)?; |
| |
| // Iterate over all the records in `rdr`, and write only records containing |
| // `query` to `wtr`. |
| for result in rdr.records() { |
| let record = result?; |
| if record.iter().any(|field| field == &query) { |
| wtr.write_record(&record)?; |
| } |
| } |
| |
| // CSV writers use an internal buffer, so we should always flush when done. |
| wtr.flush()?; |
| Ok(()) |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| If we compile and run this program with a query of `MA` on `uspop.csv`, we'll |
| see that only one record matches: |
| |
| ```text |
| $ cargo build |
| $ ./csvtutor MA < uspop.csv |
| City,State,Population,Latitude,Longitude |
| Reading,MA,23441,42.5255556,-71.0958333 |
| ``` |
| |
| This example doesn't actually introduce anything new. It merely combines what |
| you've already learned about CSV readers and writers from previous sections. |
| |
| Let's add a twist to this example. In the real world, you're often faced with |
| messy CSV data that might not be encoded correctly. One example you might come |
| across is CSV data encoded in |
| [Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). |
| Unfortunately, for the examples we've seen so far, our CSV reader assumes that |
| all of the data is UTF-8. Since all of the data we've worked on has been |
| ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any |
| problems. But let's introduce a slightly tweaked version of our `uspop.csv` |
| file that contains an encoding of a Latin-1 character that is invalid UTF-8. |
| You can get the data like so: |
| |
| ```text |
| $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv' |
| ``` |
| |
Even though I've already given away the problem, let's see what happens when
| we try to run our previous example on this new data: |
| |
| ```text |
| $ ./csvtutor MA < uspop-latin1.csv |
| City,State,Population,Latitude,Longitude |
| CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0 |
| ``` |
| |
| The error message tells us exactly what's wrong. Let's take a look at line 4 |
| to see what we're dealing with: |
| |
| ```text |
| $ head -n4 uspop-latin1.csv | tail -n1 |
| Õakman,AL,,33.7133333,-87.3886111 |
| ``` |
| |
| In this case, the very first character is the Latin-1 `Õ`, which is encoded as |
| the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our |
| CSV parser has choked on our data? You have two choices. The first is to go in |
| and fix up your CSV data so that it's valid UTF-8. This is probably a good |
| idea anyway, and tools like `iconv` can help with the task of transcoding. |
| But if you can't or don't want to do that, then you can instead read CSV data |
| in a way that is mostly encoding agnostic (so long as ASCII is still a valid |
| subset). The trick is to use *byte records* instead of *string records*. |
| |
| Thus far, we haven't actually talked much about the type of a record in this |
| library, but now is a good time to introduce them. There are two of them, |
| [`StringRecord`](../struct.StringRecord.html) |
| and |
| [`ByteRecord`](../struct.ByteRecord.html). |
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.
| |
| Armed with that knowledge, we can now begin to understand why we saw an error |
| when we ran the last example on data that wasn't UTF-8. Namely, when we call |
| `records`, we get back an iterator of `StringRecord`. Since `StringRecord` is |
| guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid |
| UTF-8 will result in the error that we see. |
| |
| All we need to do to make our example work is to switch from a `StringRecord` |
| to a `ByteRecord`. This means using `byte_records` to create our iterator |
| instead of `records`, and similarly using `byte_headers` instead of `headers` |
| if we think our header data might contain invalid UTF-8 as well. Here's the |
| change: |
| |
| ```no_run |
| //tutorial-pipeline-search-02.rs |
| # use std::{env, error::Error, io, process}; |
| # |
| fn run() -> Result<(), Box<dyn Error>> { |
| let query = match env::args().nth(1) { |
| None => return Err(From::from("expected 1 argument, but got none")), |
| Some(query) => query, |
| }; |
| |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| |
| wtr.write_record(rdr.byte_headers()?)?; |
| |
| for result in rdr.byte_records() { |
| let record = result?; |
| // `query` is a `String` while `field` is now a `&[u8]`, so we'll |
| // need to convert `query` to `&[u8]` before doing a comparison. |
| if record.iter().any(|field| field == query.as_bytes()) { |
| wtr.write_record(&record)?; |
| } |
| } |
| |
| wtr.flush()?; |
| Ok(()) |
| } |
| # |
| # fn main() { |
| # if let Err(err) = run() { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| ``` |
| |
| Compiling and running this now yields the same results as our first example, |
| but this time it works on data that isn't valid UTF-8. |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor MA < uspop-latin1.csv
| City,State,Population,Latitude,Longitude |
| Reading,MA,23441,42.5255556,-71.0958333 |
| ``` |
| |
| ## Filter by population count |
| |
| In this section, we will show another example program that both reads and |
| writes CSV data, but instead of dealing with arbitrary records, we will use |
| Serde to deserialize and serialize records with specific types. |
| |
| For this program, we'd like to be able to filter records in our population data |
| by population count. Specifically, we'd like to see which records meet a |
| certain population threshold. In addition to using a simple inequality, we must |
| also account for records that have a missing population count. This is where |
| types like `Option<T>` come in handy, because the compiler will force us to |
| consider the case when the population count is missing. |
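|
| As a std-only sketch of the filtering logic we're about to write (the `keep`
| helper is hypothetical, not part of any API), `Option::map_or` lets us treat
| a missing count as "does not match":
|
| ```rust
| // Hypothetical helper: does this (possibly missing) population meet the bar?
| fn keep(population: Option<u64>, minimum: u64) -> bool {
|     // `None` (a missing count) never matches; `Some(n)` matches when
|     // n is at least the minimum.
|     population.map_or(false, |n| n >= minimum)
| }
|
| fn main() {
|     assert!(keep(Some(169_160), 100_000));
|     assert!(!keep(Some(23_441), 100_000));
|     assert!(!keep(None, 100_000));
| }
| ```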
| |
| Since we're using Serde in this example, don't forget to add the Serde |
| dependencies to your `Cargo.toml` in your `[dependencies]` section if they |
| aren't already there: |
| |
| ```text |
| serde = { version = "1", features = ["derive"] } |
| ``` |
| |
| Now here's the code: |
| |
| ```no_run |
| //tutorial-pipeline-pop-01.rs |
| # use std::{env, error::Error, io, process}; |
| |
| use serde::{Deserialize, Serialize}; |
| |
| // Unlike previous examples, we derive both Deserialize and Serialize. This |
| // means we'll be able to automatically deserialize and serialize this type. |
| #[derive(Debug, Deserialize, Serialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record { |
| city: String, |
| state: String, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| |
| fn run() -> Result<(), Box<dyn Error>> { |
| // Get the query from the positional arguments. |
| // If one doesn't exist or isn't an integer, return an error. |
| let minimum_pop: u64 = match env::args().nth(1) { |
| None => return Err(From::from("expected 1 argument, but got none")), |
| Some(arg) => arg.parse()?, |
| }; |
| |
| // Build CSV readers and writers to stdin and stdout, respectively. |
| // Note that we don't need to write headers explicitly. Since we're |
| // serializing a custom struct, that's done for us automatically. |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut wtr = csv::Writer::from_writer(io::stdout()); |
| |
| // Iterate over all the records in `rdr`, and write only records containing |
| // a population that is greater than or equal to `minimum_pop`. |
| for result in rdr.deserialize() { |
| // Remember that when deserializing, we must use a type hint to |
| // indicate which type we want to deserialize our record into. |
| let record: Record = result?; |
| |
|         // `map_or` is a combinator on `Option`. It takes two parameters:
| // a value to use when the `Option` is `None` (i.e., the record has |
| // no population count) and a closure that returns another value of |
| // the same type when the `Option` is `Some`. In this case, we test it |
| // against our minimum population count that we got from the command |
| // line. |
| if record.population.map_or(false, |pop| pop >= minimum_pop) { |
| wtr.serialize(record)?; |
| } |
| } |
| |
| // CSV writers use an internal buffer, so we should always flush when done. |
| wtr.flush()?; |
| Ok(()) |
| } |
| |
| fn main() { |
| if let Err(err) = run() { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| ``` |
| |
| If we compile and run our program with a minimum threshold of `100000`, we |
| should see three matching records. Notice that the headers were added even |
| though we never explicitly wrote them! |
| |
| ```text |
| $ cargo build |
| $ ./target/debug/csvtutor 100000 < uspop.csv |
| City,State,Population,Latitude,Longitude |
| Fontana,CA,169160,34.0922222,-117.4341667 |
| Bridgeport,CT,139090,41.1669444,-73.2052778 |
| Indianapolis,IN,773283,39.7683333,-86.1580556 |
| ``` |
| |
| # Performance |
| |
| In this section, we'll go over how to squeeze the most juice out of our CSV |
| reader. As it happens, most of the APIs we've seen so far were designed with |
| high level convenience in mind, and that often comes with some costs. For the |
| most part, those costs revolve around unnecessary allocations. Therefore, most |
| of the section will show how to do CSV parsing with as little allocation as |
| possible. |
| |
| There are two critical preliminaries we must cover. |
| |
| Firstly, when you care about performance, you should compile your code |
| with `cargo build --release` instead of `cargo build`. The `--release` |
| flag instructs the compiler to spend more time optimizing your code. When |
| compiling with the `--release` flag, you'll find your compiled program at |
| `target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this |
| tutorial, we've used `cargo build` because our dataset was small and we weren't |
| focused on speed. The downside of `cargo build --release` is that it will take |
| longer than `cargo build`. |
| |
| Secondly, the dataset we've used throughout this tutorial only has 100 records. |
| We'd have to try really hard to cause our program to run slowly on 100 records, |
| even when we compile without the `--release` flag. Therefore, in order to |
| actually witness a performance difference, we need a bigger dataset. To get |
| such a dataset, we'll use the original source of `uspop.csv`. **Warning: the |
| download is 41MB compressed and decompresses to 145MB.** |
| |
| ```text |
| $ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz |
| $ gunzip worldcitiespop.csv.gz |
| $ wc worldcitiespop.csv |
| 3173959 5681543 151492068 worldcitiespop.csv |
| $ md5sum worldcitiespop.csv |
| 6198bd180b6d6586626ecbf044c1cca5 worldcitiespop.csv |
| ``` |
| |
| Finally, it's worth pointing out that this section is not attempting to |
| present a rigorous set of benchmarks. We will stay away from rigorous analysis |
| and instead rely a bit more on wall clock times and intuition. |
| |
| ## Amortizing allocations |
| |
| In order to measure performance, we must be careful about what it is we're |
| measuring. We must also be careful to not change the thing we're measuring as |
| we make improvements to the code. For this reason, we will focus on measuring |
| how long it takes to count the number of records corresponding to city |
| population counts in Massachusetts. This represents a very small amount of work
| that requires us to visit every record, and is therefore a decent way to
| measure how long it takes to do CSV parsing.
| |
| Before diving into our first optimization, let's start with a baseline by |
| adapting a previous example to count the number of records in |
| `worldcitiespop.csv`: |
| |
| ```no_run |
| //tutorial-perf-alloc-01.rs |
| use std::{error::Error, io, process}; |
| |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| |
| let mut count = 0; |
| for result in rdr.records() { |
| let record = result?; |
| if &record[0] == "us" && &record[3] == "MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| |
| fn main() { |
| match run() { |
| Ok(count) => { |
| println!("{}", count); |
| } |
| Err(err) => { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| } |
| ``` |
| |
| Now let's compile and run it and see what kind of timing we get. Don't forget |
| to compile with the `--release` flag. (For grins, try compiling without the |
| `--release` flag and see how long it takes to run the program!) |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2176 |
| |
| real 0m0.645s |
| user 0m0.627s |
| sys 0m0.017s |
| ``` |
| |
| All right, so what's the first thing we can do to make this faster? This |
| section promised to speed things up by amortizing allocation, but we can do |
| something even simpler first: iterate over |
| [`ByteRecord`](../struct.ByteRecord.html)s |
| instead of |
| [`StringRecord`](../struct.StringRecord.html)s. |
| If you recall from a previous section, a `StringRecord` is guaranteed to be
| valid UTF-8, and therefore must validate that its contents are actually UTF-8.
| (If validation fails, then the CSV reader will return an error.) If we remove |
| that validation from our program, then we can realize a nice speed boost as |
| shown in the next example: |
| |
| ```no_run |
| //tutorial-perf-alloc-02.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| |
| let mut count = 0; |
| for result in rdr.byte_records() { |
| let record = result?; |
| if &record[0] == b"us" && &record[3] == b"MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| # |
| # fn main() { |
| # match run() { |
| # Ok(count) => { |
| # println!("{}", count); |
| # } |
| # Err(err) => { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| # } |
| ``` |
| |
| And now compile and run: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2176 |
| |
| real 0m0.429s |
| user 0m0.403s |
| sys 0m0.023s |
| ``` |
| |
| Our program is now approximately 30% faster, all because we removed UTF-8 |
| validation. But was it actually okay to remove UTF-8 validation? What have we |
| lost? In this case, it is perfectly acceptable to drop UTF-8 validation and use |
| `ByteRecord` instead because all we're doing with the data in the record is |
| comparing two of its fields to raw bytes: |
| |
| ```ignore |
| if &record[0] == b"us" && &record[3] == b"MA" { |
| count += 1; |
| } |
| ``` |
| |
| In particular, it doesn't matter whether `record` is valid UTF-8 or not, since |
| we're checking for equality on the raw bytes themselves. |
| |
| UTF-8 validation via `StringRecord` is useful because it provides access to
| fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
| `&str` is the type of a borrowed string in Rust, which provides convenient |
| access to string APIs like substring search. Strings are also frequently used |
| in other areas, so they tend to be a useful thing to have. Therefore, sticking |
| with `StringRecord` is a good default, but if you need the extra speed and can |
| deal with arbitrary bytes, then switching to `ByteRecord` might be a good idea. |
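|
| To make that convenience concrete, here's a std-only sketch: substring search
| is a single method call on `&str`, but takes manual work on `&[u8]`:
|
| ```rust
| fn main() {
|     // With &str, the standard library gives us substring search directly.
|     let s: &str = "Massachusetts";
|     assert!(s.contains("achu"));
|
|     // With &[u8], we have to roll it ourselves, e.g. with sliding windows.
|     let b: &[u8] = b"Massachusetts";
|     assert!(b.windows(4).any(|w| w == b"achu"));
| }
| ```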
| |
| Moving on, let's try to get another speed boost by amortizing allocation. |
| Amortizing allocation is the technique that creates an allocation once (or |
| very rarely), and then attempts to reuse it instead of creating additional |
| allocations. In the case of the previous examples, we used iterators created |
| by the `records` and `byte_records` methods on a CSV reader. These iterators |
| allocate a new record for every item they yield, which in turn corresponds
| to a new allocation. They do this because iterators cannot yield items that
| borrow from the iterator itself, and because creating new allocations tends to |
| be a lot more convenient. |
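|
| Before touching the CSV APIs, here's the general idea in a std-only sketch
| (an analogy only, not the `csv` crate's API): `BufRead::lines` allocates a
| fresh `String` for every line, while `read_line` reuses a single buffer:
|
| ```rust
| use std::io::{BufRead, Cursor};
|
| // Count lines ending in "MA", reusing one String buffer for every line.
| fn count_ma_amortized(data: &str) -> u64 {
|     let mut rdr = Cursor::new(data);
|     let mut buf = String::new(); // allocated once, reused below
|     let mut count = 0;
|     while rdr.read_line(&mut buf).unwrap() > 0 {
|         if buf.trim_end().ends_with("MA") {
|             count += 1;
|         }
|         buf.clear(); // keep the allocation, drop the contents
|     }
|     count
| }
|
| fn main() {
|     let data = "us,Boston,MA\nca,Toronto,ON\nus,Reading,MA\n";
|     // The `lines()` iterator allocates a fresh String per line...
|     let per_line_allocs = Cursor::new(data)
|         .lines()
|         .filter(|l| l.as_ref().unwrap().ends_with("MA"))
|         .count() as u64;
|     // ...while the amortized version reuses one buffer and agrees.
|     assert_eq!(count_ma_amortized(data), per_line_allocs);
|     assert_eq!(count_ma_amortized(data), 2);
| }
| ```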
| |
| If we're willing to forgo use of iterators, then we can amortize allocations |
| by creating a *single* `ByteRecord` and asking the CSV reader to read into it. |
| We do this by using the |
| [`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record) |
| method. |
| |
| ```no_run |
| //tutorial-perf-alloc-03.rs |
| # use std::{error::Error, io, process}; |
| # |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut record = csv::ByteRecord::new(); |
| |
| let mut count = 0; |
| while rdr.read_byte_record(&mut record)? { |
| if &record[0] == b"us" && &record[3] == b"MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| # |
| # fn main() { |
| # match run() { |
| # Ok(count) => { |
| # println!("{}", count); |
| # } |
| # Err(err) => { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| # } |
| ``` |
| |
| Compile and run: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2176 |
| |
| real 0m0.308s |
| user 0m0.283s |
| sys 0m0.023s |
| ``` |
| |
| Woohoo! This represents *another* 30% boost over the previous example, which is |
| a 50% boost over the first example. |
| |
| Let's dissect this code by taking a look at the type signature of the |
| `read_byte_record` method: |
| |
| ```ignore |
| fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>; |
| ``` |
| |
| This method takes as input a CSV reader (the `self` parameter) and a *mutable |
| borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The |
| `csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return |
| value is `true` if and only if a record was read. When it's `false`, that means |
| the reader has exhausted its input. This method works by copying the contents |
| of the next record into the provided `ByteRecord`. Since the same `ByteRecord` |
| is used to read every record, it will already have space allocated for data. |
| When `read_byte_record` runs, it will overwrite the contents that were there |
| with the new record, which means that it can reuse the space that was |
| allocated. Thus, we have *amortized allocation*. |
| |
| An exercise you might consider doing is to use a `StringRecord` instead of a |
| `ByteRecord`, and therefore |
| [`Reader::read_record`](../struct.Reader.html#method.read_record) |
| instead of `read_byte_record`. This will give you easy access to Rust strings |
| at the cost of UTF-8 validation but *without* the cost of allocating a new |
| `StringRecord` for every record. |
| |
| ## Serde and zero allocation |
| |
| In this section, we are going to briefly examine how we use Serde and what we |
| can do to speed it up. The key optimization we'll want to make is to---you |
| guessed it---amortize allocation. |
| |
| As with the previous section, let's start with a simple baseline based off an |
| example using Serde in a previous section: |
| |
| ```no_run |
| //tutorial-perf-serde-01.rs |
| # #![allow(dead_code)] |
| use std::{error::Error, io, process}; |
| |
| use serde::Deserialize; |
| |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record { |
| country: String, |
| city: String, |
| accent_city: String, |
| region: String, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| |
| let mut count = 0; |
| for result in rdr.deserialize() { |
| let record: Record = result?; |
| if record.country == "us" && record.region == "MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| |
| fn main() { |
| match run() { |
| Ok(count) => { |
| println!("{}", count); |
| } |
| Err(err) => { |
| println!("{}", err); |
| process::exit(1); |
| } |
| } |
| } |
| ``` |
| |
| Now compile and run this program: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv
| 2176 |
| |
| real 0m1.381s |
| user 0m1.367s |
| sys 0m0.013s |
| ``` |
| |
| The first thing you might notice is that this is quite a bit slower than our |
| programs in the previous section. This is because deserializing each record |
| has a certain amount of overhead to it. In particular, some of the fields need |
| to be parsed as integers or floating point numbers, which isn't free. However, |
| there is hope yet, because we can speed up this program! |
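|
| To make that overhead concrete, here's a std-only sketch of the per-field
| work that deserializing each `Record` implies (the literals are illustrative
| values taken from `uspop.csv`):
|
| ```rust
| fn main() {
|     // Each numeric field costs a parse on every single record.
|     let population: Option<u64> = "23441".parse().ok();
|     let latitude: f64 = "42.5255556".parse().unwrap();
|     assert_eq!(population, Some(23441));
|     assert!((latitude - 42.5255556).abs() < 1e-9);
| }
| ```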
| |
| Our first attempt to speed up the program will be to amortize allocation. Doing |
| this with Serde is a bit trickier than before, because we need to change our |
| `Record` type and use the manual deserialization API. Let's see what that looks |
| like: |
| |
| ```no_run |
| //tutorial-perf-serde-02.rs |
| # #![allow(dead_code)] |
| # use std::{error::Error, io, process}; |
| # use serde::Deserialize; |
| # |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record<'a> { |
| country: &'a str, |
| city: &'a str, |
| accent_city: &'a str, |
| region: &'a str, |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut raw_record = csv::StringRecord::new(); |
| let headers = rdr.headers()?.clone(); |
| |
| let mut count = 0; |
| while rdr.read_record(&mut raw_record)? { |
| let record: Record = raw_record.deserialize(Some(&headers))?; |
| if record.country == "us" && record.region == "MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| # |
| # fn main() { |
| # match run() { |
| # Ok(count) => { |
| # println!("{}", count); |
| # } |
| # Err(err) => { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| # } |
| ``` |
| |
| Compile and run: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv
| 2176 |
| |
| real 0m1.055s |
| user 0m1.040s |
| sys 0m0.013s |
| ``` |
| |
| This corresponds to an approximately 24% increase in performance. To achieve |
| this, we had to make two important changes. |
| |
| The first was to make our `Record` type contain `&str` fields instead of |
| `String` fields. If you recall from a previous section, `&str` is a *borrowed* |
| string, whereas a `String` is an *owned* string. A borrowed string points to
| an already existing allocation, whereas a `String` always implies a new
| allocation. In this case, our `&str` is borrowing from the CSV record itself. |
| |
| The second change we had to make was to stop using the |
| [`Reader::deserialize`](../struct.Reader.html#method.deserialize) |
| iterator, and instead read each record into a `StringRecord` explicitly
| and then use the |
| [`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize) |
| method to deserialize a single record. |
| |
| The second change is a bit tricky, because in order for it to work, our |
| `Record` type needs to borrow from the data inside the `StringRecord`. That |
| means that our `Record` value cannot outlive the `StringRecord` that it was |
| created from. Since we overwrite the same `StringRecord` on each iteration |
| (in order to amortize allocation), that means our `Record` value must evaporate |
| before the next iteration of the loop. Indeed, the compiler will enforce this! |
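|
| Here's a std-only sketch of that constraint (the `Row` type and `count_us`
| helper are hypothetical, not part of the `csv` crate): a value borrowing from
| a reused buffer must go out of scope before the buffer is overwritten, and
| the compiler checks this for us:
|
| ```rust
| // Hypothetical borrowed row, analogous to our Record<'a>.
| struct Row<'a> {
|     country: &'a str,
| }
|
| fn count_us(lines: &[&str]) -> u64 {
|     let mut buf = String::new(); // the reused buffer, like our StringRecord
|     let mut count = 0;
|     for line in lines {
|         buf.clear();
|         buf.push_str(line);
|         // `row` borrows from `buf`, so the compiler insists it dies before
|         // the next iteration clears `buf` out from under it.
|         let row = Row { country: &buf[..2] };
|         if row.country == "us" {
|             count += 1;
|         }
|     } // each `row` evaporated before `buf` was reused
|     count
| }
|
| fn main() {
|     assert_eq!(count_us(&["us,Boston,MA", "ca,Toronto,ON", "us,Reading,MA"]), 2);
| }
| ```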
| |
| There is one more optimization we can make: remove UTF-8 validation. In |
| general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead |
| of `StringRecord`: |
| |
| ```no_run |
| //tutorial-perf-serde-03.rs |
| # #![allow(dead_code)] |
| # use std::{error::Error, io, process}; |
| # |
| # use serde::Deserialize; |
| # |
| #[derive(Debug, Deserialize)] |
| #[serde(rename_all = "PascalCase")] |
| struct Record<'a> { |
| country: &'a [u8], |
| city: &'a [u8], |
| accent_city: &'a [u8], |
| region: &'a [u8], |
| population: Option<u64>, |
| latitude: f64, |
| longitude: f64, |
| } |
| |
| fn run() -> Result<u64, Box<dyn Error>> { |
| let mut rdr = csv::Reader::from_reader(io::stdin()); |
| let mut raw_record = csv::ByteRecord::new(); |
| let headers = rdr.byte_headers()?.clone(); |
| |
| let mut count = 0; |
| while rdr.read_byte_record(&mut raw_record)? { |
| let record: Record = raw_record.deserialize(Some(&headers))?; |
| if record.country == b"us" && record.region == b"MA" { |
| count += 1; |
| } |
| } |
| Ok(count) |
| } |
| # |
| # fn main() { |
| # match run() { |
| # Ok(count) => { |
| # println!("{}", count); |
| # } |
| # Err(err) => { |
| # println!("{}", err); |
| # process::exit(1); |
| # } |
| # } |
| # } |
| ``` |
| |
| Compile and run: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv
| 2176 |
| |
| real 0m0.873s |
| user 0m0.850s |
| sys 0m0.023s |
| ``` |
| |
| This corresponds to a 17% increase over the previous example and a 37% increase |
| over the first example. |
| |
| In sum, Serde parsing is still quite fast, but will generally not be the |
| fastest way to parse CSV since it necessarily needs to do more work. |
| |
| ## CSV parsing without the standard library |
| |
| In this section, we will explore a niche use case: parsing CSV without the |
| standard library. While the `csv` crate itself requires the standard library, |
| the underlying parser is actually part of the |
| [`csv-core`](https://docs.rs/csv-core) |
| crate, which does not depend on the standard library. The downside of not |
| depending on the standard library is that CSV parsing becomes a lot more |
| inconvenient. |
| |
| The `csv-core` crate is structured similarly to the `csv` crate. There is a |
| [`Reader`](../../csv_core/struct.Reader.html) |
| and a |
| [`Writer`](../../csv_core/struct.Writer.html), |
| as well as corresponding builders |
| [`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html) |
| and |
| [`WriterBuilder`](../../csv_core/struct.WriterBuilder.html). |
| The `csv-core` crate has no record types or iterators. Instead, CSV data |
| can either be read one field at a time or one record at a time. In this |
| section, we'll focus on reading a field at a time since it is simpler, but it |
| is generally faster to read a record at a time since it does more work per |
| function call. |
| |
| In keeping with this section on performance, let's write a program using only |
| `csv-core` that counts the number of records in the state of Massachusetts. |
| |
| (Note that we unfortunately use the standard library in this example even |
| though `csv-core` doesn't technically require it. We do this for convenient |
| access to I/O, which would be harder without the standard library.) |
| |
| ```no_run |
| //tutorial-perf-core-01.rs |
| use std::io::{self, Read}; |
| use std::process; |
| |
| use csv_core::{Reader, ReadFieldResult}; |
| |
| fn run(mut data: &[u8]) -> Option<u64> { |
| let mut rdr = Reader::new(); |
| |
| // Count the number of records in Massachusetts. |
| let mut count = 0; |
| // Indicates the current field index. Reset to 0 at start of each record. |
| let mut fieldidx = 0; |
| // True when the current record is in the United States. |
| let mut inus = false; |
| // Buffer for field data. Must be big enough to hold the largest field. |
| let mut field = [0; 1024]; |
| loop { |
| // Attempt to incrementally read the next CSV field. |
| let (result, nread, nwrite) = rdr.read_field(data, &mut field); |
| // nread is the number of bytes read from our input. We should never |
| // pass those bytes to read_field again. |
| data = &data[nread..]; |
| // nwrite is the number of bytes written to the output buffer `field`. |
|         // The contents of the buffer beyond `nwrite` are unspecified.
| let field = &field[..nwrite]; |
| |
| match result { |
| // We don't need to handle this case because we read all of the |
| // data up front. If we were reading data incrementally, then this |
| // would be a signal to read more. |
| ReadFieldResult::InputEmpty => {} |
| // If we get this case, then we found a field that contains more |
| // than 1024 bytes. We keep this example simple and just fail. |
| ReadFieldResult::OutputFull => { |
| return None; |
| } |
| // This case happens when we've successfully read a field. If the |
| // field is the last field in a record, then `record_end` is true. |
| ReadFieldResult::Field { record_end } => { |
| if fieldidx == 0 && field == b"us" { |
| inus = true; |
| } else if inus && fieldidx == 3 && field == b"MA" { |
| count += 1; |
| } |
| if record_end { |
| fieldidx = 0; |
| inus = false; |
| } else { |
| fieldidx += 1; |
| } |
| } |
| // This case happens when the CSV reader has successfully exhausted |
| // all input. |
| ReadFieldResult::End => { |
| break; |
| } |
| } |
| } |
| Some(count) |
| } |
| |
| fn main() { |
| // Read the entire contents of stdin up front. |
| let mut data = vec![]; |
| if let Err(err) = io::stdin().read_to_end(&mut data) { |
| println!("{}", err); |
| process::exit(1); |
| } |
| match run(&data) { |
| None => { |
| println!("error: could not count records, buffer too small"); |
| process::exit(1); |
| } |
| Some(count) => { |
| println!("{}", count); |
| } |
| } |
| } |
| ``` |
| |
| And compile and run it: |
| |
| ```text |
| $ cargo build --release |
| $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2176 |
| |
| real 0m0.572s |
| user 0m0.513s |
| sys 0m0.057s |
| ``` |
| |
| This isn't as fast as some of our previous examples where we used the `csv` |
| crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because |
| this example reads a field at a time, which incurs more overhead than reading a |
| record at a time. To fix this, you would want to use the |
| [`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record) |
| method instead, which is defined on `csv_core::Reader`. |
| |
| The other thing to notice here is that the example is considerably longer than |
| the other examples. This is because we need to do more bookkeeping to keep
| track of which field we're reading and how much data we've already fed to the |
| reader. There are basically two reasons to use the `csv_core` crate: |
| |
| 1. If you're in an environment where the standard library is not usable. |
| 2. If you wanted to build your own csv-like library, you could build it on top |
| of `csv-core`. |
| |
| # Closing thoughts |
| |
| Congratulations on making it to the end! It seems incredible that one could |
| write so many words on something as basic as CSV parsing. I wanted this |
| guide to be accessible not only to Rust beginners, but to inexperienced |
| programmers as well. My hope is that the large number of examples will help |
| push you in the right direction. |
| |
| With that said, here are a few more things you might want to look at: |
| |
| * The [API documentation for the `csv` crate](../index.html) documents all |
| facets of the library, and is itself littered with even more examples. |
| * The [`csv-index` crate](https://docs.rs/csv-index) provides data structures
|   for indexing CSV data that are amenable to writing to disk. (This library
|   is still a work in progress.)
| * The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high |
| performance CSV swiss army knife. It can slice, select, search, sort, join, |
| concatenate, index, format and compute statistics on arbitrary CSV data. Give |
| it a try! |
| |
| */ |