Use the slice-by-four algorithm for CRC64. Compared to the C version, this reads the input one byte at a time instead of four aligned bytes at a time. Even so, it is faster than the previous, simpler version. The code was adapted from XZ Utils by Brett Okken, who also did the benchmarking. Thanks!
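
For illustration only, here is a minimal sketch of what slice-by-four with byte-wise input reads can look like in Java. It is not the committed code: the class name Crc64Sketch and the methods sliceBy4 and bytewise are hypothetical. It assumes the CRC64 parameters used by the .xz format (reflected polynomial 0xC96C5795D7870F42, initial value and final XOR of all ones). The bytewise method is a plain one-table loop included only as a cross-check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and layout are assumptions, not the real class.
public final class Crc64Sketch {
    // Reflected ECMA-182 polynomial, as used by CRC64 in the .xz format.
    private static final long POLY = 0xC96C5795D7870F42L;

    // TABLE[s][b] is the CRC state after byte b followed by s zero bytes.
    private static final long[][] TABLE = new long[4][256];

    static {
        for (int s = 0; s < 4; ++s) {
            for (int b = 0; b < 256; ++b) {
                long r = (s == 0) ? b : TABLE[s - 1][b];
                for (int i = 0; i < 8; ++i) {
                    r = ((r & 1) != 0) ? (r >>> 1) ^ POLY : r >>> 1;
                }
                TABLE[s][b] = r;
            }
        }
    }

    // Slice-by-four: four table lookups per iteration, input read per byte.
    static long sliceBy4(byte[] buf) {
        long crc = -1L;
        int off = 0;
        int end4 = buf.length - (buf.length & 3);

        while (off < end4) {
            int low = (int) crc; // low 32 bits get mixed with the 4 bytes
            crc = TABLE[3][(buf[off] ^ low) & 0xFF]
                ^ TABLE[2][(buf[off + 1] ^ (low >>> 8)) & 0xFF]
                ^ (crc >>> 32)
                ^ TABLE[1][(buf[off + 2] ^ (low >>> 16)) & 0xFF]
                ^ TABLE[0][(buf[off + 3] ^ (low >>> 24)) & 0xFF];
            off += 4;
        }

        // Remaining 0 to 3 bytes: ordinary one-table update.
        while (off < buf.length) {
            crc = TABLE[0][(buf[off++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
        }
        return ~crc;
    }

    // Simple one-byte-per-lookup loop, used here only for comparison.
    static long bytewise(byte[] buf) {
        long crc = -1L;
        for (byte b : buf) {
            crc = TABLE[0][(b ^ (int) crc) & 0xFF] ^ (crc >>> 8);
        }
        return ~crc;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // Published check value for CRC-64/XZ: 995dc9bbdf1939fa
        System.out.println(Long.toHexString(sliceBy4(data)));
        System.out.println(sliceBy4(data) == bytewise(data));
    }
}
```

Each iteration replaces four dependent table lookups with four independent ones that are XORed together, which is where the speedup over the one-table loop would come from even without aligned 32-bit reads.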