zlib and bzlib package updates

Duncan Coutts – Sunday, 02 November 2008

all coding

I'm pleased to announce updates to the zlib and bzlib packages.

The releases are on Hackage:

What's new

What's new in these releases is that the extended API is slightly nicer. The simple API that most packages use is unchanged.

In particular, these functions have different types:

compressWith   :: CompressParams
               -> ByteString -> ByteString
decompressWith :: DecompressParams
               -> ByteString -> ByteString

The CompressParams and DecompressParams types are records of compression/decompression parameters. The functions are used like so:

compressWith   defaultCompressParams { ... }
decompressWith defaultDecompressParams { ... }

There is also a new parameter to control the size of the first output buffer. This lets applications save memory when they happen to have a good estimate of the output size (some apps like darcs know this exactly). By getting a good estimate and (de)compressing into a single-chunk lazy bytestring this lets apps convert to a strict bytestring with no extra copying cost.

Future directions

The simple API is very unlikely to change.

The current error handling for decompression is not ideal. It just throws exceptions for failures like bad format or unexpected end of stream. This is a tricky area because error streaming behaviour does not mix easily with error handling.

On option which I use in the iconv library is to have a data type describe the real error conditions, something like:

data DataStream
   = Chunk Strict.ByteString Checksum DataStream
   | Error Error -- for some suitable error type
   | End Checksum

With suitable fold functions and functions to convert to a lazy ByteString. Then people who care about error handling and streaming behaviour can use that type directly. For example it should be trivial to convert to an iterator style.

People have also asked for a continuation style api to give more control over dynamic behaviour like flushing the compression state (eg in a http server). Unfortunately this does not look easy. The zlib state is mutable and while this can be hidden in a lazy list, it cannot be hidden if we provide access to intermediate continuations. That is because those continuations can be re-run whereas a lazy list evaluates each element at most once (and with suitable internal locking this is even true for SMP).

Background

The zlib and bzlib packages provide functions for compression and decompression in the gzip, zlib and bzip2 formats. Both provide pure functions on streams of data represented by lazy ByteStrings:

compress, decompress :: ByteString -> ByteString

This makes it easy to use either in memory or with disk or network IO. For example a simple gzip compression program is just:

import qualified Data.ByteString.Lazy as BS
import qualified Codec.Compression.GZip as GZip

main = BS.interact GZip.compress

Or you could lazily read in and decompress .gz file using:

content <- GZip.decompress <$> BS.readFile file

General

Both packages are bindings to the corresponding C libs, so they depend on those external C libraries (except on Windows where we build a bundled copy of the C lib source code). The compression speed is as you would expect since it's the C lib that is doing all the work.

The zlib package is used in cabal-install to work with .tar.gz files. So it has actually been tested on Windows. It works with all versions of ghc since 6.4 (though it requires Cabal-1.2).

The darcs repos for the development versions live on code.haskell.org:

darcs get http://code.haskell.org/zlib
darcs get http://code.haskell.org/bzlib

I'm very happy to get feedback on the API, the documentation or of course any bug reports.