Crate parquet

Apache Parquet is a columnar storage format that provides efficient data compression and encoding schemes to improve the performance of handling complex nested data structures. Parquet implements the record-shredding and assembly algorithm described in the Dremel paper.
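To make the idea of record shredding and assembly concrete, here is a minimal conceptual sketch using only the standard library. It is not this crate's implementation: the `Record`, `RequiredColumn`, and `OptionalColumn` types and the `shred` and `assemble` functions are hypothetical names chosen for illustration. The sketch handles only a flat record with one optional field, so a single definition level per record is enough; the full Dremel algorithm also uses repetition levels for repeated and nested fields.

```rust
/// A row-oriented record with one required and one optional field.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct Record {
    id: i64,
    name: Option<String>,
}

/// Column for a required field: exactly one value per record.
#[derive(Debug, Default, PartialEq)]
struct RequiredColumn {
    values: Vec<i64>,
}

/// Column for an optional field. A definition level of 1 means the value
/// is present, 0 means it is null. Only present values are stored.
#[derive(Debug, Default, PartialEq)]
struct OptionalColumn {
    def_levels: Vec<u8>,
    values: Vec<String>,
}

/// Shreds row-oriented records into one column per field.
fn shred(records: &[Record]) -> (RequiredColumn, OptionalColumn) {
    let mut ids = RequiredColumn::default();
    let mut names = OptionalColumn::default();
    for record in records {
        ids.values.push(record.id);
        match &record.name {
            Some(name) => {
                names.def_levels.push(1);
                names.values.push(name.clone());
            }
            None => names.def_levels.push(0),
        }
    }
    (ids, names)
}

/// Reassembles records from the columns produced by `shred`.
fn assemble(ids: &RequiredColumn, names: &OptionalColumn) -> Vec<Record> {
    // Present values are consumed in order, one per definition level of 1.
    let mut present = names.values.iter();
    ids.values
        .iter()
        .zip(&names.def_levels)
        .map(|(&id, &level)| Record {
            id,
            name: if level == 1 { present.next().cloned() } else { None },
        })
        .collect()
}
```

Shredding a list of records and passing the resulting columns to `assemble` returns the original records, which is the round-trip property the columnar layout relies on.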

This crate provides an API to access the file schema and metadata of a Parquet file, extract row groups or column chunks from a file, and read and write records and values.

Usage

See crates.io/crates/parquet for the latest version of the crate.

Add parquet to the list of dependencies in Cargo.toml, and add this to the project's crate root:

extern crate parquet;

Example

Import the file reader to get access to Parquet metadata, including the file schema.

#![feature(try_from)]

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::convert::TryFrom;

let reader = SerializedFileReader::try_from("data/alltypes_plain.parquet").unwrap();

let parquet_metadata = reader.metadata();
assert_eq!(parquet_metadata.num_row_groups(), 1);

let file_metadata = parquet_metadata.file_metadata();
assert_eq!(file_metadata.num_rows(), 8);

let schema = file_metadata.schema();
assert_eq!(schema.get_fields().len(), 11);

The crate provides several read and write API options. Below is an example of using the record reader API.

#![feature(try_from)]

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::convert::TryFrom;

let reader = SerializedFileReader::try_from("data/alltypes_plain.parquet").unwrap();

// Read data using the record API with an optional projection schema
// (None reads all columns).
for record in reader.get_row_iter(None).unwrap() {
    // See the record API for the different field accessors.
    println!("{}", record);
}

Metadata

The metadata module contains Parquet metadata structs. File metadata holds information about the file schema, version, and number of rows. Row group metadata holds a set of column chunks, each of which records the column type and encodings, the number of values, and the compressed and uncompressed size in bytes.
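The hierarchy described above (file, then row groups, then column chunks) can be pictured with a small standard-library sketch. The `FileSketch`, `RowGroupSketch`, and `ColumnChunkSketch` types below are hypothetical stand-ins that mirror only the fields named in this section; they are not the structs defined in the metadata module, and the field names and types are assumptions.

```rust
/// Hypothetical stand-in for column chunk metadata (not the crate's type).
#[derive(Debug, Clone)]
struct ColumnChunkSketch {
    physical_type: &'static str,
    encodings: Vec<&'static str>,
    num_values: i64,
    compressed_size: i64,
    uncompressed_size: i64,
}

/// Hypothetical stand-in for row group metadata: a set of column chunks.
#[derive(Debug, Clone)]
struct RowGroupSketch {
    num_rows: i64,
    columns: Vec<ColumnChunkSketch>,
}

/// Hypothetical stand-in for file-level metadata.
#[derive(Debug, Clone)]
struct FileSketch {
    version: i32,
    num_rows: i64,
    row_groups: Vec<RowGroupSketch>,
}

impl FileSketch {
    /// Sums (compressed, uncompressed) bytes over every column chunk.
    fn total_sizes(&self) -> (i64, i64) {
        self.row_groups
            .iter()
            .flat_map(|rg| rg.columns.iter())
            .fold((0, 0), |(c, u), col| {
                (c + col.compressed_size, u + col.uncompressed_size)
            })
    }

    /// Uncompressed bytes divided by compressed bytes, or None when the
    /// file holds no compressed data.
    fn compression_ratio(&self) -> Option<f64> {
        let (compressed, uncompressed) = self.total_sizes();
        if compressed > 0 {
            Some(uncompressed as f64 / compressed as f64)
        } else {
            None
        }
    }
}
```

Walking the row groups and their column chunks in this way is the typical pattern for answering questions such as "how well does this file compress?" from metadata alone, without reading any data pages.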

Statistics

Statistics are optional and provide min/max values, null count, and so on for each column or data page; they can be accessed from the column or data page they describe. They are described in the statistics module.
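As an illustration of why min/max values and null counts are useful, the following standard-library sketch computes them for a chunk of values and uses them to decide whether the chunk can be skipped when searching for a value. `ChunkStats`, `compute_stats`, and `may_contain` are hypothetical names; this is not the API of the statistics module, only a sketch of the general technique, assuming a single `i64` column.

```rust
/// Hypothetical per-chunk statistics for an i64 column.
#[derive(Debug, Clone, PartialEq)]
struct ChunkStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
}

/// Computes min, max, and null count over a chunk of nullable values.
fn compute_stats(values: &[Option<i64>]) -> ChunkStats {
    let mut stats = ChunkStats { min: None, max: None, null_count: 0 };
    for value in values {
        match *value {
            Some(x) => {
                stats.min = Some(stats.min.map_or(x, |m| m.min(x)));
                stats.max = Some(stats.max.map_or(x, |m| m.max(x)));
            }
            None => stats.null_count += 1,
        }
    }
    stats
}

/// Returns false only when the statistics prove that `target` cannot be
/// in the chunk. Because statistics are optional, a missing value is
/// treated as "unknown" and the chunk is kept.
fn may_contain(stats: Option<&ChunkStats>, target: i64) -> bool {
    match stats {
        None => true,
        Some(s) => match (s.min, s.max) {
            (Some(min), Some(max)) => min <= target && target <= max,
            // Missing bounds: be conservative and do not skip the chunk.
            _ => true,
        },
    }
}
```

A reader built on this idea would call `may_contain` for each chunk and read only those for which it returns true.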

Schema and type

The Parquet schema can be extracted from FileMetaData and is represented by a Parquet type.

A Parquet type is described by Type, including the top-level message type (schema). Refer to the schema module for detailed information on the Type API and on printing and parsing message types.
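To show how a schema forms a tree with a message type at the root, here is a simplified standard-library sketch that prints such a tree in the textual message format. `SchemaNode`, `print_node`, and `print_message` are hypothetical names and the node layout is an assumption; the crate's own Type carries more information (for example logical types), and its printer lives in the schema module.

```rust
/// Hypothetical, simplified schema node. A node is either a primitive
/// leaf column or a group of nested fields.
enum SchemaNode {
    Primitive {
        name: String,
        repetition: &'static str,
        physical_type: &'static str,
    },
    Group {
        name: String,
        repetition: &'static str,
        fields: Vec<SchemaNode>,
    },
}

/// Appends the textual form of `node` to `out`, indented by `indent` levels.
fn print_node(node: &SchemaNode, indent: usize, out: &mut String) {
    let pad = "  ".repeat(indent);
    match node {
        SchemaNode::Primitive { name, repetition, physical_type } => {
            out.push_str(&format!("{}{} {} {};\n", pad, repetition, physical_type, name));
        }
        SchemaNode::Group { name, repetition, fields } => {
            out.push_str(&format!("{}{} group {} {{\n", pad, repetition, name));
            for field in fields {
                print_node(field, indent + 1, out);
            }
            out.push_str(&format!("{}}}\n", pad));
        }
    }
}

/// Prints a top-level message type with the given fields.
fn print_message(name: &str, fields: &[SchemaNode]) -> String {
    let mut out = format!("message {} {{\n", name);
    for field in fields {
        print_node(field, 1, &mut out);
    }
    out.push_str("}\n");
    out
}
```

The recursion in `print_node` mirrors the nesting of the schema: each group increases the indentation by one level, and primitive fields terminate the recursion.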

File and row group API

The file module contains all definitions needed to explore Parquet file metadata and data. The file reader, FileReader, is the starting point for working with Parquet files: it provides a set of methods to get file metadata and row group readers (RowGroupReader), which in turn give access to column readers and a record iterator.

Read API

The crate offers several methods to read data from a Parquet file:

Low level column reader API (see file and column modules)
Arrow API (TODO)
High level record API (see record module)

Write API

The crate also provides an API to write data in Parquet format:

Low level column writer API (see file and column modules)
Arrow API (TODO)
High level API for writing records (TODO)

Modules

basic	Contains Rust mappings for the Thrift definitions. Refer to the `parquet.thrift` file to see the raw definitions.
column	Low level column reader and writer APIs.
compression	Contains codec interface and supported codec implementations.
data_type	Data types that connect Parquet physical types with their Rust-specific representations.
decoding	Contains all supported decoders for Parquet.
encoding	Contains all supported encoders for Parquet.
errors	Common Parquet errors and macros.
file	Main entrypoint for working with Parquet API.
memory	Utility methods and structs for working with memory.
record	Contains record-based API for reading Parquet files.
schema	Parquet schema definitions and methods to print and parse schema.