Schema

Delta Lake Schemas

Schemas, fields, and data types are provided in the deltalake.schema submodule.

deltalake.schema.Schema

Schema(fields: List[Field])

Bases: deltalake._internal.StructType

A Delta Lake schema

Create using a list of Field objects:

Schema([Field("x", "integer"), Field("y", "string")])
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])

Or create from a PyArrow schema:

import pyarrow as pa
Schema.from_pyarrow(pa.schema({"x": pa.int32(), "y": pa.string()}))
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])

invariants

invariants: List[Tuple[str, str]] = <attribute 'invariants' of 'deltalake._internal.Schema' objects>

The list of invariants on the table. Each invariant is a tuple of strings. The first string is the field path and the second is the SQL of the invariant.

from_json staticmethod

from_json(schema_json) -> Schema

Create a new Schema from a JSON string.

A schema has the same JSON format as a StructType.

Schema.from_json(
    '''{
        "type": "struct",
        "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
    }'''
)
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True)])

Parameters:

Name Type Description Default
schema_json str

a JSON string

required

from_pyarrow staticmethod

from_pyarrow(data_type) -> Schema

Create a Schema from a PyArrow Schema type

Will raise TypeError if the PyArrow type is not a primitive type.

Parameters:

Name Type Description Default
data_type Schema

A PyArrow Schema type

required

Returns: a Schema type

to_json method descriptor

to_json() -> str

Get the JSON string representation of the Schema. A schema has the same JSON format as a StructType.

Schema([Field("x", "integer")]).to_json()
# Returns '{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'

Returns: a JSON string
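
As a rough illustration of the JSON layout, the sketch below builds the same compact string with only the standard library. The helper name `schema_json` and its tuple-based input are hypothetical and not part of deltalake; the field keys are taken from the example output above.

```python
import json

def schema_json(columns):
    """Serialize (name, type, nullable) tuples to struct-style schema JSON.

    Hypothetical helper for illustration only; it is not a deltalake API.
    """
    fields = [
        {"name": name, "type": type_name, "nullable": nullable, "metadata": {}}
        for name, type_name, nullable in columns
    ]
    # Compact separators reproduce the whitespace-free form shown above.
    return json.dumps({"type": "struct", "fields": fields}, separators=(",", ":"))

compact = schema_json([("x", "integer", True)])
print(compact)
```

A string built this way should be suitable as input to Schema.from_json, since it follows the documented format.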

to_pyarrow method descriptor

to_pyarrow(as_large_types: bool = False) -> pyarrow.Schema

Return equivalent PyArrow schema

Parameters:

Name Type Description Default
as_large_types bool

If True, return the schema with all variable-size types (list, binary, string) as their large variants (with int64 offsets). This is for compatibility with systems like Polars that only support the large versions of Arrow types.

False

Returns:

Type Description
Schema

a PyArrow Schema type

deltalake.schema.PrimitiveType

PrimitiveType(data_type: str)

A primitive datatype, such as a string or number.

Can be initialized with a string value:

PrimitiveType("integer")

Valid primitive data types include:

  • "string",
  • "long",
  • "integer",
  • "short",
  • "byte",
  • "float",
  • "double",
  • "boolean",
  • "binary",
  • "date",
  • "timestamp",
  • "decimal(<precision>, <scale>)"
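
The list above can be turned into a quick pre-check before constructing a PrimitiveType. This is a minimal sketch: the helper `is_listed_primitive` is hypothetical, it only knows the names listed here, and the library itself may accept other spellings or additional types.

```python
import re

# Names copied from the list above; the library may accept more than these.
SIMPLE_TYPES = {
    "string", "long", "integer", "short", "byte", "float",
    "double", "boolean", "binary", "date", "timestamp",
}

# decimal(<precision>, <scale>) with integer arguments and optional spaces.
DECIMAL_RE = re.compile(r"decimal\(\s*\d+\s*,\s*\d+\s*\)")

def is_listed_primitive(name: str) -> bool:
    """Return True if `name` matches one of the primitive types listed above."""
    return name in SIMPLE_TYPES or DECIMAL_RE.fullmatch(name) is not None

print(is_listed_primitive("integer"))         # True
print(is_listed_primitive("decimal(10, 2)"))  # True
print(is_listed_primitive("int"))             # False
```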

Parameters:

Name Type Description Default
data_type str

string representation of the data type

required

type

type: str = <attribute 'type' of 'deltalake._internal.PrimitiveType' objects>

The inner type

from_json staticmethod

from_json(type_json) -> PrimitiveType

Create a PrimitiveType from a JSON string

The JSON representation for a primitive type is just a quoted string: PrimitiveType.from_json('"integer"')

Parameters:

Name Type Description Default
type_json str

A JSON string

required

Returns: a PrimitiveType type

from_pyarrow staticmethod

from_pyarrow(data_type) -> PrimitiveType

Create a PrimitiveType from a PyArrow type

Will raise TypeError if the PyArrow type is not a primitive type.

Parameters:

Name Type Description Default
data_type DataType

A PyArrow DataType type

required

Returns: a PrimitiveType type

to_pyarrow method descriptor

to_pyarrow() -> pyarrow.DataType

Get the equivalent PyArrow type (pyarrow.DataType)

deltalake.schema.ArrayType

ArrayType(
    element_type: DataType, *, contains_null: bool = True
)

An Array (List) DataType

Can either pass the element type explicitly or can pass a string if it is a primitive type:

ArrayType(PrimitiveType("integer"))
# Returns ArrayType(PrimitiveType("integer"), contains_null=True)

ArrayType("integer", contains_null=False)
# Returns ArrayType(PrimitiveType("integer"), contains_null=False)

contains_null

contains_null: bool = <attribute 'contains_null' of 'deltalake._internal.ArrayType' objects>

Whether the arrays may contain null values

element_type

element_type: DataType = <attribute 'element_type' of 'deltalake._internal.ArrayType' objects>

The type of the element, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]

type

type: Literal['array'] = <attribute 'type' of 'deltalake._internal.ArrayType' objects>

The string "array"

from_json staticmethod

from_json(type_json) -> ArrayType

Create an ArrayType from a JSON string

The JSON representation for an array type is an object with type (set to "array"), elementType, and containsNull:

ArrayType.from_json(
    '''{
        "type": "array",
        "elementType": "integer",
        "containsNull": false
    }'''
)
# Returns ArrayType(PrimitiveType("integer"), contains_null=False)
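
Because element_type may itself be an array, map, or struct, the JSON can nest. The sketch below builds such a document with the standard library; the helper `array_type` is hypothetical, and representing a nested element type as a JSON object in elementType is an assumption based on the format described above.

```python
import json

def array_type(element_type, contains_null=True):
    """Build the dict form of an array type (hypothetical helper, not deltalake)."""
    return {
        "type": "array",
        "elementType": element_type,
        "containsNull": contains_null,
    }

# An array whose elements are non-nullable integer arrays.
nested = array_type(array_type("integer", contains_null=False))
nested_json = json.dumps(nested, indent=4)
print(nested_json)
```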

Parameters:

Name Type Description Default
type_json str

A JSON string

required

Returns: an ArrayType type

from_pyarrow staticmethod

from_pyarrow(data_type) -> ArrayType

Create an ArrayType from a pyarrow.ListType.

Will raise TypeError if a different PyArrow DataType is provided.

Parameters:

Name Type Description Default
data_type ListType

The PyArrow ListType

required

Returns: an ArrayType type

to_json method descriptor

to_json() -> str

Get the JSON string representation of the type.

to_pyarrow method descriptor

to_pyarrow() -> pyarrow.ListType

Get the equivalent PyArrow type.

deltalake.schema.MapType

MapType(
    key_type: DataType,
    value_type: DataType,
    *,
    value_contains_null: bool = True
)

A map data type

key_type and value_type should be PrimitiveType, ArrayType, or StructType. A string can also be passed, which will be parsed as a primitive type:

MapType(PrimitiveType("integer"), PrimitiveType("string"))
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)

MapType("integer", "string", value_contains_null=False)
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=False)

key_type

key_type: DataType = <attribute 'key_type' of 'deltalake._internal.MapType' objects>

The type of the keys, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]

value_contains_null

value_contains_null: bool = <attribute 'value_contains_null' of 'deltalake._internal.MapType' objects>

Whether the values in a map may be null

value_type

value_type: DataType = <attribute 'value_type' of 'deltalake._internal.MapType' objects>

The type of the values, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]

from_json staticmethod

from_json(type_json) -> MapType

Create a MapType from a JSON string

The JSON representation for a map type is an object with type (set to "map"), keyType, valueType, and valueContainsNull:

MapType.from_json(
    '''{
        "type": "map",
        "keyType": "integer",
        "valueType": "string",
        "valueContainsNull": true
    }'''
)
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
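
The primitive, array, map, and struct JSON forms compose recursively. As a sketch of how such a document could be walked without the library, the hypothetical helper `primitive_leaves` below collects the primitive type names found in a parsed type. It assumes nested types appear as JSON objects in place of the quoted primitive names.

```python
import json

def primitive_leaves(type_json):
    """Return the primitive type names in a parsed type, depth-first.

    Hypothetical helper for illustration; it is not part of deltalake.
    """
    if isinstance(type_json, str):
        # A primitive type is just a string such as "integer".
        return [type_json]
    kind = type_json["type"]
    if kind == "array":
        return primitive_leaves(type_json["elementType"])
    if kind == "map":
        return (primitive_leaves(type_json["keyType"])
                + primitive_leaves(type_json["valueType"]))
    if kind == "struct":
        leaves = []
        for field in type_json["fields"]:
            leaves.extend(primitive_leaves(field["type"]))
        return leaves
    raise ValueError(f"unknown type: {kind!r}")

parsed = json.loads('''{
    "type": "map",
    "keyType": "integer",
    "valueType": {"type": "array", "elementType": "string", "containsNull": true},
    "valueContainsNull": true
}''')
leaves = primitive_leaves(parsed)
print(leaves)  # ['integer', 'string']
```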

Parameters:

Name Type Description Default
type_json str

A JSON string

required

Returns: a MapType type

from_pyarrow staticmethod

from_pyarrow(data_type) -> MapType

Create a MapType from a PyArrow MapType.

Will raise TypeError if passed a different type.

Parameters:

Name Type Description Default
data_type MapType

the PyArrow MapType

required

Returns: a MapType type

to_json method descriptor

to_json() -> str

Get JSON string representation of map type.

to_pyarrow method descriptor

to_pyarrow() -> pyarrow.MapType

Get the equivalent PyArrow data type.

deltalake.schema.Field

Field(
    name: str,
    type: DataType,
    *,
    nullable: bool = True,
    metadata: Optional[Dict[str, Any]] = None
)

A field in a Delta StructType or Schema

Can create with just a name and a type:

Field("my_int_col", "integer")
# Returns Field("my_int_col", PrimitiveType("integer"), nullable=True, metadata=None)

Can also attach metadata to the field. Metadata should be a dictionary with string keys and JSON-serializable values (str, list, int, float, dict):

Field("my_col", "integer", metadata={"custom_metadata": {"test": 2}})
# Returns Field("my_col", PrimitiveType("integer"), nullable=True, metadata={"custom_metadata": {"test": 2}})
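
One way to check the metadata requirement up front is to try serializing the dictionary with the standard library. This is only a sketch: `is_json_serializable` is a hypothetical helper, and the library's own validation may differ.

```python
import json

def is_json_serializable(metadata) -> bool:
    """Return True if metadata has string keys and JSON-serializable values."""
    if not all(isinstance(key, str) for key in metadata):
        return False
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        # TypeError: unsupported value type; ValueError: circular reference.
        return False
    return True

ok = is_json_serializable({"custom_metadata": {"test": 2}})
bad = is_json_serializable({"created": {1, 2, 3}})  # a set is not JSON-serializable
print(ok, bad)  # True False
```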

metadata

metadata: Dict[str, Any] = <attribute 'metadata' of 'deltalake._internal.Field' objects>

The metadata of the field

name

name: str = <attribute 'name' of 'deltalake._internal.Field' objects>

The name of the field

nullable

nullable: bool = <attribute 'nullable' of 'deltalake._internal.Field' objects>

Whether there may be null values in the field

type

type: DataType = <attribute 'type' of 'deltalake._internal.Field' objects>

The type of the field, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]

from_json staticmethod

from_json(field_json) -> Field

Create a Field from a JSON string.

Parameters:

Name Type Description Default
field_json str

the JSON string.

required

Returns: Field

Example:

Field.from_json('''{
        "name": "col",
        "type": "integer",
        "nullable": true,
        "metadata": {}
    }'''
)
# Returns Field(col, PrimitiveType("integer"), nullable=True)

from_pyarrow staticmethod

from_pyarrow(field: pyarrow.Field) -> Field

Create a Field from a PyArrow field.

Note: This currently doesn't preserve field metadata.

Parameters:

Name Type Description Default
field Field

a PyArrow Field type

required

Returns: a Field type

to_json method descriptor

to_json() -> str

Get the field as JSON string.

Field("col", "integer").to_json()
# Returns '{"name":"col","type":"integer","nullable":true,"metadata":{}}'

to_pyarrow method descriptor

to_pyarrow() -> pyarrow.Field

Convert to an equivalent PyArrow field.

Note: This currently doesn't preserve field metadata.

Returns: a pyarrow.Field type

deltalake.schema.StructType

StructType(fields: List[Field])

A struct datatype, containing one or more subfields

Example:

Create with a list of Field objects:

StructType([Field("x", "integer"), Field("y", "string")])
# Creates: StructType([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])

fields

fields: List[Field] = <attribute 'fields' of 'deltalake._internal.StructType' objects>

The fields within the struct

type

type: Literal['struct'] = <attribute 'type' of 'deltalake._internal.StructType' objects>

The string "struct"

from_json staticmethod

from_json(type_json) -> StructType

Create a new StructType from a JSON string.

StructType.from_json(
    '''{
        "type": "struct",
        "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
    }'''
)
# Returns StructType([Field(x, PrimitiveType("integer"), nullable=True)])

Parameters:

Name Type Description Default
type_json str

a JSON string

required

Returns: a StructType type

from_pyarrow staticmethod

from_pyarrow(data_type) -> StructType

Create a new StructType from a PyArrow struct type.

Will raise TypeError if a different data type is provided.

Parameters:

Name Type Description Default
data_type StructType

a PyArrow struct type.

required

Returns: a StructType type

to_json method descriptor

to_json() -> str

Get the JSON representation of the type.

StructType([Field("x", "integer")]).to_json()
# Returns '{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'

to_pyarrow method descriptor

to_pyarrow() -> pyarrow.StructType

Get the equivalent PyArrow StructType

Returns: a PyArrow StructType type

DataCatalog

Bases: Enum

List of the Data Catalogs

AWS class-attribute instance-attribute

AWS = 'glue'

Refers to the AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html)

UNITY class-attribute instance-attribute

UNITY = 'unity'

Refers to the Databricks Unity Catalog (https://docs.databricks.com/data-governance/unity-catalog/index.html)

Delta Storage Handler

DeltaStorageHandler

DeltaStorageHandler(
    root: str,
    options: dict[str, str] | None = None,
    known_sizes: dict[str, int] | None = None,
)

Bases: DeltaFileSystemHandler, FileSystemHandler

DeltaStorageHandler is a concrete implementation of a PyArrow FileSystemHandler.

get_file_info_selector

get_file_info_selector(
    selector: FileSelector,
) -> List[FileInfo]

Get info for the files defined by FileSelector.

Parameters:

Name Type Description Default
selector FileSelector

FileSelector object

required

Returns:

Type Description
List[FileInfo]

list of file info objects

open_input_file

open_input_file(path: str) -> pa.PythonFile

Open an input file for random access reading.

Parameters:

Name Type Description Default
path str

The source to open for reading.

required

Returns:

Type Description
PythonFile

NativeFile

open_input_stream

open_input_stream(path: str) -> pa.PythonFile

Open an input stream for sequential reading.

Parameters:

Name Type Description Default
path str

The source to open for reading.

required

Returns:

Type Description
PythonFile

NativeFile

open_output_stream

open_output_stream(
    path: str, metadata: Optional[Dict[str, str]] = None
) -> pa.PythonFile

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Parameters:

Name Type Description Default
path str

The source to open for writing.

required
metadata Optional[Dict[str, str]]

If not None, a mapping of string keys to string values.

None

Returns:

Type Description
PythonFile

NativeFile