Schema
Delta Lake Schemas
Schemas, fields, and data types are provided in the deltalake.schema submodule.
deltalake.schema.Schema
Schema(fields: List[Field])
Bases: deltalake._internal.StructType
A Delta Lake schema
Create using a list of Field objects:
Schema([Field("x", "integer"), Field("y", "string")])
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
Or create from a PyArrow schema:
import pyarrow as pa
Schema.from_pyarrow(pa.schema({"x": pa.int32(), "y": pa.string()}))
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
invariants
invariants: List[Tuple[str, str]] = <attribute 'invariants' of 'deltalake._internal.Schema' objects>
The list of invariants on the table. Each invariant is a tuple of two strings: the first is the field path and the second is the SQL of the invariant.
from_json
staticmethod
from_json(schema_json) -> Schema
Create a new Schema from a JSON string.
A schema has the same JSON format as a StructType.
Schema.from_json('''{
"type": "struct",
"fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
}''')
# Returns Schema([Field(x, PrimitiveType("integer"), nullable=True)])
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | a JSON string | required |
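Hand-writing the JSON string makes it easy to misplace a quote or to use Python's `True` where JSON expects `true`. The following is a minimal sketch, assuming you would rather build the string with the standard `json` module. The dict layout mirrors the example above, and the `deltalake` import is guarded because the package may not be installed where this runs.

```python
import json

# Describe the schema as a plain dict that follows the StructType JSON layout.
schema_dict = {
    "type": "struct",
    "fields": [
        {"name": "x", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "y", "type": "string", "nullable": True, "metadata": {}},
    ],
}

# json.dumps converts Python's True to JSON's true and handles quoting.
schema_json = json.dumps(schema_dict)

try:
    from deltalake.schema import Schema
except ImportError:
    Schema = None  # deltalake is not installed; only the JSON string is built

if Schema is not None:
    # Parse the generated string with the documented static method.
    schema = Schema.from_json(schema_json)
    print(schema)
```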
from_pyarrow
staticmethod
from_pyarrow(data_type) -> Schema
to_json
method descriptor
to_json() -> str
Get the JSON string representation of the Schema. A schema has the same JSON format as a StructType.
Schema([Field("x", "integer")]).to_json()
# Returns '{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
to_pyarrow
method descriptor
to_pyarrow(as_large_types: bool = False) -> pyarrow.Schema
Return equivalent PyArrow schema
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| as_large_types | bool | get schema with all variable size types (list, binary, string) as large variants (with int64 indices). This is for compatibility with systems like Polars that only support the large versions of Arrow types. | False |
Returns:
| Type | Description |
|---|---|
| Schema | a PyArrow Schema type |
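To illustrate what "large variants" means on the PyArrow side, the sketch below builds two hand-written PyArrow schemas: one with the regular variable-size types and one with their large counterparts. It does not call `to_pyarrow` itself, and the field names are made up for illustration. The `pyarrow` import is guarded in case the package is unavailable.

```python
try:
    import pyarrow as pa
except ImportError:
    pa = None  # pyarrow is not installed; nothing below runs

if pa is not None:
    # Regular variable-size types (string, binary, list).
    regular = pa.schema([
        ("name", pa.string()),
        ("payload", pa.binary()),
        ("scores", pa.list_(pa.int32())),
    ])
    # The corresponding large variants, which use 64-bit offsets.
    large = pa.schema([
        ("name", pa.large_string()),
        ("payload", pa.large_binary()),
        ("scores", pa.large_list(pa.int32())),
    ])
    # Show how one field's type differs between the two schemas.
    print(regular.field("name").type, "->", large.field("name").type)
```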
deltalake.schema.PrimitiveType
PrimitiveType(data_type: str)
A primitive datatype, such as a string or number.
Can be initialized with a string value:
PrimitiveType("integer")
Valid primitive data types include:
- "string",
- "long",
- "integer",
- "short",
- "byte",
- "float",
- "double",
- "boolean",
- "binary",
- "date",
- "timestamp",
- "decimal(<precision>, <scale>)"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_type | str | string representation of the data type | required |
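When type names come from user input or configuration, it can help to check them before constructing a `PrimitiveType`. The sketch below is one possible pre-check using only the standard library. The helper name `is_primitive_type_name` is hypothetical, and the tolerance for whitespace inside `decimal(...)` is an assumption; the library's own parser remains the authority on what is accepted.

```python
import re

# Fixed-name primitive types, copied from the list above.
SIMPLE_PRIMITIVES = {
    "string", "long", "integer", "short", "byte", "float",
    "double", "boolean", "binary", "date", "timestamp",
}

# decimal takes two integer arguments: precision, then scale.
DECIMAL_PATTERN = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def is_primitive_type_name(name: str) -> bool:
    """Hypothetical helper: True if name looks like a listed primitive type."""
    if name in SIMPLE_PRIMITIVES:
        return True
    return DECIMAL_PATTERN.fullmatch(name) is not None


print(is_primitive_type_name("decimal(10, 2)"))
print(is_primitive_type_name("int"))
```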
type
type: str = <attribute 'type' of 'deltalake._internal.PrimitiveType' objects>
The inner type
from_json
staticmethod
from_json(type_json) -> PrimitiveType
Create a PrimitiveType from a JSON string
The JSON representation for a primitive type is just a quoted string: PrimitiveType.from_json('"integer"')
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | A JSON string | required |
Returns: a PrimitiveType type
from_pyarrow
staticmethod
from_pyarrow(data_type) -> PrimitiveType
Create a PrimitiveType from a PyArrow type
Will raise TypeError if the PyArrow type is not a primitive type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| type | DataType | A PyArrow DataType type | required |
Returns: a PrimitiveType type
to_pyarrow
method descriptor
to_pyarrow() -> pyarrow.DataType
Get the equivalent PyArrow type (pyarrow.DataType)
deltalake.schema.ArrayType
ArrayType(
element_type: DataType, *, contains_null: bool = True
)
An Array (List) DataType
Can either pass the element type explicitly or can pass a string if it is a primitive type:
ArrayType(PrimitiveType("integer"))
# Returns ArrayType(PrimitiveType("integer"), contains_null=True)
ArrayType("integer", contains_null=False)
# Returns ArrayType(PrimitiveType("integer"), contains_null=False)
contains_null
contains_null: bool = <attribute 'contains_null' of 'deltalake._internal.ArrayType' objects>
Whether the arrays may contain null values
element_type
element_type: DataType = <attribute 'element_type' of 'deltalake._internal.ArrayType' objects>
The type of the element, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]
type
type: Literal['array'] = <attribute 'type' of 'deltalake._internal.ArrayType' objects>
The string "array"
from_json
staticmethod
from_json(type_json) -> ArrayType
Create an ArrayType from a JSON string
The JSON representation for an array type is an object with type (set to
"array"), elementType, and containsNull:
ArrayType.from_json(
'''{
"type": "array",
"elementType": "integer",
"containsNull": false
}'''
)
# Returns ArrayType(PrimitiveType("integer"), contains_null=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | A JSON string | required |
Returns: an ArrayType type
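Since `element_type` may itself be an array, map, or struct, the JSON for a nested array places another type object in `elementType`. The sketch below builds such strings with the standard `json` module. The helper `array_type` is hypothetical, and the nested layout is an assumption based on the flat example above.

```python
import json


def array_type(element_type, contains_null=True):
    """Hypothetical helper: build the dict form of an array type."""
    return {
        "type": "array",
        "elementType": element_type,
        "containsNull": contains_null,
    }


# A flat array of integers that may not contain nulls.
flat_json = json.dumps(array_type("integer", contains_null=False))

# An array whose elements are themselves arrays of strings.
nested_json = json.dumps(array_type(array_type("string")))

# Either string could then be passed to ArrayType.from_json(...).
print(flat_json)
print(nested_json)
```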
from_pyarrow
staticmethod
from_pyarrow(data_type) -> ArrayType
to_json
method descriptor
to_json() -> str
Get the JSON string representation of the type.
to_pyarrow
method descriptor
to_pyarrow() -> pyarrow.ListType
Get the equivalent PyArrow type.
deltalake.schema.MapType
MapType(
key_type: DataType,
value_type: DataType,
*,
value_contains_null: bool = True
)
A map data type
key_type and value_type should be PrimitiveType, ArrayType,
or StructType. A string can also be passed, which will be
parsed as a primitive type:
MapType(PrimitiveType("integer"), PrimitiveType("string"))
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
MapType("integer", "string", value_contains_null=False)
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=False)
key_type
key_type: DataType = <attribute 'key_type' of 'deltalake._internal.MapType' objects>
The type of the keys, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]
value_contains_null
value_contains_null: bool = <attribute 'value_contains_null' of 'deltalake._internal.MapType' objects>
Whether the values in a map may be null
value_type
value_type: DataType = <attribute 'value_type' of 'deltalake._internal.MapType' objects>
The type of the values, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]
from_json
staticmethod
from_json(type_json) -> MapType
Create a MapType from a JSON string
The JSON representation for a map type is an object with type (set to map),
keyType, valueType, and valueContainsNull:
MapType.from_json(
'''{
"type": "map",
"keyType": "integer",
"valueType": "string",
"valueContainsNull": true
}'''
)
# Returns MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | A JSON string | required |
Returns: a MapType type
from_pyarrow
staticmethod
from_pyarrow(data_type) -> MapType
to_json
method descriptor
to_json() -> str
Get JSON string representation of map type.
to_pyarrow
method descriptor
to_pyarrow() -> pyarrow.MapType
Get the equivalent PyArrow data type.
deltalake.schema.Field
Field(
name: str,
type: DataType,
*,
nullable: bool = True,
metadata: Optional[Dict[str, Any]] = None
)
A field in a Delta StructType or Schema
Can create with just a name and a type:
Field("my_int_col", "integer")
# Returns Field("my_int_col", PrimitiveType("integer"), nullable=True, metadata=None)
Can also attach metadata to the field. Metadata should be a dictionary with string keys and JSON-serializable values (str, list, int, float, dict):
Field("my_col", "integer", metadata={"custom_metadata": {"test": 2}})
# Returns Field("my_col", PrimitiveType("integer"), nullable=True, metadata={"custom_metadata": {"test": 2}})
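One way to check up front that a metadata dictionary meets the requirements above is to try serializing it with the standard `json` module. This is a sketch only: the helper name `is_json_serializable` is hypothetical, and the library may apply its own validation when the field is created.

```python
import json


def is_json_serializable(metadata: dict) -> bool:
    """Hypothetical check: string keys and values that json.dumps accepts."""
    if not all(isinstance(key, str) for key in metadata):
        return False
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        return False
    return True


# Nested dicts, strings, numbers, and lists are fine.
ok = is_json_serializable({"custom_metadata": {"test": 2}})

# A set cannot be serialized to JSON, so this metadata is rejected.
bad = is_json_serializable({"tags": {1, 2}})

print(ok, bad)
```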
metadata
metadata: Dict[str, Any] = <attribute 'metadata' of 'deltalake._internal.Field' objects>
The metadata of the field
name
name: str = <attribute 'name' of 'deltalake._internal.Field' objects>
The name of the field
nullable
nullable: bool = <attribute 'nullable' of 'deltalake._internal.Field' objects>
Whether there may be null values in the field
type
type: DataType = <attribute 'type' of 'deltalake._internal.Field' objects>
The type of the field, of type: Union[ PrimitiveType, ArrayType, MapType, StructType ]
from_json
staticmethod
from_json(field_json) -> Field
Create a Field from a JSON string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | the JSON string. | required |
Returns: Field
Example:
Field.from_json('''{
"name": "col",
"type": "integer",
"nullable": true,
"metadata": {}
}'''
)
# Returns Field(col, PrimitiveType("integer"), nullable=True)
from_pyarrow
staticmethod
from_pyarrow(field: pyarrow.Field) -> Field
to_json
method descriptor
to_json() -> str
Get the field as JSON string.
Field("col", "integer").to_json()
# Returns '{"name":"col","type":"integer","nullable":true,"metadata":{}}'
to_pyarrow
method descriptor
to_pyarrow() -> pyarrow.Field
Convert to an equivalent PyArrow field.
Note: This currently doesn't preserve field metadata.
Returns: a pyarrow.Field type
deltalake.schema.StructType
StructType(fields: List[Field])
A struct datatype, containing one or more subfields
Example:
Create with a list of Field objects:
StructType([Field("x", "integer"), Field("y", "string")])
# Creates: StructType([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
fields
fields: List[Field] = <attribute 'fields' of 'deltalake._internal.StructType' objects>
The fields within the struct
type
type: Literal['struct'] = <attribute 'type' of 'deltalake._internal.StructType' objects>
The string "struct"
from_json
staticmethod
from_json(type_json) -> StructType
Create a new StructType from a JSON string.
StructType.from_json(
'''{
"type": "struct",
"fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
}'''
)
# Returns StructType([Field(x, PrimitiveType("integer"), nullable=True)])
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | a JSON string | required |
Returns: a StructType type
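Because struct, array, and map types can nest, a schema's JSON is naturally processed recursively. The sketch below walks a parsed schema and renders it as a compact string, using only the JSON keys shown in the examples on this page. The helper `describe` and the `tags` field are hypothetical and not part of the library.

```python
import json


def describe(type_json) -> str:
    """Hypothetical helper: render a parsed Delta type as a compact string."""
    # Primitive types are plain strings such as "integer".
    if isinstance(type_json, str):
        return type_json
    kind = type_json["type"]
    if kind == "array":
        return f"array<{describe(type_json['elementType'])}>"
    if kind == "map":
        key = describe(type_json["keyType"])
        value = describe(type_json["valueType"])
        return f"map<{key}, {value}>"
    if kind == "struct":
        inner = ", ".join(
            f"{field['name']}: {describe(field['type'])}"
            for field in type_json["fields"]
        )
        return f"struct<{inner}>"
    raise ValueError(f"unknown type: {kind!r}")


schema_json = '''{
  "type": "struct",
  "fields": [
    {"name": "x", "type": "integer", "nullable": true, "metadata": {}},
    {"name": "tags",
     "type": {"type": "array", "elementType": "string", "containsNull": true},
     "nullable": true, "metadata": {}}
  ]
}'''

summary = describe(json.loads(schema_json))
print(summary)
```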
from_pyarrow
staticmethod
from_pyarrow(data_type) -> StructType
Create a new StructType from a PyArrow struct type.
Will raise TypeError if a different data type is provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| type | StructType | a PyArrow struct type. | required |
Returns: a StructType type
to_json
method descriptor
to_json() -> str
Get the JSON representation of the type.
StructType([Field("x", "integer")]).to_json()
# Returns '{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
to_pyarrow
method descriptor
to_pyarrow() -> pyarrow.StructType
Get the equivalent PyArrow StructType
Returns: a PyArrow StructType type
DataCatalog
Bases: Enum
List of the Data Catalogs
AWS
class-attribute
instance-attribute
AWS = 'glue'
Refers to the AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html)
UNITY
class-attribute
instance-attribute
UNITY = 'unity'
Refers to the Databricks Unity Catalog (https://docs.databricks.com/data-governance/unity-catalog/index.html)
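Because `DataCatalog` is an `Enum`, its members can be looked up by name or by value using standard enum behavior. The sketch below uses a stand-in class with the same two members so that it runs without `deltalake` installed. `CatalogSketch` is a hypothetical name, not the library class.

```python
from enum import Enum


class CatalogSketch(Enum):
    """Stand-in mirroring the two documented members; not the library class."""
    AWS = "glue"
    UNITY = "unity"


# Look a member up by its value, for example when reading a config string.
by_value = CatalogSketch("glue")

# Look a member up by its name.
by_name = CatalogSketch["UNITY"]

print(by_value.name, by_name.value)
```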
Delta Storage Handler
DeltaStorageHandler
DeltaStorageHandler(
root: str,
options: dict[str, str] | None = None,
known_sizes: dict[str, int] | None = None,
)
Bases: DeltaFileSystemHandler, FileSystemHandler
DeltaStorageHandler is a concrete implementation of a PyArrow FileSystemHandler.
get_file_info_selector
get_file_info_selector(
selector: FileSelector,
) -> List[FileInfo]
Get info for the files defined by FileSelector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selector | FileSelector | FileSelector object | required |
Returns:
| Type | Description |
|---|---|
| List[FileInfo] | list of file info objects |
open_input_file
open_input_file(path: str) -> pa.PythonFile
Open an input file for random access reading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The source to open for reading. | required |
Returns:
| Type | Description |
|---|---|
| PythonFile | NativeFile |
open_input_stream
open_input_stream(path: str) -> pa.PythonFile
Open an input stream for sequential reading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The source to open for reading. | required |
Returns:
| Type | Description |
|---|---|
| PythonFile | NativeFile |
open_output_stream
open_output_stream(
path: str, metadata: Optional[Dict[str, str]] = None
) -> pa.PythonFile
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The source to open for writing. | required |
| metadata | Optional[Dict[str, str]] | If not None, a mapping of string keys to string values. | None |
Returns:
| Type | Description |
|---|---|
| PythonFile | NativeFile |