Saving and Loading DataFrames#

In this guide, you will learn how to save and load Woodwork DataFrames.

Saving a Woodwork DataFrame#

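Once you have defined a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame together with its typing information by using the DataFrame.ww.to_disk method. By default, this method creates a directory that contains a data folder and a woodwork_typing_info.json file, but you can specify different values if needed. For more information on the parameters that to_disk accepts, see the API guide.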

To illustrate, we will use this retail DataFrame, which already comes configured with Woodwork typing information.

[1]:

from woodwork.demo import load_retail

df = load_retail(nrows=100)
df.ww.schema

[1]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

[2]:

df.head()

[2]:

	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
0	0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False
1	1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
2	2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False
3	3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
4	4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False

From the ww accessor, use to_disk to save the Woodwork DataFrame.

[3]:

df.ww.to_disk("retail")

You should see a new directory that contains the data and the typing information.

retail
├── data
│   └── demo_retail_data.csv
└── woodwork_typing_info.json
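
As a quick sanity check, the layout above can be inspected with the standard library alone. The sketch below is only an illustration: the helper name describe_saved_dir is hypothetical and not part of Woodwork, and the directory it inspects is a stand-in built in a temporary folder with the same shape as the tree shown above.

```python
import tempfile
from pathlib import Path


def describe_saved_dir(
    path,
    data_subdirectory="data",
    typing_info_filename="woodwork_typing_info.json",
):
    """Summarize a directory that follows the default layout shown above.

    Hypothetical helper for illustration; it is not a Woodwork function.
    """
    root = Path(path)
    data_dir = root / data_subdirectory
    typing_file = root / typing_info_filename
    data_files = sorted(p.name for p in data_dir.iterdir()) if data_dir.is_dir() else []
    return {"has_typing_info": typing_file.is_file(), "data_files": data_files}


# Build a stand-in directory with the same shape as the tree above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "retail"
    (root / "data").mkdir(parents=True)
    (root / "data" / "demo_retail_data.csv").write_text("order_product_id\n0\n")
    (root / "woodwork_typing_info.json").write_text("{}")
    summary = describe_saved_dir(root)

print(summary)
```

On a real directory written by to_disk, calling describe_saved_dir("retail") should report the same two pieces, provided the default names were not changed.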

Data Directory#

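The data directory contains the underlying data written in the specified format. If you do not specify a filename, the method derives the filename from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method's format parameter to any of the following formats: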

csv (default)
pickle
parquet
arrow
feather
orc

Typing Information#

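In the woodwork_typing_info.json file, you can see all of the typing information and metadata associated with the DataFrame. This information includes:

the schema version used when the DataFrame was saved
the DataFrame name specified by DataFrame.ww.name
the column names for the index and the time index
the column typing information, which contains the logical type with its parameters and the semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)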

{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
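
Because the typing information is plain JSON, it can be read with the standard json module. The sketch below is a minimal illustration that parses a copy of the example above and resolves where the data file lives relative to the saved directory. It assumes the directory is named retail, as in this guide, and it uses an empty list as a placeholder for column_typing_info because those entries are elided in the example.

```python
import json
from pathlib import PurePosixPath

# A copy of the example above; column_typing_info is left empty here
# because its contents are elided in the guide.
typing_info_text = """
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
"""

typing_info = json.loads(typing_info_text)
loading_info = typing_info["loading_info"]

# The location is stored relative to the directory that was saved.
data_path = PurePosixPath("retail") / loading_info["location"]

print(typing_info["name"], typing_info["index"], typing_info["time_index"])
print(loading_info["type"], data_path)
```

Note that JSON null and false become None and False once parsed, which is why the table metadata must be JSON serializable in the first place.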

Loading a Woodwork DataFrame#

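After saving a Woodwork DataFrame, you can load the DataFrame and its typing information by using woodwork.deserialize.from_disk. This function recreates the Woodwork DataFrame from the typing information stored in the specified directory.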

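If you changed any of the default values for the filename, the data subdirectory, or the typing information file, you need to specify those values when calling from_disk. Because this example did not change any of the defaults, there is no need to specify them here.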

[4]:

from woodwork.deserialize import from_disk

df = from_disk("retail")
df.ww.schema

[4]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

Loading a DataFrame and Typing Information Separately#

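You can also use woodwork.read_file to load a Woodwork DataFrame without loading the typing information. This approach is helpful if you want to get started quickly and let Woodwork infer the typing information from the underlying data. To illustrate, let's read the CSV file from the previous example directly into a Woodwork DataFrame.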

[5]:

from woodwork import read_file

df = read_file("retail/data/demo_retail_data.csv")
df.ww

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	int64	Integer	['numeric']
order_id	int64	Integer	['numeric']
product_id	string	Unknown	[]
description	string	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]

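In read_file, the typing information is optional, so you can still pass typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame and apply this typing information.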

[6]:

typing_information = {
    "index": "order_product_id",
    "time_index": "order_date",
    "logical_types": {
        "order_product_id": "Categorical",
        "order_id": "Categorical",
        "product_id": "Categorical",
        "description": "NaturalLanguage",
        "quantity": "Integer",
        "order_date": "Datetime",
        "unit_price": "Double",
        "customer_name": "Categorical",
        "country": "Categorical",
        "total": "Double",
        "cancelled": "Boolean",
    },
    "semantic_tags": {
        "order_id": {"category"},
        "product_id": {"category"},
        "quantity": {"numeric"},
        "unit_price": {"numeric"},
        "customer_name": {"category"},
        "country": {"category"},
        "total": {"numeric"},
    },
}

First, let's create the data files in different formats from a pandas DataFrame.

[7]:

import pandas as pd

pandas_df = pd.read_csv("retail/data/demo_retail_data.csv")
# index=False keeps pandas from writing an extra unnamed index column to the CSV
pandas_df.to_csv("retail/data.csv", index=False)
pandas_df.to_parquet("retail/data.parquet")
pandas_df.to_feather("retail/data.feather")

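Now you can use read_file to load the data directly into a Woodwork DataFrame with your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it tries to infer the file format from the file extension.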

[8]:

woodwork_df = read_file(
    filepath="retail/data.csv",
    content_type="text/csv",
    **typing_information,
)

woodwork_df = read_file(
    filepath="retail/data.parquet",
    content_type="application/parquet",
    **typing_information,
)

woodwork_df = read_file(
    filepath="retail/data.feather",
    content_type="application/feather",
    **typing_information,
)

woodwork_df.ww

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	category	Categorical	['index']
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
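
The extension-based fallback for content_type described above can be pictured as a small lookup. The sketch below is not Woodwork's internal logic. The helper guess_content_type and its mapping are hypothetical and cover only the three content types used in this guide.

```python
from pathlib import Path

# Hypothetical mapping for illustration; it lists only the content types
# that appear in this guide.
EXTENSION_TO_CONTENT_TYPE = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Return the explicit content type, or infer one from the file extension."""
    if content_type is not None:
        # An explicit value always wins over the extension.
        return content_type
    suffix = Path(filepath).suffix.lower()
    if suffix not in EXTENSION_TO_CONTENT_TYPE:
        raise ValueError(f"cannot infer a content type from {filepath!r}")
    return EXTENSION_TO_CONTENT_TYPE[suffix]


for name in ("retail/data.csv", "retail/data.parquet", "retail/data.feather"):
    print(name, "->", guess_content_type(name))
```

Passing content_type explicitly, as the cells above do, avoids relying on the extension when a file has an unusual or missing suffix.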