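Saving and Loading DataFrames

In this guide, you will learn how to save and load Woodwork DataFrames.

Saving a Woodwork DataFrame

After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information with the DataFrame.ww.to_disk method. By default, this method creates a directory containing a data folder and a woodwork_typing_info.json file, but you can specify different values if needed. For more information on the parameters that to_disk accepts, see the API Guide.

To illustrate, we will use this retail DataFrame, which already comes configured with Woodwork typing information.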

[1]:
from woodwork.demo import load_retail

df = load_retail(nrows=100)
df.ww.schema
[1]:
                  Logical Type     Semantic Tag(s)
order_product_id  Categorical      ['index']
order_id          Categorical      ['category']
product_id        Categorical      ['category']
description       NaturalLanguage  []
quantity          Integer          ['numeric']
order_date        Datetime         ['time_index']
unit_price        Double           ['numeric']
customer_name     Categorical      ['category']
country           Categorical      ['category']
total             Double           ['numeric']
cancelled         Boolean          []
[2]:
df.head()
[2]:
   order_product_id  order_id  product_id  description                          quantity  order_date           unit_price  customer_name  country         total   cancelled
0  0                 536365    85123A      WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  4.2075      Andrea Brown   United Kingdom  25.245  False
1  1                 536365    71053       WHITE METAL LANTERN                  6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False
2  2                 536365    84406B      CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  4.5375      Andrea Brown   United Kingdom  36.300  False
3  3                 536365    84029G      KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False
4  4                 536365    84029E      RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False

From the ww accessor, use to_disk to save the Woodwork DataFrame.

[3]:
df.ww.to_disk("retail")

You should see a new directory that contains the data and typing information.

retail
├── data
│   └── demo_retail_data.csv
└── woodwork_typing_info.json
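
As a quick sanity check on the layout above, the following sketch uses only the Python standard library to confirm that a saved directory contains the two expected pieces. The helper name check_saved_layout is hypothetical and not part of Woodwork; its default arguments simply mirror the default names described in this guide.

```python
from pathlib import Path


def check_saved_layout(
    root,
    data_subdirectory="data",
    typing_info_filename="woodwork_typing_info.json",
):
    """Return the sorted data file names under root.

    Hypothetical helper (not a Woodwork API). Raises FileNotFoundError
    if the typing information file or the data folder is missing.
    """
    root = Path(root)
    typing_info = root / typing_info_filename
    data_dir = root / data_subdirectory

    if not typing_info.is_file():
        raise FileNotFoundError(f"Missing typing information file: {typing_info}")
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Missing data folder: {data_dir}")

    # Only list regular files directly inside the data folder
    return sorted(p.name for p in data_dir.iterdir() if p.is_file())
```

For the directory created in this guide, calling check_saved_layout("retail") would be expected to list the single CSV file shown in the tree.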

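Data Directory

The data directory contains the underlying data, written in the specified format. If you do not specify a filename, the method derives one from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method's format parameter to any of the following:

  • csv (default)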

  • pickle

  • parquet

  • arrow

  • feather

  • orc
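
The naming rule described above can be sketched in plain Python. This is an illustration, not Woodwork's implementation: the helper default_data_filename is hypothetical, and it assumes the file extension simply matches the format name. The CSV example in this guide (demo_retail_data.csv) is consistent with that assumption, but the guide does not confirm it for the other formats.

```python
# Formats listed in this guide for the format parameter of to_disk
SUPPORTED_FORMATS = ("csv", "pickle", "parquet", "arrow", "feather", "orc")


def default_data_filename(name, fmt="csv"):
    """Build a data file name from a DataFrame name and a format.

    Hypothetical helper: assumes the extension equals the format name.
    """
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {fmt!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return f"{name}.{fmt}"


print(default_data_filename("demo_retail_data"))  # demo_retail_data.csv
```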

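Typing Information

In the woodwork_typing_info.json file, you can see all of the typing information and metadata associated with the DataFrame. This information includes:

  • the version of the schema used when the DataFrame was saved

  • the DataFrame name specified by DataFrame.ww.name

  • the column names for the index and time index

  • the column typing information, which contains the logical type (with its parameters) and the semantic tags for each column

  • the loading information required for the DataFrame type and file format

  • the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)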

{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
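
Because the typing information is plain JSON, it can be inspected with the standard library alone. The sketch below parses a trimmed copy of the example above (with column_typing_info left empty for brevity) and resolves where the data file lives relative to the saved directory. The helper locate_data_file is hypothetical and only reads the keys shown in this guide.

```python
import json
from pathlib import Path

# Trimmed copy of the example file above; column_typing_info is left empty
SAMPLE_TYPING_INFO = """
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {"compression": null, "sep": ",", "encoding": "utf-8",
                   "engine": "python", "index": false}
    },
    "table_metadata": {}
}
"""


def locate_data_file(typing_info, base_dir):
    """Return (path to the data file, file format) from parsed typing info."""
    loading_info = typing_info["loading_info"]
    return Path(base_dir) / loading_info["location"], loading_info["type"]


typing_info = json.loads(SAMPLE_TYPING_INFO)
data_path, data_format = locate_data_file(typing_info, "retail")
print(data_path.as_posix(), data_format)  # retail/data/demo_retail_data.csv csv
```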

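Loading a Woodwork DataFrame

After saving a Woodwork DataFrame, you can load the DataFrame and typing information with woodwork.deserialize.from_disk. This function recreates the Woodwork DataFrame using the typing information stored in the specified directory.

If you changed any of the defaults for the filename, data subdirectory, or typing information file, you need to specify those values when calling from_disk. Because this example does not change any defaults, there is no need to specify them here.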

[4]:
from woodwork.deserialize import from_disk

df = from_disk("retail")
df.ww.schema
[4]:
                  Logical Type     Semantic Tag(s)
order_product_id  Categorical      ['index']
order_id          Categorical      ['category']
product_id        Categorical      ['category']
description       NaturalLanguage  []
quantity          Integer          ['numeric']
order_date        Datetime         ['time_index']
unit_price        Double           ['numeric']
customer_name     Categorical      ['category']
country           Categorical      ['category']
total             Double           ['numeric']
cancelled         Boolean          []

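Loading a DataFrame and Typing Information Separately

You can also use woodwork.read_file to load a Woodwork DataFrame without loading the typing information. This approach is helpful if you want to get started quickly and let Woodwork infer the typing information from the underlying data. To illustrate, let's read the CSV file from the previous example directly into a Woodwork DataFrame.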

[5]:
from woodwork import read_file

df = read_file("retail/data/demo_retail_data.csv")
df.ww
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
[5]:
                  Physical Type   Logical Type     Semantic Tag(s)
order_product_id  int64           Integer          ['numeric']
order_id          int64           Integer          ['numeric']
product_id        string          Unknown          []
description       string          NaturalLanguage  []
quantity          int64           Integer          ['numeric']
order_date        datetime64[ns]  Datetime         []
unit_price        float64         Double           ['numeric']
customer_name     category        Categorical      ['category']
country           category        Categorical      ['category']
total             float64         Double           ['numeric']
cancelled         bool            Boolean          []

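In read_file, the typing information is optional, so you can still pass typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into Woodwork DataFrames using this typing information.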

[6]:
typing_information = {
    "index": "order_product_id",
    "time_index": "order_date",
    "logical_types": {
        "order_product_id": "Categorical",
        "order_id": "Categorical",
        "product_id": "Categorical",
        "description": "NaturalLanguage",
        "quantity": "Integer",
        "order_date": "Datetime",
        "unit_price": "Double",
        "customer_name": "Categorical",
        "country": "Categorical",
        "total": "Double",
        "cancelled": "Boolean",
    },
    "semantic_tags": {
        "order_id": {"category"},
        "product_id": {"category"},
        "quantity": {"numeric"},
        "unit_price": {"numeric"},
        "customer_name": {"category"},
        "country": {"category"},
        "total": {"numeric"},
    },
}

First, let's create the data files in different formats from a pandas DataFrame.

[7]:
import pandas as pd

pandas_df = pd.read_csv("retail/data/demo_retail_data.csv")
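# index=False keeps the pandas row index from being written as an extra unnamed column
pandas_df.to_csv("retail/data.csv", index=False)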
pandas_df.to_parquet("retail/data.parquet")
pandas_df.to_feather("retail/data.feather")

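Now you can use read_file to load the data directly into Woodwork DataFrames based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it tries to infer the file format from the file extension.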

[8]:
woodwork_df = read_file(
    filepath="retail/data.csv",
    content_type="text/csv",
    **typing_information,
)

woodwork_df = read_file(
    filepath="retail/data.parquet",
    content_type="application/parquet",
    **typing_information,
)

woodwork_df = read_file(
    filepath="retail/data.feather",
    content_type="application/feather",
    **typing_information,
)

woodwork_df.ww
[8]:
                  Physical Type   Logical Type     Semantic Tag(s)
order_product_id  category        Categorical      ['index']
order_id          category        Categorical      ['category']
product_id        category        Categorical      ['category']
description       string          NaturalLanguage  []
quantity          int64           Integer          ['numeric']
order_date        datetime64[ns]  Datetime         ['time_index']
unit_price        float64         Double           ['numeric']
customer_name     category        Categorical      ['category']
country           category        Categorical      ['category']
total             float64         Double           ['numeric']
cancelled         bool            Boolean          []
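
The extension-based fallback mentioned earlier, where read_file infers the format when content_type is omitted, could look roughly like the sketch below. This is not Woodwork's actual implementation: guess_content_type and its mapping are hypothetical, and the mapping covers only the three content types used in this guide.

```python
from pathlib import Path

# Only the content types that appear in this guide; the real list may be longer
EXTENSION_TO_CONTENT_TYPE = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Return the explicit content type, or infer one from the extension."""
    if content_type is not None:
        # An explicit value always wins over the file extension
        return content_type

    suffix = Path(filepath).suffix.lower()
    if suffix not in EXTENSION_TO_CONTENT_TYPE:
        raise ValueError(f"Cannot infer a content type from {filepath!r}")
    return EXTENSION_TO_CONTENT_TYPE[suffix]


print(guess_content_type("retail/data.parquet"))  # application/parquet
```

Passing content_type explicitly, as the cell above does, avoids depending on how files happen to be named.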