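Saving and Loading DataFrames#
In this guide, you will learn how to save and load Woodwork DataFrames.
Saving a Woodwork DataFrame#
After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information with the DataFrame.ww.to_disk method. By default, this method creates a directory that contains a data folder and a woodwork_typing_info.json file, but you can specify different values if needed. For more information on the parameters that to_disk accepts, see the API Guide.
To illustrate, we will use this retail DataFrame, which already comes configured with Woodwork typing information.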
[1]:
from woodwork.demo import load_retail
df = load_retail(nrows=100)
df.ww.schema
[1]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
[2]:
df.head()
[2]:
order_product_id | order_id | product_id | description | quantity | order_date | unit_price | customer_name | country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.2075 | Andrea Brown | United Kingdom | 25.245 | False |
1 | 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
2 | 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 4.5375 | Andrea Brown | United Kingdom | 36.300 | False |
3 | 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
4 | 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
From the ww accessor, use to_disk to save the Woodwork DataFrame.
[3]:
df.ww.to_disk("retail")
You should see a new directory that contains the data and the typing information.
retail
├── data
│ └── demo_retail_data.csv
└── woodwork_typing_info.json
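Data Directory#
The data directory contains the underlying data, written in the specified format. If you do not specify a filename, the method derives one from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method's format parameter to any of the following:
csv (default)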
pickle
parquet
arrow
feather
orc
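Typing Information#
The woodwork_typing_info.json file holds all of the typing information and metadata associated with the DataFrame. This information includes:
the version of the schema at the time the DataFrame was saved
the DataFrame name specified by DataFrame.ww.name
the column names for the index and time index
the column typing information, which contains the logical type (with its parameters) and the semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)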
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
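Because the typing information is plain JSON, it can be inspected without Woodwork. The following is a minimal sketch using only the standard library. It assumes the file layout shown above, and `summarize_typing_info` is a hypothetical helper, not part of Woodwork. To stay self-contained, it first writes a trimmed copy of the sample to a temporary directory, with `column_typing_info` left empty because the sample above elides its contents.

```python
import json
import tempfile
from pathlib import Path

# Trimmed copy of the sample above; column_typing_info is left empty here
# because the sample elides its contents.
SAMPLE = {
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
    },
    "table_metadata": {},
}


def summarize_typing_info(directory, filename="woodwork_typing_info.json"):
    """Hypothetical helper: read the typing info file and return key fields."""
    with open(Path(directory) / filename, encoding="utf-8") as f:
        info = json.load(f)
    loading = info["loading_info"]
    return {
        "name": info["name"],
        "index": info["index"],
        "time_index": info["time_index"],
        "format": loading["type"],
        # The data location is stored relative to the saved directory.
        "data_path": str(Path(directory) / loading["location"]),
    }


with tempfile.TemporaryDirectory() as tmp:
    typing_file = Path(tmp) / "woodwork_typing_info.json"
    typing_file.write_text(json.dumps(SAMPLE), encoding="utf-8")
    summary = summarize_typing_info(tmp)

print(summary["name"], summary["format"])
```

Against a real saved directory, the same helper could be called as `summarize_typing_info("retail")`.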
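Loading a Woodwork DataFrame#
After saving a Woodwork DataFrame, you can load the DataFrame and its typing information with woodwork.deserialize.from_disk. This function recreates the Woodwork DataFrame from the typing information stored in the specified directory.
If you changed any of the defaults for the filename, the data subdirectory, or the typing information file, you need to pass those values to from_disk. This example did not change any defaults, so nothing extra needs to be specified here.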
[4]:
from woodwork.deserialize import from_disk
df = from_disk("retail")
df.ww.schema
[4]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
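Loading the DataFrame and Typing Information Separately#
You can also use woodwork.read_file to load a Woodwork DataFrame without loading the typing information. This approach is helpful if you want to get started quickly and let Woodwork infer the typing information from the underlying data. To illustrate, let's read the CSV file from the previous example directly into a Woodwork DataFrame.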
[5]:
from woodwork import read_file
df = read_file("retail/data/demo_retail_data.csv")
df.ww
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
[5]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | int64 | Integer | ['numeric'] |
order_id | int64 | Integer | ['numeric'] |
product_id | string | Unknown | [] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | [] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
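To see what inference changed, the two schema printouts can be compared column by column. The sketch below copies the logical types from the outputs in this guide into plain dictionaries, so it needs no Woodwork, and lists the columns whose inferred type differs from the saved one.

```python
# Logical types saved with to_disk (from the first schema output in this guide).
saved = {
    "order_product_id": "Categorical",
    "order_id": "Categorical",
    "product_id": "Categorical",
    "description": "NaturalLanguage",
    "quantity": "Integer",
    "order_date": "Datetime",
    "unit_price": "Double",
    "customer_name": "Categorical",
    "country": "Categorical",
    "total": "Double",
    "cancelled": "Boolean",
}

# Logical types inferred by read_file (from the output directly above).
inferred = dict(
    saved,
    order_product_id="Integer",
    order_id="Integer",
    product_id="Unknown",
)

# Keep only the columns where the two schemas disagree.
changed = {
    col: (saved[col], inferred[col])
    for col in saved
    if saved[col] != inferred[col]
}

for col, (before, after) in changed.items():
    print(f"{col}: {before} -> {after}")
```

The same outputs show that the index and time index are not recovered by inference: order_product_id no longer carries the index tag and order_date no longer carries the time_index tag.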
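Typing information is optional in read_file, so you can still pass typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.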
[6]:
typing_information = {
    "index": "order_product_id",
    "time_index": "order_date",
    "logical_types": {
        "order_product_id": "Categorical",
        "order_id": "Categorical",
        "product_id": "Categorical",
        "description": "NaturalLanguage",
        "quantity": "Integer",
        "order_date": "Datetime",
        "unit_price": "Double",
        "customer_name": "Categorical",
        "country": "Categorical",
        "total": "Double",
        "cancelled": "Boolean",
    },
    "semantic_tags": {
        "order_id": {"category"},
        "product_id": {"category"},
        "quantity": {"numeric"},
        "unit_price": {"numeric"},
        "customer_name": {"category"},
        "country": {"category"},
        "total": {"numeric"},
    },
}
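Before passing a dictionary like this to read_file, it can help to check that it is internally consistent. This is a minimal sketch with a hypothetical helper, `check_typing_information`, which is not a Woodwork function. It only verifies that the index, the time index, and every tagged column also appear under `logical_types`. Woodwork can infer a type for a column that is left out, so the check is stricter than required and is meant mainly to catch misspelled column names. The dictionary is shortened to three columns to keep the example brief.

```python
def check_typing_information(info):
    """Hypothetical helper: return a list of problems found in a typing dict."""
    typed = set(info.get("logical_types", {}))
    problems = []
    # The index and time index should name columns that have a logical type.
    for key in ("index", "time_index"):
        col = info.get(key)
        if col is not None and col not in typed:
            problems.append(f"{key} column {col!r} has no logical type")
    # Every column with semantic tags should also have a logical type.
    for col in info.get("semantic_tags", {}):
        if col not in typed:
            problems.append(f"tagged column {col!r} has no logical type")
    return problems


example = {
    "index": "order_product_id",
    "time_index": "order_date",
    "logical_types": {
        "order_product_id": "Categorical",
        "order_date": "Datetime",
        "quantity": "Integer",
    },
    # "total" is tagged but deliberately missing from logical_types.
    "semantic_tags": {"quantity": {"numeric"}, "total": {"numeric"}},
}

problems = check_typing_information(example)
print(problems)
```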
First, let's create the data files in different formats from a pandas DataFrame.
[7]:
import pandas as pd
pandas_df = pd.read_csv("retail/data/demo_retail_data.csv")
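# index=False keeps the pandas row index out of the CSV; otherwise it comes back as an "Unnamed: 0" column.
pandas_df.to_csv("retail/data.csv", index=False)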
pandas_df.to_parquet("retail/data.parquet")
pandas_df.to_feather("retail/data.feather")
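Now you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it tries to infer the file format from the file extension.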
[8]:
# Each call below reassigns woodwork_df, so only the Feather result is displayed.
woodwork_df = read_file(
    filepath="retail/data.csv",
    content_type="text/csv",
    **typing_information,
)
woodwork_df = read_file(
    filepath="retail/data.parquet",
    content_type="application/parquet",
    **typing_information,
)
woodwork_df = read_file(
    filepath="retail/data.feather",
    content_type="application/feather",
    **typing_information,
)
woodwork_df.ww
[8]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | category | Categorical | ['index'] |
order_id | category | Categorical | ['category'] |
product_id | category | Categorical | ['category'] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
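The extension-based fallback for content_type described earlier can be pictured as a simple lookup. The sketch below is illustrative only and is not Woodwork's implementation: `guess_content_type` is a hypothetical helper, and the mapping covers just the three content types used in this guide.

```python
from pathlib import Path

# Content types used in this guide, keyed by file extension.
CONTENT_TYPES = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Hypothetical helper: prefer an explicit content type, else use the extension."""
    if content_type is not None:
        return content_type
    suffix = Path(filepath).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"cannot infer content type from {filepath!r}") from None


print(guess_content_type("retail/data.parquet"))
```

Passing content_type explicitly, as the calls above do, sidesteps the lookup entirely, which matters when a file has a missing or misleading extension.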