Home
Woodwork is a library that helps with data typing of two-dimensional tabular data structures.
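It provides a special namespace on your DataFrame, ww, which holds the physical, logical, and semantic typing information for each column. It can be used with Featuretools, EvalML, and general machine learning applications that need logical and semantic typing information.
Woodwork provides simple interfaces for adding and updating logical and semantic typing information, and for selecting data columns based on their types.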
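To make the three layers of typing information concrete, here is a minimal standard-library sketch of the idea. It is not Woodwork's implementation: `ColumnTyping` and `select_columns` are hypothetical names invented for this illustration, and the matching rule (a column is kept if its logical type or any of its semantic tags is requested) is an assumption about how type-based selection can be modelled. The column values mirror a few rows of the retail demo schema shown in the Quick Start.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTyping:
    """Hypothetical record of the three typing layers for one column."""

    physical_type: str  # how the values are stored, e.g. "int64"
    logical_type: str  # what the values represent, e.g. "Integer"
    semantic_tags: set = field(default_factory=set)  # extra labels, e.g. {"numeric"}


def select_columns(schema, include):
    """Return the names of columns whose logical type or tags appear in `include`."""
    wanted = set(include)
    return [
        name
        for name, typing in schema.items()
        if typing.logical_type in wanted or typing.semantic_tags & wanted
    ]


# A toy schema, keyed by column name (insertion order is preserved).
schema = {
    "quantity": ColumnTyping("int64", "Integer", {"numeric"}),
    "unit_price": ColumnTyping("float64", "Double", {"numeric"}),
    "country": ColumnTyping("category", "Categorical", {"category"}),
    "cancelled": ColumnTyping("bool", "Boolean"),
}

selected = select_columns(schema, ["numeric", "Boolean"])
print(selected)
```

The point of the sketch is only that a single selection request can mix semantic tags ("numeric") with logical type names ("Boolean"), which is the pattern the Quick Start uses.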
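Quick Start
Below is an example of using Woodwork to automatically infer the logical types of a DataFrame and then select the columns that have specific types.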
[1]:
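import woodwork as ww

# Load the first 100 rows of the retail demo data without initializing Woodwork
df = ww.demo.load_retail(nrows=100, init_woodwork=False)
# Initialize Woodwork on the DataFrame; logical types are inferred automatically
df.ww.init(name="retail")
# Display the typing information for each column
df.ww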
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
[1]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| order_product_id | int64 | Integer | ['numeric'] |
| order_id | int64 | Integer | ['numeric'] |
| product_id | string | Unknown | [] |
| description | string | NaturalLanguage | [] |
| quantity | int64 | Integer | ['numeric'] |
| order_date | datetime64[ns] | Datetime | [] |
| unit_price | float64 | Double | ['numeric'] |
| customer_name | category | Categorical | ['category'] |
| country | category | Categorical | ['category'] |
| total | float64 | Double | ['numeric'] |
| cancelled | bool | Boolean | [] |
[2]:
# Keep columns that carry the "numeric" semantic tag or have the Boolean logical type
filtered_df = df.ww.select(include=["numeric", "Boolean"])
filtered_df.head(5)
[2]:
|   | order_product_id | order_id | quantity | unit_price | total | cancelled |
|---|---|---|---|---|---|---|
| 0 | 0 | 536365 | 6 | 4.2075 | 25.245 | False |
| 1 | 1 | 536365 | 6 | 5.5935 | 33.561 | False |
| 2 | 2 | 536365 | 8 | 4.5375 | 36.300 | False |
| 3 | 3 | 536365 | 6 | 5.5935 | 33.561 | False |
| 4 | 4 | 536365 | 6 | 5.5935 | 33.561 | False |