Home

Woodwork

Woodwork is a library that helps with data typing of 2-dimensional tabular data structures.

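It provides a special namespace on a DataFrame, ww, which holds the physical, logical, and semantic data types. It can be used with Featuretools, EvalML, and general machine learning applications that need logical and semantic typing information.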

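Woodwork provides simple interfaces for adding and updating logical and semantic typing information, and for selecting data columns based on their types.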

Quick Start

Below is an example of using Woodwork to automatically infer the logical types for a DataFrame and then select the columns that have specific types.

[1]:
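import woodwork as ww

# Load the first 100 rows of the retail demo data as a plain DataFrame
df = ww.demo.load_retail(nrows=100, init_woodwork=False)

# Infer logical types and semantic tags, then display the typing information
df.ww.init(name="retail")
df.ww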
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
[1]:
Column            Physical Type   Logical Type     Semantic Tag(s)
order_product_id  int64           Integer          ['numeric']
order_id          int64           Integer          ['numeric']
product_id        string          Unknown          []
description       string          NaturalLanguage  []
quantity          int64           Integer          ['numeric']
order_date        datetime64[ns]  Datetime         []
unit_price        float64         Double           ['numeric']
customer_name     category        Categorical      ['category']
country           category        Categorical      ['category']
total             float64         Double           ['numeric']
cancelled         bool            Boolean          []
[2]:
# Keep columns tagged "numeric" plus columns whose logical type is Boolean
filtered_df = df.ww.select(include=["numeric", "Boolean"])
filtered_df.head(5)
[2]:
   order_product_id  order_id  quantity  unit_price   total  cancelled
0                 0    536365         6      4.2075  25.245      False
1                 1    536365         6      5.5935  33.561      False
2                 2    536365         8      4.5375  36.300      False
3                 3    536365         6      5.5935  33.561      False
4                 4    536365         6      5.5935  33.561      False
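To see why this selection returns both the numeric columns and `cancelled`, it may help to model the typing information directly. The sketch below is a simplified, standard-library-only illustration and not Woodwork's implementation: the `ColumnTyping` class and `select_columns` function are hypothetical names, and the matching rule (a column is kept when its logical type name or any of its semantic tags appears in `include`) is an assumption that mirrors the result shown above.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTyping:
    """Hypothetical record of the three kinds of typing information for one column."""

    physical_type: str  # storage dtype, e.g. "int64"
    logical_type: str  # how the values should be interpreted, e.g. "Integer"
    semantic_tags: set = field(default_factory=set)  # extra meaning, e.g. {"numeric"}


# A few columns copied from the typing table above.
schema = {
    "quantity": ColumnTyping("int64", "Integer", {"numeric"}),
    "country": ColumnTyping("category", "Categorical", {"category"}),
    "cancelled": ColumnTyping("bool", "Boolean"),
}


def select_columns(schema, include):
    """Return names of columns whose logical type or semantic tags match `include`."""
    wanted = set(include)
    return [
        name
        for name, col in schema.items()
        if col.logical_type in wanted or col.semantic_tags & wanted
    ]


selected = select_columns(schema, ["numeric", "Boolean"])
print(selected)  # ['quantity', 'cancelled']
```

Here `quantity` matches through its `numeric` semantic tag, `cancelled` matches through its `Boolean` logical type, and `country` matches neither.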

Table of Contents