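Get Started

In this guide, you will walk through examples of initializing Woodwork on a DataFrame and on a Series. Along the way, you will learn how to update and remove logical types and semantic tags. You will also learn how to use typing information to select subsets of data.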

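Types and Tags

Woodwork relies heavily on the concepts of physical types, logical types, and semantic tags. These concepts are covered in detail in Working with Types and Tags, but brief definitions are provided here for reference:

Physical type: defines how the data is stored on disk or in memory.
Logical type: defines how the data should be parsed or interpreted.
Semantic tag(s): provide additional information about the meaning of the data or how it should be used.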

Start learning how to use Woodwork by reading in a DataFrame that contains retail sales data.

[1]:

import pandas as pd

df = pd.read_csv("https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv")
df["order_product_id"] = range(df.shape[0])
df.head(5)

[1]:

	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled	order_product_id
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False	0
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	1
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False	2
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	3
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	4

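As you can see, this is a DataFrame containing several different data types, including dates, categorical values, numeric values, and natural language descriptions. Next, initialize Woodwork on this DataFrame.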

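Initializing Woodwork on a DataFrame

Importing Woodwork creates a special namespace on your DataFrames, DataFrame.ww, that can be used to set or update the typing information for the DataFrame. As long as Woodwork has been imported, initializing Woodwork on a DataFrame is as simple as calling .ww.init() on the DataFrame of interest. An optional name parameter can be specified to label the data.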

[2]:

import woodwork as ww

df.ww.init(name="retail")
df.ww

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(

[2]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	int64	Integer	['numeric']

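Using just this simple call, Woodwork was able to infer the logical types present in the data by analyzing the DataFrame dtypes as well as the information contained in the columns. In addition, Woodwork added semantic tags to some of the columns based on the logical types that were inferred.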

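Warning

Woodwork uses a weak reference to maintain the link from the accessor to the DataFrame. Because of this, chaining a Woodwork call onto another call that creates a new DataFrame or Series object can be problematic.

Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the DataFrame in a new variable and then initialize Woodwork: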

df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
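
To see why a weak reference makes chained calls fragile, consider the toy sketch below. It uses only the standard library, and the `Frame` and `Accessor` classes are hypothetical stand-ins, not Woodwork's actual implementation. The point is only that a weak reference does not keep its target alive.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame."""


class Accessor:
    """Hypothetical accessor that only weakly references its frame."""

    def __init__(self, frame):
        self._frame_ref = weakref.ref(frame)

    def frame(self):
        # Returns None once the referenced frame has been garbage collected.
        return self._frame_ref()


# Chained call: nothing else holds the temporary Frame, so it can be collected.
chained = Accessor(Frame())
gc.collect()

# Stored first: the variable keeps the Frame alive for the accessor.
kept = Frame()
stored = Accessor(kept)

print(chained.frame())         # None
print(stored.frame() is kept)  # True
```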

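All Woodwork methods and properties can be accessed through the ww namespace on the DataFrame. DataFrame methods called from the Woodwork namespace are passed to the DataFrame, and whenever possible, Woodwork is initialized on the returned object, provided it is a Series or a DataFrame.

As an example, use the head method to create a new DataFrame containing the first 5 rows of the original data, with Woodwork typing information retained.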

[3]:

head_df = df.ww.head(5)
head_df.ww

[3]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	int64	Integer	['numeric']

[4]:

head_df

[4]:

	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled	order_product_id
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False	0
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	1
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False	2
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	3
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	4

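Note

Once Woodwork is initialized on a DataFrame, it is recommended to go through the ww namespace when performing DataFrame operations to avoid invalidating Woodwork's typing information.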

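Updating Logical Types

If the initial inference does not meet your needs, you can change a logical type to a more appropriate value. To demonstrate the process, change a few columns to different logical types. In this case, set the logical type of the order_product_id and country columns to Categorical, and set the logical type of customer_name to PersonFullName.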

[5]:

df.ww.set_types(
    logical_types={
        "customer_name": "PersonFullName",
        "country": "Categorical",
        "order_product_id": "Categorical",
    }
)
df.ww.types

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

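Inspect the information in the types output. You can see that the logical types for the three columns now match the logical types you specified.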

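Selecting Columns

Now that you have prepared the logical types, you can select a subset of the columns based on their logical types. Select only the columns that have a logical type of Integer or Double.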

[6]:

numeric_df = df.ww.select(["Integer", "Double"])
numeric_df.ww

[6]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
quantity	int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
total	float64	Double	['numeric']

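This selection process returns a new Woodwork DataFrame containing only the columns that match the logical types you specified. After you have selected the columns you want, you can use the DataFrame containing just those columns as you normally would for any additional analysis.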

[7]:

numeric_df

[7]:

	quantity	unit_price	total
0	6	4.2075	25.2450
1	6	5.5935	33.5610
2	8	4.5375	36.3000
3	6	5.5935	33.5610
4	6	5.5935	33.5610
...	...	...	...
401599	12	1.4025	16.8300
401600	6	3.4650	20.7900
401601	4	6.8475	27.3900
401602	4	6.8475	27.3900
401603	3	8.1675	24.5025

401604 rows × 3 columns

Adding Semantic Tags

Next, add semantic tags to some of the columns. Add the tag product_details to the description column, and tag the total column with currency.

[8]:

df.ww.set_types(semantic_tags={"description": "product_details", "total": "currency"})
df.ww

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Select columns based on a semantic tag. Select only the columns tagged with category.

[9]:

category_df = df.ww.select("category")
category_df.ww

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
country	category	Categorical	['category']
order_product_id	category	Categorical	['category']

Select columns using multiple semantic tags or a mixture of semantic tags and logical types.

[10]:

category_numeric_df = df.ww.select(["numeric", "category"])
category_numeric_df.ww

[10]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
quantity	int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
order_product_id	category	Categorical	['category']

[11]:

mixed_df = df.ww.select(["Boolean", "product_details"])
mixed_df.ww

[11]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
description	category	Categorical	['category', 'product_details']
cancelled	bool	Boolean	[]

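To select an individual column, specify the column name. Woodwork is initialized on the returned Series, and you can use the Series for additional analysis as needed.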

[12]:

total = df.ww["total"]
total.ww

[12]:

<Series: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'currency', 'numeric'})>

[13]:

total

[13]:

0         25.2450
1         33.5610
2         36.3000
3         33.5610
4         33.5610
           ...
401599    16.8300
401600    20.7900
401601    27.3900
401602    27.3900
401603    24.5025
Name: total, Length: 401604, dtype: float64

Select multiple columns by supplying a list of column names.

[14]:

multiple_cols_df = df.ww[["product_id", "total", "unit_price"]]
multiple_cols_df.ww

[14]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
product_id	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
unit_price	float64	Double	['numeric']

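Removing Semantic Tags

You can remove specific semantic tags from a column if they are no longer needed. In this example, remove the product_details tag from the description column.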

[15]:

df.ww.remove_semantic_tags({"description": "product_details"})
df.ww

[15]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Notice that the product_details tag has been removed from the description column. If you want to remove all user-added semantic tags from all columns, you can do that, too.

[16]:

df.ww.reset_semantic_tags()
df.ww

[16]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

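Setting Index and Time Index

At any point, you can designate certain columns as the Woodwork index or time_index with the set_index and set_time_index methods. These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.

Index and time index columns carry the index and time_index semantic tags, respectively.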

[17]:

df.ww.set_index("order_product_id")
df.ww.index

[17]:

'order_product_id'

[18]:

df.ww.set_time_index("order_date")
df.ww.time_index

[18]:

'order_date'

[19]:

df.ww

[19]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['index']

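Using Woodwork with a Series

Woodwork can also be used to store typing information on a Series. There are two approaches to initializing Woodwork on a Series, depending on whether the Series dtype is the same as the physical type associated with the LogicalType. For more information on logical types and physical types, refer to Working with Types and Tags.

If your Series dtype matches the physical type associated with the specified or inferred LogicalType, Woodwork can be initialized through the ww namespace, just as with DataFrames.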

[20]:

series = pd.Series([1, 2, 3], dtype="int64")
series.ww.init(logical_type="Integer")
series.ww

[20]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>

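In the example above, the Integer LogicalType was specified for the Series. Because Integer has a physical type of int64, which matches the dtype used to create the Series, no Series dtype conversion was needed and the initialization succeeded.

In cases where the LogicalType requires the Series dtype to change, a helper function, ww.init_series, must be used. This function returns a new Series object with Woodwork initialized and the dtype of the Series changed to match the physical type of the LogicalType.

To demonstrate this case, first create a Series with a string dtype. Then, initialize a Woodwork Series with a Categorical logical type using the init_series function. Because Categorical uses a physical type of category, the dtype of the Series must be changed, which is why the init_series function is required here.

The returned Series has Woodwork initialized, with the LogicalType set to Categorical as expected and the dtype set to category.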

[21]:

string_series = pd.Series(["a", "b", "a"], dtype="string")
ww_series = ww.init_series(string_series, logical_type="Categorical")
ww_series.ww

[21]:

<Series: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>

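As with DataFrames, Woodwork provides several methods that can be used to update or change the typing information associated with a Series. As an example, add a new semantic tag to the Series.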

[22]:

series.ww.add_semantic_tags("new_tag")
series.ww

[22]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_tag', 'numeric'})>

As you can see from the output above, the specified tag has been added to the semantic tags for the Series.

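You can also access Series properties and methods through the Woodwork namespace. When possible, Woodwork typing information is retained on the value returned. As an example, you can access the Series shape property through Woodwork.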

[23]:

series.ww.shape

[23]:

(3,)

You can also call Series methods such as sample. In this case, Woodwork typing information is retained on the Series returned by the sample method.

[24]:

sample_series = series.ww.sample(2)
sample_series.ww

[24]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric', 'new_tag'})>

[25]:

sample_series

[25]:

1    2
2    3
dtype: int64

List Logical Types

Retrieve all the Logical Types present in Woodwork. This can be useful for understanding the Logical Types as well as how they are interpreted.

[26]:

from woodwork.type_sys.utils import list_logical_types

list_logical_types()

[26]:

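	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Address	address	Represents Logical Types that contain address...	string	{}	True	True	None
1	Age	age	Represents Logical Types that contain whole numbers...	int64	{numeric}	True	True	Integer
2	AgeFractional	age_fractional	Represents Logical Types that contain non-negative...	float64	{numeric}	True	True	Double
3	AgeNullable	age_nullable	Represents Logical Types that contain whole numbers...	Int64	{numeric}	True	True	IntegerNullable
4	Boolean	boolean	Represents Logical Types that contain binary values...	bool	{}	True	True	BooleanNullable
5	BooleanNullable	boolean_nullable	Represents Logical Types that contain binary values...	boolean	{}	True	True	None
6	Categorical	categorical	Represents Logical Types that contain unordered...	category	{category}	True	True	None
7	CountryCode	country_code	Represents Logical Types that use ISO-3166 codes...	category	{category}	True	True	Categorical
8	CurrencyCode	currency_code	Represents Logical Types that use ISO-4217 codes...	category	{category}	True	True	Categorical
9	Datetime	datetime	Represents Logical Types that contain date and time...	datetime64[ns]	{}	True	True	None
10	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
11	EmailAddress	email_address	Represents Logical Types that contain email addresses...	string	{}	True	True	Unknown
12	Filepath	filepath	Represents Logical Types that specify locations...	string	{}	True	True	None
13	IPAddress	ip_address	Represents Logical Types that contain IP addresses...	string	{}	True	True	Unknown
14	Integer	integer	Represents Logical Types that contain positive...	int64	{numeric}	True	True	IntegerNullable
15	IntegerNullable	integer_nullable	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
16	LatLong	lat_long	Represents Logical Types that contain latitude and longitude values...	object	{}	True	True	None
17	NaturalLanguage	natural_language	Represents Logical Types that contain text or natural language...	string	{}	True	True	None
18	Ordinal	ordinal	Represents Logical Types that contain ordered values...	category	{category}	True	True	Categorical
19	PersonFullName	person_full_name	Represents Logical Types that may contain first or full names...	string	{}	True	True	None
20	PhoneNumber	phone_number	Represents Logical Types that contain numeric phone numbers...	string	{}	True	True	Unknown
21	PostalCode	postal_code	Represents Logical Types that contain a series of characters or digits...	category	{category}	True	True	Categorical
22	SubRegionCode	sub_region_code	Represents Logical Types that use ISO-3166 codes...	category	{category}	True	True	Categorical
23	Timedelta	timedelta	Represents Logical Types that contain values representing time intervals...	timedelta64[ns]	{}	True	True	Unknown
24	URL	url	Represents Logical Types that contain URLs, i.e. ...	string	{}	True	True	Unknown
25	Unknown	unknown	Represents Logical Types that cannot be inferred...	string	{}	True	True	None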