A Survey of Task-Oriented Dialogue Datasets: MultiWOZ

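I have recently been working on task-oriented dialogue (TOD) for my graduation project, and I wrote a short introduction to the topic: An Introduction to Task-Oriented Dialogue Systems (任务型对话系统简介) - Only(AR)'s blog (onlyar.site). Since then I have spent several weeks surveying how large language models are applied to task-oriented dialogue ~~and dating~~, without any clear idea of where to start on hands-on work. I gradually realized that research in a new area should start from its datasets, so I decided to study the commonly used ones.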

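Most of the papers I had read use the MultiWOZ and SGD datasets, but a quick search showed that there are far more task-oriented dialogue datasets than these two¹. Time and energy are limited, though (the group meeting is the day after tomorrow), so I will start with the most commonly used ones.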

1 Overview

Common ways of constructing dialogue datasets fall into three categories: machine-to-machine, human-to-machine, and human-to-human.

1.1 machine-to-machine

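Both the user side and the system side of the dialogue are generated entirely from hand-designed templates. This guarantees coverage and diversity within a specific domain, but the generated data may not match real usage, and it ignores the noise that occurs in real conversations.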
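To make the template idea concrete, here is a minimal sketch. The templates, slot names, and values are all invented for illustration and do not come from any real dataset; the function name `generate_dialogue` is likewise hypothetical.

```python
import random

# Hypothetical templates and slot values, invented for illustration only.
USER_TEMPLATES = [
    "I am looking for a {pricerange} hotel in the {area}.",
    "Can you find me a {pricerange} place to stay in the {area}?",
]
SYSTEM_TEMPLATES = [
    "I found a {pricerange} hotel in the {area}. Would you like to book it?",
]
SLOT_VALUES = {
    "pricerange": ["cheap", "moderate", "expensive"],
    "area": ["north", "south", "centre"],
}


def generate_dialogue(rng: random.Random) -> dict:
    """Sample one slot assignment and render a user/system exchange from templates."""
    goal = {slot: rng.choice(values) for slot, values in SLOT_VALUES.items()}
    return {
        "goal": goal,
        "user": rng.choice(USER_TEMPLATES).format(**goal),
        "system": rng.choice(SYSTEM_TEMPLATES).format(**goal),
    }


dialogue = generate_dialogue(random.Random(0))
print(dialogue["user"])
print(dialogue["system"])
```

Every utterance is perfectly consistent with the sampled goal, which is exactly why such data is clean but lacks the noise of real conversations.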

1.2 human-to-machine

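The dataset is built from human-machine interaction: a human plays the user and an existing dialogue system plays the agent. Well-known datasets built this way include DSTC2 and DSTC3. The drawback is that the method requires a mature dialogue system for the target domain, so building a high-quality human-to-machine dataset for a brand-new domain without one is very difficult.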

1.3 human-to-human

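The dataset is built from real conversations between people, but collecting such data is very hard. Public data from the internet (for example Twitter or Reddit) is not readily usable for building task-oriented dialogue datasets, so the WOZ method is used instead, at the cost of a great deal of human labor. That process is described later in this post.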

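2 The MultiWOZ series

This part is mainly based on a Zhihu article², supplemented by my own reading of the related papers. The MultiWOZ repository is at: budzianowski/multiwoz: Source code for end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP) (github.com).

The repository lists four versions: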

The newest, corrected version of the dataset is available at MultiWOZ_2.2 thanks to the Google crew.
The new, corrected version of the dataset is available at MultiWOZ_2.1 thanks to the Amazon crew.
The dataset used in the EMNLP publication can be accessed at: MultiWOZ_2.0
The dataset used in the ACL publication can be accessed at: MultiWOZ_1.0

2.1 New WOZ

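There is actually no dataset named MultiWOZ 1.0. A little digging shows that the earliest MultiWOZ was the New WOZ dataset from the University of Cambridge³. WOZ is a data collection method⁴; the name stands for Wizard-of-Oz, after the novel The Wonderful Wizard of Oz (an odd name for a method...)⁵. In short, the main idea is to provide models with high-quality corpora collected from human-human conversations.

The New WOZ dataset consists of conversations between tourists and staff at a tourist information center: a visitor who has just arrived asks a staff member about nearby restaurants, attractions, accommodation, and so on. Earlier task-oriented dialogue datasets were either small or limited to a single domain. New WOZ was the first large-scale multi-domain dataset, and it was warmly received by researchers as soon as it was released.

Dataset statistics:

Size:
- single-domain: 2480, multi-domain: 7375
- test set: 1000, development set: 1000
Domains: 7
Slots: 27
Values: 663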

2.2 MultiWOZ 2.0

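Since New WOZ was well received, the Cambridge team reorganized the dataset and officially adopted the name MultiWOZ⁶; this release is also known as MultiWOZ 2.0. MultiWOZ stands for Multi-Domain Wizard-of-Oz dataset, which can be read as a cross-domain human-human dataset. Because earlier datasets lacked annotations, were small, and covered a single domain, the authors built MultiWOZ 2.0 through crowdsourcing.

Crowdsourcing is a way of solving problems with a large, distributed workforce: a large task is split up and handed to many contributors over the internet, and the results are aggregated at the end. It is widely used for data annotation, image recognition, translation, design, and so on.

Like New WOZ above, MultiWOZ consists of conversations between tourists and tourist information center staff. Its statistics are as follows:

Size:
- Total dialogues: 10438 (single-domain: 3406, multi-domain: 7032)
- Total turns: 115434
- test set: 1000, development set: 1000
Domains: 7 (Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train)
The last four domains include a booking sub-task
Each dialogue covers 1 to 5 domains

Once released, the dataset became a new benchmark for task-oriented dialogue. It supports evaluation on three tasks:

Dialogue State Tracking
Dialogue-Context-to-Text Generation, evaluated with:
1. Inform rate: whether the system provided an appropriate entity
2. Success rate: whether the system answered all requested attributes
3. BLEU: the fluency of the response
Dialogue-Act-to-Text Generation (from structured information to natural language)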

2.3 MultiWOZ 2.1

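After Cambridge released MultiWOZ 2.0, researchers using it gradually found problems. First, the dialogue state annotations and the dialogue utterances contain substantial noise, which hurts the performance of DST models. Second, follow-up work added further content to the original dataset, so multiple versions ended up coexisting. Amazon therefore reorganized the dataset and released MultiWOZ 2.1⁷.

The authors again used crowdsourcing to re-annotate MultiWOZ and summarized four common types of errors in the original dataset. They also added user dialogue act annotations and slot descriptions (useful for zero-shot experiments), and reported results of the then state-of-the-art methods as baselines.

Size:
- Total dialogues: 10433
- Total turns: 104330
- test set: 1000, validation set: 999

Each dialogue is structured roughly as follows: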

{
  "original dialog id": "SNG01856.json",
  "new dialog id": "MultiWOZ_2.1--train--1",
  "dialog index": 1,
  "original dialog info": ...,
  "log": [
    ...,
    {
      "turn id": 2,
      "user utterance": "no, i just need to make sure it's cheap. oh, and i need parking",
      "system response": "I found 1 cheap hotel for you that includes parking. Do you like me to book it?",
      "dialog history": "<USER> am looking for a place to to stay that has cheap price range it should be in a type of hotel <SYSTEM> Okay, do you have a specific area you want to stay in?",
      "original user side information": ...,
      "original system side information": ...,
      "dst": "hotel parking yes",
      "dst accumulated": "hotel parking yes , hotel pricerange cheap , hotel type hotel",
      "intent": "hotel-inform",
      "external knowledge": "",
      "external knowledge non-flat": ""
    }, ...
  ],
  "prompt": [
    "This is a bot helping users to find a hotel. Given the dialog context and external database, please generate a relevant system response for the user.", ...
  ],
  "external knowledge non-flat": ...,
  "external knowledge": ...,
  "dst knowledge": ...
}

Example of original user side information:

{
  "metadata": {},
  "dialog_act": {
    "Hotel-Inform": [
      [
        "Parking",
        "yes"
      ]
    ]
  },
  "span_info": []
}

Example of original system side information:

{
  "metadata": {
    "hotel": {
      "book": {"booked": [], "stay": "", ...},
      "semi": {"name": "not mentioned",  "area": "not mentioned", ...}
    }, ...
  },
  "dialog_act": {
    "Booking-Inform": [["none", "none"]],
    "Hotel-Inform": [["Price", "cheap"], ["Choice", "1"], ["Parking", "none"]]
  },
  "span_info": [
    ["Hotel-Inform", "Price", "cheap", 3, 3],
    ["Hotel-Inform", "Choice", "1", 2, 2]
  ]
}

Example of external knowledge non-flat:

{
  "metadata": {
    "hotel": [
      {
        "address": "back lane, cambourne",
        "area": "west",
        "internet": "yes",
        "parking": "yes",
        "id": "28",
        "location": [
          52.2213805555556,
          -0.0680333333333333
        ],
        "name": "the cambridge belfry",
        "phone": "01954714600",
        "postcode": "cb236bw",
        "price": {
          "double": "60",
          "single": "60"
        },
        "pricerange": "cheap",
        "stars": "4",
        "takesbookings": "yes",
        "type": "hotel"
      }, ...
    ]
  },
  "slots and values": {
    "hotel": {
      "day": ["wednesday|friday", "monday", ...],
      "people": ["4", "2", ...],
      "stay": ["4", "7", ...], ...
    }
  },
  "intents": {
    "hotel": ["Hotel-Inform", "Hotel-NoOffer", ...],
    "booking": ["Booking-Book",  "Booking-Inform", ...],
    "general": ["general-bye",  "general-greet", ...]
  }
}

external knowledge and dst knowledge are string representations of the information above.
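As a rough illustration of how such a record could be consumed, the sketch below loads a trimmed copy of the sample above and converts the flat `dst accumulated` string into a dictionary. The helper `parse_dst` is hypothetical, and it assumes that entries are separated by " , " and that the domain and slot are single tokens, which is what the sample suggests but is not guaranteed for the whole dataset.

```python
import json

# A trimmed-down record that mirrors the fields shown above; elided fields are left out.
record_json = """
{
  "new dialog id": "MultiWOZ_2.1--train--1",
  "log": [
    {
      "turn id": 2,
      "user utterance": "no, i just need to make sure it's cheap. oh, and i need parking",
      "dst": "hotel parking yes",
      "dst accumulated": "hotel parking yes , hotel pricerange cheap , hotel type hotel",
      "intent": "hotel-inform"
    }
  ]
}
"""


def parse_dst(flat: str) -> dict:
    """Turn 'domain slot value , domain slot value' into {(domain, slot): value}.

    Assumption: entries are separated by ' , ' and domain/slot are single tokens.
    """
    state = {}
    for entry in flat.split(" , "):
        entry = entry.strip()
        if not entry:
            continue
        # Split only twice so that multi-word values stay intact.
        domain, slot, value = entry.split(" ", 2)
        state[(domain, slot)] = value
    return state


record = json.loads(record_json)
for turn in record["log"]:
    print(turn["turn id"], parse_dst(turn["dst accumulated"]))
```

Keying the state by (domain, slot) keeps it directly comparable with ground-truth states during DST evaluation.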

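2.4 MultiWOZ 2.2

Later work found that MultiWOZ 2.1 still contained errors⁸, so a team at Google continued correcting the dataset and released the result as MultiWOZ 2.2⁹. Its main contributions are the following three:

Corrected annotation errors, inconsistencies, and wrong slot values in MultiWOZ 2.1;
Added slot span annotations to user and system utterances, and annotated each user utterance with the active user intents and requested slots;
Benchmarked several SOTA dialogue state tracking models on the corrected dataset.

In the original dataset, every slot is assumed to take its value from an enumerable candidate list, so multiple strings can denote the same meaning (for example 8pm and 20:00). Moreover, some slot values that appear in dialogues cannot be matched exactly against the database. Left uncorrected, these problems make model training much harder.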
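The "8pm versus 20:00" problem can be illustrated with a small normalizer. This is only a sketch of the general idea, not the procedure used by the MultiWOZ 2.2 authors; the function `normalize_time` and the set of formats it handles are assumptions made for this example.

```python
import re


def normalize_time(text: str) -> str:
    """Map a few common time spellings to 24-hour HH:MM; return the input unchanged otherwise."""
    s = text.strip().lower()

    # 12-hour forms such as "8pm", "8 pm", "8:30pm".
    m = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)", s)
    if m:
        hour, minute, suffix = int(m.group(1)), int(m.group(2) or 0), m.group(3)
        if not (1 <= hour <= 12 and minute < 60):
            return text
        hour = hour % 12 + (12 if suffix == "pm" else 0)
        return f"{hour:02d}:{minute:02d}"

    # 24-hour forms such as "9:05" or "20:00".
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", s)
    if m and int(m.group(1)) < 24 and int(m.group(2)) < 60:
        return f"{int(m.group(1)):02d}:{m.group(2)}"

    return text


for raw in ["8pm", "20:00", "8:30 pm", "after lunch"]:
    print(raw, "->", normalize_time(raw))
```

Strings that do not look like a time are passed through untouched, so the function can be applied to any slot value without damaging it.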

First, slots are divided into two classes, categorical (enumerable) and non-categorical:

Domain	Categorical slots	Non-categorical slots
Restaurant	pricerange, area, bookday, bookpeople	food, name, booktime
Attraction	area, type	name
Hotel	pricerange, parking, internet, stars, area, type, bookpeople, bookday, bookstay	name
Taxi	-	destination, departure, arriveby, leaveat
Train	destination, departure, day, bookpeople	arriveby, leaveat
Bus	day	departure, destination, leaveat
Hospital	-	department
Police	-	name

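Then all active intents in each utterance are annotated, for example when a user asks for a phone number and looks for a restaurant in the same sentence. Finally, the information that the user requests from the system is also annotated, which is valuable for training dialogue policy models.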
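The categorical / non-categorical split from the table can be written down as a small lookup structure. The sketch below copies the table contents into two dictionaries; the names `CATEGORICAL`, `NON_CATEGORICAL`, and `slot_kind` are hypothetical and are not part of the official schema files.

```python
# Slot classes copied from the table above (domain names lower-cased).
CATEGORICAL = {
    "restaurant": {"pricerange", "area", "bookday", "bookpeople"},
    "attraction": {"area", "type"},
    "hotel": {"pricerange", "parking", "internet", "stars", "area",
              "type", "bookpeople", "bookday", "bookstay"},
    "taxi": set(),
    "train": {"destination", "departure", "day", "bookpeople"},
    "bus": {"day"},
    "hospital": set(),
    "police": set(),
}
NON_CATEGORICAL = {
    "restaurant": {"food", "name", "booktime"},
    "attraction": {"name"},
    "hotel": {"name"},
    "taxi": {"destination", "departure", "arriveby", "leaveat"},
    "train": {"arriveby", "leaveat"},
    "bus": {"departure", "destination", "leaveat"},
    "hospital": {"department"},
    "police": {"name"},
}


def slot_kind(domain: str, slot: str) -> str:
    """Return 'categorical' or 'non-categorical'; raise KeyError for unknown slots."""
    if slot in CATEGORICAL[domain]:
        return "categorical"
    if slot in NON_CATEGORICAL[domain]:
        return "non-categorical"
    raise KeyError(f"unknown slot {domain}-{slot}")


print(slot_kind("hotel", "parking"))
print(slot_kind("taxi", "leaveat"))
```

A tracker could use such a lookup to decide whether to pick a value from a fixed candidate list or to extract it from the dialogue text.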

3 Evaluating DST

DST models are mainly evaluated with two metrics: Joint Goal Accuracy and Slot Accuracy¹⁰.

3.1 Joint Goal Accuracy

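At every turn of a dialogue, the output of the dialogue state tracker is compared with the human-annotated ground truth, which contains a value for every possible (domain, slot) pair. Joint goal accuracy is defined as the proportion of turns in which the value of every slot is predicted correctly. If a slot has not been mentioned yet, its ground-truth value is None, and slots whose value is None must be predicted as well. Joint goal accuracy is a strict metric: if even one slot in a turn is predicted incorrectly, the dialogue state for that turn counts as wrong. For a single turn, the score is therefore either 1 or 0.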
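A minimal sketch of this computation, assuming each dialogue state is a dict keyed by (domain, slot) and that an unmentioned slot may be either absent or explicitly None. The function names are hypothetical and this is not the official evaluation script.

```python
def active_slots(state: dict) -> dict:
    """Drop None values so an unmentioned slot and an explicit None compare equal."""
    return {key: value for key, value in state.items() if value is not None}


def joint_goal_accuracy(predictions: list, references: list) -> float:
    """Fraction of turns whose predicted state matches the reference on every slot."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same number of turns")
    if not references:
        return 0.0
    exact = sum(active_slots(p) == active_slots(r) for p, r in zip(predictions, references))
    return exact / len(references)


references = [
    {("hotel", "pricerange"): "cheap"},
    {("hotel", "pricerange"): "cheap", ("hotel", "parking"): "yes"},
]
predictions = [
    {("hotel", "pricerange"): "cheap", ("hotel", "parking"): None},  # matches turn 1
    {("hotel", "pricerange"): "cheap", ("hotel", "parking"): "no"},  # one wrong slot
]
print(joint_goal_accuracy(predictions, references))
```

The second turn gets one of its two slots right, yet it contributes nothing to the score, which is what makes the metric strict.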

3.2 Slot Accuracy

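Slot accuracy compares each (domain, slot, value) triple with its ground-truth annotation independently. Compared with joint goal accuracy it is finer-grained, but it is not suitable for judging the overall performance of a tracker: in each turn most slots have not been mentioned (their value is None), so slot accuracy is high even if every slot is predicted as None.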
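The weakness described above is easy to reproduce. The sketch below assumes states are dicts keyed by (domain, slot), with missing keys standing for None, and uses a made-up list of ten hotel slots; `slot_accuracy` is a hypothetical helper, not the official scorer.

```python
def slot_accuracy(predictions: list, references: list, all_slots: list) -> float:
    """Fraction of (turn, slot) pairs where the predicted value equals the reference value."""
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for slot in all_slots:
            total += 1
            # dict.get returns None for unmentioned slots on both sides.
            if pred.get(slot) == ref.get(slot):
                correct += 1
    return correct / total if total else 0.0


# Ten illustrative slots; only one of them is filled in the reference.
all_slots = [("hotel", name) for name in (
    "pricerange", "parking", "internet", "stars", "area",
    "type", "name", "bookday", "bookpeople", "bookstay",
)]
references = [{("hotel", "pricerange"): "cheap"}]
predictions = [{}]  # a tracker that predicts None for everything

print(slot_accuracy(predictions, references, all_slots))
```

The empty prediction gets 9 of the 10 slots right even though it captured nothing the user said, while the same turn would count as wrong under joint goal accuracy.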

3.3 DST baseline

The metric is JGA. The results of the various systems are listed below; most of the numbers come from budzianowski/multiwoz: Source code for end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP) (github.com):

System	MultiWOZ 2.0	MultiWOZ 2.1	MultiWOZ 2.2
MDBT (Ramadan et al., 2018)	15.57
GLAD (Zhong et al., 2018)	35.57
GCE (Nouri and Hosseini-Asl, 2018)	36.27
Neural Reading (Gao et al, 2019)	41.10
HyST (Goel et al, 2019)	44.24
SUMBT (Lee et al, 2019)	46.65
SGD-baseline (Rastogi et al, 2019)		43.4	42.0
TRADE (Wu et al, 2019)	48.62	46.0	45.4
COMER (Ren et al, 2019)	48.79
MERET (Huang et al, 2020)	50.91
In-Context Learning (Codex) (Hu et al. 2022)		50.65
DSTQA (Zhou et al, 2019)	51.44	51.17
SUMBT+LaRL (Lee et al. 2020)	51.52
DS-DST (Zhang et al, 2019)		51.2	51.7
LABES-S2S (Zhang et al, 2020)		51.45
DST-Picklist (Zhang et al, 2019)	54.39	53.3
MinTL-BART (Lin et al, 2020)	52.10	53.62
SST (Chen et al. 2020)		55.23
TripPy (Heck et al. 2020)		55.3
SimpleTOD (Hosseini-Asl et al. 2020)		56.45
PPTOD (Su et al. 2021)	53.89	57.45
ConvBERT-DG + Multi (Mehri et al. 2020)		58.7
PrefineDST (Cho et al. 2021)		58.9* (53.8)
SPACE-2 (He et al. 2022)		59.51
TripPy + SCoRe (Yu et al. 2021)		60.48
TripPy + CoCoAug (Li and Yavuz et al. 2020)		60.53
TripPy + SaCLog (Dai et al. 2021)		60.61
KAGE-GPT2 (Lin et al, 2021)	54.86
AG-DST (Tian et al. 2021)			57.26
SPACE-3 (He et al. 2022)			57.50
SDP-DST (Lee et al. 2021)		56.66	57.60
D3ST (Zhao et al. 2022)		57.80	58.70
DAIR (Huang et al. 2022)			59.98
TOATOD (Bang et al. 2023)		54.97	63.79

【多轮对话】任务型多轮对话数据集和采集方法 - 知乎 (zhihu.com)↩︎
任务型对话系统数据集详解大全（MultiWOZ/DSTC） - 知乎 (zhihu.com)↩︎
RAMADAN O, BUDZIANOWSKI P, GAŠIĆ M. Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing[C/OL]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia. 2018. http://dx.doi.org/10.18653/v1/p18-2069. DOI:10.18653/v1/p18-2069.↩︎
KELLEY J F. An iterative design methodology for user-friendly natural language office information applications[J/OL]. ACM Transactions on Information Systems, 1984, 2(1): 26-41. http://dx.doi.org/10.1145/357417.357420. DOI:10.1145/357417.357420.↩︎
ATIS、WOZ 系列等数据集所用到的 Wizard-of-OZ 方法究竟是什么梗？ - 知乎 (zhihu.com)↩︎
BUDZIANOWSKI P, WEN T H, TSENG B H, et al. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling[C/OL]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. 2018. http://dx.doi.org/10.18653/v1/d18-1547. DOI:10.18653/v1/d18-1547.↩︎
ERIC M, GOEL R, SAUSENG P, et al. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines[J]. arXiv: Computation and Language, 2019.↩︎
ZHANG J, HASHIMOTO K, WU C S, et al. Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking[J]. Joint Conference on Lexical and Computational Semantics, 2019.↩︎
ZANG X, RASTOGI A, CHEN J. MultiWOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines[C/OL]//Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online. 2020. http://dx.doi.org/10.18653/v1/2020.nlp4convai-1.13. DOI:10.18653/v1/2020.nlp4convai-1.13.↩︎
赛尔笔记|基于深度学习方法的对话状态跟踪综述 - 知乎 (zhihu.com)↩︎


https://onlyar.site/2023/12/12/NLP-TOD-Datasets/

Author

Only(AR)

Published

December 12, 2023
