Introduction

What is Structured Prediction?

The ultimate objective of discriminative learning is to train a system to optimize a desired measure of performance. In binary classification we are interested in finding a function that assigns a binary label to a single object and minimizes the error rate (the fraction of incorrect predictions) on unseen data.

In structured prediction the input is a complex object (a spoken utterance, an image, a phrase, a graph, etc.), and we are interested in the prediction of a structured label (a sequence of words, a bounding box, a parsing tree, etc.). Typically, each structured prediction task has its own measure of performance or evaluation metric, such as word error rate in speech recognition, the BLEU score in machine translation or the intersection-over-union score in object segmentation.

Example In the problem of vowel duration measurement, our goal is to predict the start and end times of a vowel phoneme in a spoken word. One approach is simply to train a binary classifier at the frame level to predict, for each frame, whether it is a vowel or not. This approach does not take into consideration the internal structure of the desired label and the relations between its components, such as the typical vowel duration or its temporal and spectral characteristics. In contrast, the structured prediction approach treats the desired label as one piece and defines a set of feature maps that capture the relations between the target's internal components. A typical feature can be the presumed duration of the vowel relative to the average vowel duration. We provide a very detailed tutorial on vowel duration measurement, which can be found here.
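As a toy illustration of such a feature map, the sketch below scores a candidate segment $y = (\text{start}, \text{end})$ as one piece. Everything here is a hypothetical example, not part of the tutorial: the frame rate, the average vowel duration, and the particular features are all assumptions.

```python
import numpy as np

# Hypothetical feature map for vowel duration measurement: the label
# y = (start, end) is scored as one piece, not frame by frame.
AVG_VOWEL_DURATION = 0.08  # assumed average vowel duration, in seconds

def vowel_features(frames, y):
    """Map an utterance (frames: T x D array) and a candidate segment
    y = (start, end), given as frame indices, to a fixed-length vector."""
    start, end = y
    frame_rate = 100.0  # assumed frames per second
    duration = (end - start) / frame_rate
    inside = frames[start:end].mean(axis=0)   # spectral mean inside the segment
    outside = np.concatenate([frames[:start], frames[end:]]).mean(axis=0)
    return np.concatenate([
        [duration / AVG_VOWEL_DURATION],      # duration relative to the average
        inside - outside,                     # spectral contrast in/out of segment
    ])

frames = np.random.randn(50, 4)               # 50 frames, 4 features each
print(vowel_features(frames, (10, 20)).shape) # (5,)
```

However the candidate segment is shifted, the output always has the same dimension, which is exactly what lets a single weight vector score every candidate label.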

We begin by posing the structured learning setting. We consider a supervised learning setting with input objects $x \in \mathcal{X}$ and target labels $y \in \mathcal{Y}$. The labels may be sequences, trees, grids, or other high-dimensional objects with internal structure. We assume a fixed mapping $\phi: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$ from the set of input objects and target labels to a real vector of length $d$. We call the elements of this mapping feature functions (or feature maps).

Example In the problem of pronunciation modeling, the input $x$ is a pronounced sequence of phones, and the output $y$ is the pronounced word. Most often pronunciation variations cannot be found in a lexicon of canonical pronunciations and have to be modeled statistically. In this task, the feature functions map a pronunciation from a set of pronunciations of different lengths, along with a proposed word, to a vector of fixed dimension in $\mathbb{R}^d$. One such feature function might measure the edit distance between the pronunciation $x$ and the canonical pronunciation of the word $y$ in the lexicon. This feature function counts the minimum number of edit operations (insertions, deletions, and substitutions) needed to convert the actual pronunciation to the lexicon pronunciation; it is low if the actual pronunciation is close to the lexicon pronunciation and high otherwise. See Hao, Keshet, and Livescu (2012) and the references therein for a full description of this task and other examples of feature functions.
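Such an edit-distance feature can be sketched with the standard dynamic program (Levenshtein distance). The phone sequences below are illustrative, not taken from any particular lexicon.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (Levenshtein distance)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[m][n]

# Illustrative: a reduced pronunciation vs. a canonical one ("prob'ly")
actual = ["p", "r", "aa", "b", "l", "iy"]
canonical = ["p", "r", "aa", "b", "ah", "b", "l", "iy"]
print(edit_distance(actual, canonical))  # 2
```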

More Formally

Consider a supervised learning setting with input instances $x \in \mathcal{X}$ and target labels $y \in \mathcal{Y}$, where $\mathcal{Y}$ is a set of complex objects with an internal structure. We assume a fixed mapping, called the feature functions (or feature maps), $\phi: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$, from the set of input objects and target labels to a real vector of length $d$. Also consider a linear decoder with parameters $w \in \mathbb{R}^d$, whose prediction $\hat{y}$ is a good approximation to the true label of $x$, as follows:

$$\label{eq:decoding} \hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} ~ w^\top \phi(x, y)$$
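When the label set is small enough to enumerate, this decoding rule can be sketched directly; real structured tasks need task-specific inference such as dynamic programming. The feature map and toy labels below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the linear decoder: argmax over an enumerable label set.
def decode(w, x, labels, phi):
    scores = [w @ phi(x, y) for y in labels]
    return labels[int(np.argmax(scores))]

# Toy example: x is a 2-d point, y is one of three candidate labels,
# and phi places x into the block of coordinates that corresponds to y.
def phi(x, y):
    v = np.zeros(6)
    v[2 * y:2 * y + 2] = x
    return v

w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])
print(decode(w, np.array([3.0, -1.0]), [0, 1, 2], phi))  # 0
```

With this block feature map, $w^\top \phi(x, y)$ is simply the inner product of $x$ with the $y$-th block of $w$, so decoding picks the block that agrees most with $x$.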

Ideally, the learning algorithm finds $w$ such that the prediction rule optimizes the expected desired $\textit{measure of performance}$ or $\textit{evaluation metric}$ on unseen data. We define a $\textit{cost}$ function, $\ell(y, \hat{y})$, to be a non-negative measure of error when predicting $\hat{y}$ instead of $y$ as the label of $x$. Our goal is to find the parameter vector $w$ that minimizes this cost. Often the desired evaluation metric is a utility function that needs to be maximized (like BLEU or NDCG), in which case we define the cost to be 1 minus the evaluation metric. We assume that there exists some unknown probability distribution $\rho$ over pairs $(x,y)$, where $y$ is the desired output for input $x$. We would like to set $w$ so as to minimize the expected cost, or the $\textit{risk}$, of predicting $\hat{y}$,

$$\label{eq:w*} w^* = \operatorname*{argmin}_{w} ~ \mathbb{E}_{(x,y) \sim \rho} [\ell(y,\hat{y}(x))].$$

This objective function is hard to minimize directly (Keshet, 2014). Given a training set of examples $\mathcal{S} = \{(x_i,y_i)\}_{i=1}^{m}$, where each pair $(x_i, y_i)$ is drawn i.i.d. from $\rho$, a common practice is to find the model parameters that minimize the regularized mean surrogate loss,

$$\label{eq:reg-loss} w^* = \operatorname*{argmin}_{w} ~ \frac{1}{m}\sum_{i=1}^{m} \bar{\ell}(w,x_i,y_i) + \frac{\lambda}{2} \|w\|^2,$$

where $\bar{\ell}(w,x,y)$ is a surrogate loss function, and $\lambda$ is a trade-off parameter between the loss term and the regularization. Each algorithm has its own definition of the surrogate loss: e.g., the surrogate loss in the max-margin Markov model (Taskar et al., 2003) is the structured hinge loss with the Hamming cost, whereas the surrogate loss in conditional random fields (Lafferty et al., 2001) is the log loss. A general survey on structured prediction algorithms and their prediction rules is given in (Keshet, 2014).
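For instance, taking the structured hinge loss as the surrogate, the regularized objective can be sketched over a toy enumerable label set. The feature map, labels, and the zero-one cost standing in for Hamming below are all illustrative assumptions.

```python
import numpy as np

# Toy block feature map over two labels (illustrative).
def phi(x, y):
    v = np.zeros(4)
    v[2 * y:2 * y + 2] = x
    return v

def cost(y, y_hat):
    return float(y != y_hat)  # zero-one cost stands in for Hamming here

def structured_hinge(w, x, y, labels):
    # max over candidates of [cost - score(true label) + score(candidate)]
    return max(cost(y, yy) - w @ phi(x, y) + w @ phi(x, yy) for yy in labels)

def objective(w, S, labels, lam):
    """Regularized mean surrogate loss over the training set S."""
    mean_loss = np.mean([structured_hinge(w, x, y, labels) for x, y in S])
    return mean_loss + 0.5 * lam * np.linalg.norm(w) ** 2

S = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
w = np.array([2.0, 0.0, 0.0, 2.0])
print(objective(w, S, [0, 1], lam=0.1))
```

Here $w$ separates both examples with a margin larger than the cost, so the hinge term vanishes and only the regularizer contributes.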

Modules

Let us first define the loss-augmented prediction $\hat{y}^{\epsilon}$ as follows:

$$\label{eq:loss-augmented} \hat{y}^{\epsilon} = \operatorname*{argmax}_{\hat{y} \in \mathcal{Y}} ~ w^\top \phi(x, \hat{y}) + \epsilon \, \ell (y,\hat{y})$$

where setting $\epsilon = 0$ recovers the standard linear decoder, $\epsilon = 1$ gives loss-augmented inference, and any other value of $\epsilon$ (including negative ones) can be used, as in direct loss minimization (DLM).
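Over an enumerable label set, the loss-augmented decoder can be sketched directly. The feature map, weights, and zero-one cost below are illustrative; the example is chosen so that $\epsilon$ actually changes the prediction.

```python
import numpy as np

# Sketch of loss-augmented decoding: epsilon = 0 is plain decoding,
# epsilon = 1 adds the cost to each candidate's score.
def loss_augmented_decode(w, x, y, labels, phi, cost, epsilon):
    scores = [w @ phi(x, yy) + epsilon * cost(y, yy) for yy in labels]
    return labels[int(np.argmax(scores))]

def phi(x, y):                      # toy block feature map, two labels
    v = np.zeros(4)
    v[2 * y:2 * y + 2] = x
    return v

cost = lambda y, yy: float(y != yy)  # zero-one cost (illustrative)
w = np.array([1.0, 0.0, 0.9, 0.0])
x = np.array([1.0, 0.0])
print(loss_augmented_decode(w, x, 0, [0, 1], phi, cost, 0))  # 0: plain decoding
print(loss_augmented_decode(w, x, 0, [0, 1], phi, cost, 1))  # 1: loss-augmented
```

With $\epsilon = 1$ the wrong label's score is boosted by its cost, so loss-augmented inference surfaces the candidate that is both high-scoring and costly, which is exactly what margin-based updates need.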

We implemented seven structured prediction algorithms in StructED; their surrogate losses and update rules are summarized in the following table:

| Algorithm | Surrogate Loss | Update Rule |
|---|---|---|
| Structured Perceptron (Collins, 2002) | - | $w_{t+1} = w_t + \phi(x_t, y_t) - \phi(x_t,\hat{y}^0_t)$ |
| Structured SVM (Tsochantaridis et al., 2005) | $\max_{\hat{y}} [ \ell(y,\hat{y}) - w^\top\phi(x,y) + w^\top\phi(x,\hat{y})]$ | $w_{t+1} = (1-\eta_t \lambda)w_t + \eta_t( \phi(x_t, y_t) - \phi(x_t,\hat{y}^1_t))$ |
| Passive Aggressive (Crammer et al., 2006) | - | $w_{t+1} = w_{t} + \tau_{t}(\phi(x_t,y_t) - \phi(x_t,\hat{y}^1_t))$ |
| CRF (Lafferty et al., 2001) | $-\ln P_{w}(y \mid x)$, where $P_{w}(y \mid x) = \frac{1}{Z_{w}(x)} \exp\{w^\top \phi(x, y)\}$ | $w_{t+1} = (1-\eta_t \lambda)\,w_{t} + \eta_t\big( \phi(x_t,y_t) - \mathbb{E}_{y' \sim P_{w}(y' \mid x_t)}[ \phi(x_t,y')] \big)$ |
| Direct Loss Minimization (McAllester et al., 2010) | $\ell (y,\hat{y})$ | $w_{t+1} = w_{t} + \frac{\eta_t}{\epsilon}(\phi(x_t,\hat{y}^{-\epsilon}_t) - \phi(x_t,\hat{y}^0_{t}))$ |
| Structured Ramp Loss (McAllester and Keshet, 2011) | $\max_{\hat{y}} [ \ell(y,\hat{y}) + w^\top \phi(x,\hat{y})] - \max_{\tilde{y}} [w^\top \phi(x,\tilde{y})]$ | $w_{t+1} = (1-\eta_t\lambda)\,w_{t} + \eta_t(\phi(x_t,\hat{y}^0_t) - \phi(x_t,\hat{y}^1_t))$ |
| Probit Loss (Keshet et al., 2011) | $\mathbb{E}_{\gamma \sim \mathcal{N}(0,I)}[\ell (y,\hat{y}_{w+\gamma})]$ | $w_{t+1} = (1-\eta_t\lambda)\,w_t + \eta_t\, \mathbb{E}_{\gamma \sim \mathcal{N}(0,I)}[ \gamma \, \ell (y_t,\hat{y}^0_{w+\gamma})]$ |
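As a minimal illustration of the simplest row, the structured perceptron update can be run on a toy multi-class problem. Everything below is illustrative: exhaustive decoding stands in for task-specific inference, and the block feature map and data are assumptions, not StructED's actual API.

```python
import numpy as np

# Sketch of the structured perceptron: when the decoded label differs
# from the truth, move w toward the truth's features and away from the
# prediction's features.
def phi(x, y):                      # toy block feature map, three labels
    v = np.zeros(6)
    v[2 * y:2 * y + 2] = x
    return v

def decode(w, x, labels):
    return max(labels, key=lambda y: w @ phi(x, y))

def perceptron(S, labels, epochs=10):
    w = np.zeros(6)
    for _ in range(epochs):
        for x, y in S:
            y_hat = decode(w, x, labels)   # this is y-hat^0 in the table
            if y_hat != y:
                w = w + phi(x, y) - phi(x, y_hat)
    return w

S = [(np.array([1.0, 0.0]), 0),
     (np.array([0.0, 1.0]), 1),
     (np.array([-1.0, -1.0]), 2)]
w = perceptron(S, [0, 1, 2])
print([decode(w, x, [0, 1, 2]) for x, _ in S])  # [0, 1, 2]
```

The other update rules in the table differ only in which decoded label they use ($\hat{y}^0$, $\hat{y}^1$, or $\hat{y}^{-\epsilon}$) and in the shrinkage and step-size factors, so they drop into the same loop.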

Although kernels are not very popular in structured prediction, we also implemented two kernel expansion functions:

1. Polynomial.
2. RBF, using 2nd- and 3rd-order Taylor approximations (Cotter, Keshet and Srebro, 2011).
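To sketch the idea behind explicit kernel expansions, a second-degree polynomial expansion maps a feature vector to all monomials of degree at most two, so a linear model in the expanded space behaves like a polynomial kernel machine (up to the monomial coefficients of the exact kernel). This is an illustration of the concept, not StructED's implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_expand(v):
    """Explicit degree-2 polynomial expansion: 1, all coordinates,
    and all pairwise products (including squares)."""
    monomials = [1.0] + list(v)
    monomials += [a * b for a, b in combinations_with_replacement(v, 2)]
    return np.array(monomials)

v = np.array([2.0, 3.0])
print(poly2_expand(v))  # 1, x1, x2, x1^2, x1*x2, x2^2
```

The RBF option takes the same route, replacing the Gaussian kernel with a low-order Taylor approximation so that the expansion stays finite-dimensional.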

Moreover, in order to support multi-class classification, we also implemented the corresponding feature functions, inference, and loss-augmented inference, and use the zero-one loss as the cost function. Hence, using StructED as a multi-class classifier is straightforward. More details about using StructED for multi-class tasks can be found here.