The ultimate objective of discriminative learning is to train a system to optimize a desired measure of performance. In binary classification we are interested in finding a function that assigns a binary label to a single object, and minimizes the error rate (correct or incorrect) on unseen data.
In structured prediction the input is a complex object (a spoken utterance, an image, a phrase, a graph, etc.), and we are interested in the prediction of a structured label (a sequence of words, a bounding box, a parsing tree, etc.). Typically, each structured prediction task has its own measure of performance or evaluation metric, such as word error rate in speech recognition, the BLEU score in machine translation or the intersection-over-union score in object segmentation.
Consider a supervised learning setting with input instances $x \in \mathcal{X}$ and target labels $y \in \mathcal{Y}$, which refers to a set of complex objects with an internal structure. We assumed a fixed mapping called feature functions (or feature maps) $\phi: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$ from the set of input objects and target labels to a real vector of length $d$. Also consider a linear decoder with parameters $w \in \mathbb{R}^d$, such that $\hat{y}$ is a good approximation to the true label of $x$, as follows:
\begin{equation} \label{eq:decoding} \hat{y} = argmax_{y \in \mathcal{y}} ~ w^\top \phi(x, y) \end{equation}
Ideally, the learning algorithm finds $w$ such that the prediction rule optimizes the expected desired $\textit{measure of preference}$ or $\textit{evaluation metric}$ on unseen data. We define a $\textit{cost}$ function, $\ell(y, \hat{y})$, to be a non-negative measure of error when predicting $\hat{y}$ instead of $y$ as the label of $x$. Our goal is to find the parameters vector $w$ that minimizes this function. Often the desired evaluation metric is a utility function that needs to be maximized (like BLEU or NDCG) and then we define the cost to be 1 minus the evaluation metric. We assume that exists some unknown probability distribution $\rho$ over pairs $(x,y)$ where $y$ is the desired output for input $x$. We would like to set $w$ so as to minimize the expected cost, or the $\textit{risk}$, for predicting $\hat{y}$,
\begin{equation} \label{eq:w*} w^* = argmin_{w} ~ \mathbb{E}_{(x,y) \sim \rho} [\ell(y,\hat{y}(x))]. \end{equation}
This objective function is hard to minimize directly (Keshet, 2014). Given a training set of examples $\mathcal{S} = \{(x_i,y_i)\}_{i=1}^{m}$, where each pair $(x_i, y_i)$ is drawn i.i.d from $\rho$, a common practice is to find the model parameters that minimize the regularized mean surrogate loss,
\begin{equation} \label{eq:reg-loss} w^* = argmin_{w} ~ \frac{1}{m}\sum_{i=1}^{m} \bar{\ell}(w,x_i,y_i) + \frac{\lambda}{2} \|w\|^2, \end{equation}
where $\bar{\ell}(w,x,y)$ is a surrogate loss function, and $\lambda$ is a trade-off parameter between the loss term and the regularization. Each algorithm has its own definition of the surrogate loss, e.g., the surrogate loss in max-margin Markov model (Taskar et al., 2003) is the structured hinge loss with the Hamming cost, whereas the surrogate loss in conditional random fields (Lafferty et al., 2001) is the log loss function. A general survey on structured prediction algorithms and their prediction rules is given in (Keshet, 2014).
Let us first define the loss augmented function as follows:
\begin{equation} \label{eq:loss-augmented} \hat{y}^{\epsilon} = argmax_{\hat{y} \in \mathcal{Y}} ~ w^\top \phi(x, \hat{y}) + \epsilon \, \ell (y,\hat{y}) \end{equation}
where epsilon can be 0 if we wish to use a standard linear decoder, 1 if we wish to use a loss-augmented or any other $\epsilon$ value (as used in DLM).We implemented six structured prediction algorithms in StructED, all the loss functions and update rules are summarized in the following table:
Loss | Update Rule |
---|---|
Structured Perceptron (Collins, 2002). | |
- | $w_{t+1} = w_t + \phi(x_t, y_t) - \phi(x_t,\hat{y}^0_t)$ |
Structured SVM (Tsochantaridis et al., 2005). | |
$max_{\hat{y}} ~[ \ell(y,\hat{y}) - w^\top\phi(x,y) + w^\top\phi(x,\hat{y})]$ | $w_{t+1} = (1-\eta_t \lambda)w_t + \eta_t( \phi(x_t, y_t) - \phi(x_t,\hat{y}^1_t))$ |
Passive Aggressive (Crammer et al., 2006). | |
- | $w_{t+1} = w_{t} + \tau_{t}(\phi(x_t,y_t)) - \phi(x_t,\hat{y}^1_t))$ |
CRF (Lafferty et al., 2001). | |
$-\ln P_{w}(y \,|\, x)$, where $P_{w}(y \,|\, x) = \frac{1}{Z_{w}(x)} \exp\{w^\top \phi(x, y)\}$ | $w_{t+1} = (1-\eta_t \lambda)\,w_{t} + \eta_t\Big( \phi(x_{j_t},y_{j_t}) - \mathbb{E}_{y'\sim P_{w}(y' \,|\, x)}[ \phi(x_{j_t},y')] \Big)$ |
Direct Loss Minimization (McAllester et al., 2010). | |
$\ell (y,\hat{y})$ | $w_{t+1} = w_{t} + \frac{\eta_t}{\epsilon}(\phi(x_t,\hat{y}^{-\epsilon}_t) - \phi(x_t,\hat{y}^0_{t}))$ |
Structured Ramp Loss (McAllester and Keshet, 2011). | |
$\max_{\hat{y}} [ \ell(y,\hat{y}) + w^\top \phi(x,\hat{y})] - \max_{\tilde{y}} [w^\top \phi(x,\tilde{y})]$ | $w_{t+1} = (1-\eta_t\lambda)\,w_{t} + \eta_t(\phi(x_t,\hat{y}^0_t) - \phi(x_t,\hat{y}^1_t))$ |
Probit Loss (Keshet et al., 2011). | |
$\mathbb{E}_{\gamma \sim \mathcal{N}(0,I)}[\ell (y,\hat{y}_{w+\gamma})]$ | $w_{t+1} = (1-\eta_t\lambda)\,w_t + \eta_t \mathbb{E}_{\gamma \sim \mathcal{N}(0,I)}[ \gamma \, \ell (y_t,\hat{y}^0_{w+\gamma})]$ |
Although it is not so popular in structured prediction, we also implemented two kernel expansion functions:
Moreover, in order to support multi-class, we also implemented the feature functions, inference, loss-augmented inference and and use zero-one loss as a cost function. Hence, using StructED as a multi-class classifier is straightforward. More details about using StructED for multi-class tasks can be found here.