Vowel Duration Measurement


In this tutorial we will demonstrate how to use the StructED package and how to add a new task to it. We will give code examples for the Task Loss, Inference (prediction), and Feature Functions interfaces. For this tutorial we will use a structured prediction problem called Vowel Duration Measurement. We provide a small subset of the original data used in our experiments.

Before we begin!

You can use this tutorial in two ways:
  1. Run the code as we implemented it: simply run the bash script in any of the tutorial directories in our GitHub repository. The tutorials can be found under the tutorials-code directory, in the vowel folder.
  2. Write the code yourself: first set up your development environment. Open a Java project and add the StructED jar file to your build path (it can be found under the bin directory of the StructED repository). You can download all StructED source code and jar files from GitHub.
Now let's begin!

The Task


In the problem of vowel duration measurement we are given a speech signal which includes exactly one vowel, preceded and followed by consonants. Our goal is to predict the start and end times of the vowel as accurately as possible. We represent each speech signal as a sequence of acoustic feature vectors $\overline{x}$ = ($x_1$, $x_2$,...,$x_T$), where each $x_i$ (1 $\leq$ i $\leq$ T) is a $d$-dimensional vector. We denote the domain of the feature vectors by $X \subset \mathbb{R}^d$. Since we are dealing with speech, the length of the signal varies from one file to another, so $T$ is not fixed. We denote by $X^*$ the set of all finite-length sequences over $X$. More detailed information about the features and the feature functions is described later in this tutorial. In addition, each speech segment is accompanied by the vowel start and end times; we denote by the pair $t_{onset}\in \mathcal{T}$ and $t_{offset}\in \mathcal{T}$ the onset and offset times of the vowel respectively, where $\mathcal{T} = \{1,...,T\}$. For brevity, let us denote ($t_{onset}, t_{offset}$) by $\hat{t}$. To sum up, our goal is to learn a function, denoted $f$, which takes as input a speech signal $\overline{x}$ and returns the pair $\hat{t}$, the start and end times of the vowel. That is, $f$ is a function from the domain of all possible CVC speech segments $X^*$ to the domain of all possible onset and offset pairs $\mathcal{T}^2$. We provide a small subset of the data set used in our experiments; it can be found under the db/ directory.
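
Written as an equation, the function we will learn takes the standard linear structured-prediction form used by StructED: it searches for the onset/offset pair whose score $w^\top \phi$ is maximal, where $w$ is a weight vector and $\phi$ is the set of feature functions described later in this tutorial (this is the same argmax that the Inference section below implements by brute force):

\begin{equation} f(\overline{x}) = \underset{\hat{t} \in \mathcal{T}^2}{\mathrm{argmax}} \ w^\top \phi(\overline{x}, \hat{t}) \end{equation}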

The Code


Here we present which classes need to be added and which interfaces need to be implemented. We provide the source code for all of the classes and interfaces.

Task Loss

We now add a task loss class. Every task loss class should implement the ITaskLoss interface. In our problem setting we define the task loss as:

\begin{equation} \label{eq:loss} \gamma (\hat{t},\hat{t}^{'}) = \max\{0,\,|\hat{t}_{onset} - \hat{t}^{'}_{onset}| - \epsilon_{onset}\} + \max\{0,\, |\hat{t}_{offset} - \hat{t}^{'}_{offset}| - \epsilon_{offset}\} \end{equation}


In other words, the loss is zero as long as the predicted onset and offset are each within the tolerances $\epsilon_{onset}$ and $\epsilon_{offset}$ of the correct ones, and positive otherwise.

For this we create a new Java class that implements the ITaskLoss interface. The code for the implementation of this class is attached to this tutorial.
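
For reference, here is a minimal sketch of what such a class could look like, matching the epsilon-insensitive form of the loss above. The method signature is an assumption about the ITaskLoss interface (labels encoded as "onset-offset" strings, tolerances passed via params); verify it against the interface shipped with StructED before use.

        import java.util.List;

        /**
         * A minimal sketch of the epsilon-insensitive task loss defined above.
         * The signature below is an assumption about ITaskLoss (labels encoded
         * as "onset-offset" strings, tolerances passed via params); check the
         * interface in the StructED sources.
         */
        public class TaskLossVowelDurationSketch {
            public double computeTaskLoss(String predicted, String gold, List<Double> params) {
                double epsOnset  = params.get(0);  // epsilon_onset, in frames
                double epsOffset = params.get(1);  // epsilon_offset, in frames

                String[] p = predicted.split("-");
                String[] g = gold.split("-");
                double onsetDiff  = Math.abs(Double.parseDouble(p[0]) - Double.parseDouble(g[0]));
                double offsetDiff = Math.abs(Double.parseDouble(p[1]) - Double.parseDouble(g[1]));

                // zero loss inside the tolerance, linear outside it
                return Math.max(0, onsetDiff - epsOnset) + Math.max(0, offsetDiff - epsOffset);
            }
        }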

Inference

Now we create the inference class. Every inference class should implement the IInference interface. In our problem setting the prediction is done by brute force: we run over all the possible onsets and offsets and predict the pair that maximizes $w^\top \phi(\overline{x}, \hat{t})$, but we still make a few assumptions to speed up the search.

We assume that there is a minimum and maximum possible vowel length, denoted by MIN_VOWEL and MAX_VOWEL respectively. Moreover, we assume that the vowel is never placed at the very start of the signal, hence we start searching after a gap, denoted MIN_GAP_END.

For this we create a new Java class that implements the IInference interface. The code for the implementation of this class is attached to this tutorial.
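
To make the search concrete, here is a minimal sketch of the brute-force argmax under the assumptions above. The constant values, the score helper, and the method signature are illustrative placeholders, not the actual InferenceVowelDurationData implementation; the real class must conform to the IInference interface.

        /**
         * A sketch of the brute-force search over onset/offset pairs.
         * MIN_GAP_END, MIN_VOWEL and MAX_VOWEL carry illustrative values,
         * and score() is a hypothetical stand-in for w^T phi(x, t).
         */
        public class BruteForceInferenceSketch {
            static final int MIN_GAP_END = 10;  // skip the first frames of the signal
            static final int MIN_VOWEL   = 5;   // minimum vowel length, in frames
            static final int MAX_VOWEL   = 80;  // maximum vowel length, in frames

            public int[] predict(double[][] x, double[] w) {
                int T = x.length;
                double bestScore = Double.NEGATIVE_INFINITY;
                int[] best = {-1, -1};
                // run over all (onset, offset) pairs allowed by the assumptions
                for (int onset = MIN_GAP_END; onset < T - MIN_VOWEL; onset++) {
                    for (int offset = onset + MIN_VOWEL;
                         offset < Math.min(T, onset + MAX_VOWEL); offset++) {
                        double s = score(x, onset, offset, w);  // w^T phi(x, (onset, offset))
                        if (s > bestScore) {
                            bestScore = s;
                            best = new int[]{onset, offset};
                        }
                    }
                }
                return best;
            }

            // hypothetical helper: dot product between w and the feature
            // functions of the next section, evaluated at (onset, offset)
            private double score(double[][] x, int onset, int offset, double[] w) {
                return 0.0;  // placeholder
            }
        }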

Feature Functions

We now add the feature functions. We can divide our feature functions into three types (a small code sketch of the first two follows this list):
  1. Type 1: $\Delta(location, window-size, feature)$
     While exploring the data we noticed a rapid increase in certain features when the vowel onset occurs and a rapid decrease when the vowel offset occurs. As a result we implemented the $\Delta$ feature function to capture this behavior. It computes the difference between the mean of the signal of the given feature before and after the location parameter, in a range given by the window-size parameter.

  2. Type 2: $\mu(location_{start}, location_{end}, window-size, feature)$
     Besides the rapid change at the vowel onset and offset, we have also noticed that the mean of the signal in certain features between the onset and offset is greater than the mean of the signal before and after. Hence, we implemented the $\mu$ function. It computes the difference between the mean of the signal of a given feature from $location_{start}$ to $location_{end}$ and the mean of the signal before $location_{start}$ or after $location_{end}$, in a range given by the window-size parameter.

  3. Type 3: Gamma and Gaussian distribution over the data
     Since our data is distributed somewhere between a Gamma and a Gaussian distribution, we computed the probability of a given vowel length under both Gamma and Gaussian distributions.
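
As promised, here is a minimal sketch of the $\Delta$ and $\mu$ computations on a single acoustic feature. The method names and the plain 2D-array representation of the signal are illustrative assumptions, not the exact helpers used in FeatureFunctionsVowelDuration.

        /** Illustrative sketches of the Delta and mu feature computations. */
        public class FeatureSketches {

            // Type 1: mean of `feature` after `location` minus its mean before,
            // each taken over a window of `windowSize` frames
            public static double delta(double[][] x, int location, int windowSize, int feature) {
                return mean(x, location, Math.min(x.length, location + windowSize), feature)
                     - mean(x, Math.max(0, location - windowSize), location, feature);
            }

            // Type 2: mean of `feature` inside [start, end) minus its mean in the
            // window just before `start` (the window after `end` is handled analogously)
            public static double mu(double[][] x, int start, int end, int windowSize, int feature) {
                return mean(x, start, end, feature)
                     - mean(x, Math.max(0, start - windowSize), start, feature);
            }

            // mean of one feature over the frame range [from, to)
            private static double mean(double[][] x, int from, int to, int feature) {
                double sum = 0.0;
                for (int t = from; t < to; t++) sum += x[t][feature];
                return to > from ? sum / (to - from) : 0.0;
            }
        }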

Running The Code


Now we can create the StructEDModel with the interfaces we have just implemented; all that is left to do is to choose the model (learning algorithm). Here is a snippet of such code (the complete code can be found in the package repository under the tutorials-code folder):

              
        /* VOWEL DURATION DATA */
        Logger.info("Vowel Duration data example.");
        // <the path to the vowel duration train data>
        String trainPath = "data/vowel/train.vowel"; 
        // <the path to the vowel duration test data>
        String testPath = "data/vowel/test.results"; 

        // parameters
        int epochNum = 1;
        int readerType = 2;
        int isAvg = 1;
        int numExamples2Display = 1;
        Reader reader = getReader(readerType);

        // load the data
        InstancesContainer vowelTrainInstances = reader.readData(trainPath, Consts.SPACE, 
              Consts.COLON_SPLITTER);
        InstancesContainer vowelTestInstances = reader.readData(testPath, Consts.SPACE, 
              Consts.COLON_SPLITTER);
        if (vowelTrainInstances.getSize() == 0) return;

        /* DIRECT LOSS MODEL*/
        // init the first weight vector
        Vector W = new Vector() {{put(0, 0.0);}}; 
        // model parameters
        ArrayList<Double> arguments = new ArrayList<Double>() {{add(0.1);add(-1.51);}}; 
        // task loss parameters
        ArrayList<Double> task_loss_params = new ArrayList<Double>(){{add(1.0);add(2.0);}}; 

        // build the model
        StructEDModel vowel_model = new StructEDModel(W, new DirectLoss(), new TaskLossVowelDuration(),
                new InferenceVowelDurationData(), null, new FeatureFunctionsVowelDuration(), arguments); 
        // train
        vowel_model.train(vowelTrainInstances, task_loss_params, null, epochNum, isAvg);
        // predict 
        vowel_model.predict(vowelTestInstances, task_loss_params, numExamples2Display); 
              


In this case we used a different reader (a lazy reader), and we stored our data in a data structure called LazyInstanceContainer. This container gets the paths to all the data files and reads the actual data on demand. It inherits from the original instances container, hence we can use it with only minor changes. Moreover, since we store the raw data here as a 2D array, we created a new data structure called Example2D which supports this kind of data.