Vowel Duration Measurement


In this tutorial we will demonstrate how to use the StructED package and how to add a new task to it. We will give code examples for the Task Loss, Inference (prediction), and Feature Functions interfaces. For this tutorial we will use a structured prediction problem called Vowel Duration Measurement. We provide a small subset of the original data used in our experiments.

Before we begin!

You can use this tutorial in two ways:
  1. Run the code as we implemented it: simply run the bash script in any of the tutorial directories in our GitHub repository. The tutorials can be found under the tutorials-code directory, in the vowel folder.
  2. Write the code yourself: first set up your development environment. Open a Java project and add the StructED jar file to your build path (it can be found under the bin directory of the StructED repository). You can download all StructED source code and jar files from GitHub.
Now let's begin!

The Task


In the problem of vowel duration measurement we are given a speech signal which includes exactly one vowel, preceded and followed by consonants. Our goal is to predict the start and end times of the vowel as accurately as possible. We represent each speech signal as a sequence of acoustic feature vectors $\overline{x}$ = ($x_1$, $x_2$,...,$x_T$), where each $x_i$ (1 $\leq$ i $\leq$ T) is a $d$-dimensional vector. We denote the domain of the feature vectors by $X \subset \mathbb{R}^d$. Since we are dealing with speech, the length of the signal varies from one file to another, so $T$ is not fixed. We denote by $X^*$ the set of all finite-length sequences over $X$. More detailed information about the features and the feature functions is described later in this tutorial. In addition, each speech segment is accompanied by the vowel start and end times; we denote by the pair $t_{onset}\in \mathcal{T}$ and $t_{offset}\in \mathcal{T}$ the onset and offset times of the vowel respectively, where $\mathcal{T} = \{1,...,T\}$. For brevity, let us denote ($t_{onset}, t_{offset}$) by $\hat{t}$. To sum up, our goal is to learn a function, denoted $f$, which takes as input a speech signal $\overline{x}$ and returns the pair $\hat{t}$, the start and end times of the vowel. That is, $f$ is a function from the domain of all possible CVC speech segments $X^*$ to the domain of all possible onset and offset pairs $\mathcal{T}^2$. We provide a small subset of the data set used in our experiments; it can be found under the db/ directory.
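
Written as an equation, the function we will learn takes the standard linear structured-prediction form used by StructED: it searches for the onset/offset pair whose score $w^\top \phi$ is maximal, where $w$ is a weight vector and $\phi$ is the set of feature functions described later in this tutorial (this is the same argmax that the Inference section below implements by brute force):

\begin{equation} f(\overline{x}) = \underset{\hat{t} \in \mathcal{T}^2}{\mathrm{argmax}} \ w^\top \phi(\overline{x}, \hat{t}) \end{equation}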

The Code


Here we present which classes need to be added and which interfaces need to be implemented. We provide the source code for all of the classes and interfaces.

Task Loss

We now add a task loss class. Every task loss class should implement the ITaskLoss interface. In our problem setting we define the task loss as:

\begin{equation} \label{eq:loss} \gamma (\hat{t},\hat{t}^{'}) = \max\{0,\,|\hat{t}_{onset} - \hat{t}^{'}_{onset}| - \epsilon_{onset}\} + \max\{0,\, |\hat{t}_{offset} - \hat{t}^{'}_{offset}| - \epsilon_{offset}\} \end{equation}


In other words, the loss is zero as long as the predicted onset and offset are each within the tolerances $\epsilon_{onset}$ and $\epsilon_{offset}$ of the correct ones, and positive otherwise.

For this we create a new Java class that implements the ITaskLoss interface. The code for the implementation of this class is attached to this tutorial.
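
For reference, here is a minimal sketch of what such a class could look like, matching the epsilon-insensitive form of the loss above. The method signature is an assumption about the ITaskLoss interface (labels encoded as "onset-offset" strings, tolerances passed via params); verify it against the interface shipped with StructED before use.

        import java.util.List;

        /**
         * A minimal sketch of the epsilon-insensitive task loss defined above.
         * The signature below is an assumption about ITaskLoss (labels encoded
         * as "onset-offset" strings, tolerances passed via params); check the
         * interface in the StructED sources.
         */
        public class TaskLossVowelDurationSketch {
            public double computeTaskLoss(String predicted, String gold, List<Double> params) {
                double epsOnset  = params.get(0);  // epsilon_onset, in frames
                double epsOffset = params.get(1);  // epsilon_offset, in frames

                String[] p = predicted.split("-");
                String[] g = gold.split("-");
                double onsetDiff  = Math.abs(Double.parseDouble(p[0]) - Double.parseDouble(g[0]));
                double offsetDiff = Math.abs(Double.parseDouble(p[1]) - Double.parseDouble(g[1]));

                // zero loss inside the tolerance, linear outside it
                return Math.max(0, onsetDiff - epsOnset) + Math.max(0, offsetDiff - epsOffset);
            }
        }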

Inference

Now we create the inference class. Every inference class should implement the IInference interface. In our problem setting the prediction is done by brute force: we run over all the possible onsets and offsets and predict the pair that maximizes $w^\top \phi(\overline{x}, \hat{t})$, but we still make a few assumptions to speed up the search.

We assume that there is a minimum and maximum possible vowel length, denoted by MIN_VOWEL and MAX_VOWEL respectively. Moreover, we assume that the vowel is never placed at the very start of the signal, hence we start searching after a gap, denoted MIN_GAP_END.

For this we create a new Java class that implements the IInference interface. The code for the implementation of this class is attached to this tutorial.
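
To make the search concrete, here is a minimal sketch of the brute-force argmax under the assumptions above. The constant values, the score helper, and the method signature are illustrative placeholders, not the actual InferenceVowelDurationData implementation; the real class must conform to the IInference interface.

        /**
         * A sketch of the brute-force search over onset/offset pairs.
         * MIN_GAP_END, MIN_VOWEL and MAX_VOWEL carry illustrative values,
         * and score() is a hypothetical stand-in for w^T phi(x, t).
         */
        public class BruteForceInferenceSketch {
            static final int MIN_GAP_END = 10;  // skip the first frames of the signal
            static final int MIN_VOWEL   = 5;   // minimum vowel length, in frames
            static final int MAX_VOWEL   = 80;  // maximum vowel length, in frames

            public int[] predict(double[][] x, double[] w) {
                int T = x.length;
                double bestScore = Double.NEGATIVE_INFINITY;
                int[] best = {-1, -1};
                // run over all (onset, offset) pairs allowed by the assumptions
                for (int onset = MIN_GAP_END; onset < T - MIN_VOWEL; onset++) {
                    for (int offset = onset + MIN_VOWEL;
                         offset < Math.min(T, onset + MAX_VOWEL); offset++) {
                        double s = score(x, onset, offset, w);  // w^T phi(x, (onset, offset))
                        if (s > bestScore) {
                            bestScore = s;
                            best = new int[]{onset, offset};
                        }
                    }
                }
                return best;
            }

            // hypothetical helper: dot product between w and the feature
            // functions of the next section, evaluated at (onset, offset)
            private double score(double[][] x, int onset, int offset, double[] w) {
                return 0.0;  // placeholder
            }
        }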

Feature Functions

We now add the feature functions. We can divide our feature functions into three types (a small code sketch of the first two follows this list):
  1. Type 1: $\Delta(location, window-size, feature)$
     While exploring the data we noticed a rapid increase in certain features when the vowel onset occurs and a rapid decrease when the vowel offset occurs. As a result we implemented the $\Delta$ feature function to capture this behavior. It computes the difference between the mean of the signal of the given feature before and after the location parameter, in a range given by the window-size parameter.

  2. Type 2: $\mu(location_{start}, location_{end}, window-size, feature)$
     Besides the rapid change at the vowel onset and offset, we have also noticed that the mean of the signal in certain features between the onset and offset is greater than the mean of the signal before and after. Hence, we implemented the $\mu$ function. It computes the difference between the mean of the signal of a given feature from $location_{start}$ to $location_{end}$ and the mean of the signal before $location_{start}$ or after $location_{end}$, in a range given by the window-size parameter.

  3. Type 3: Gamma and Gaussian distribution over the data
     Since our data is distributed somewhere between a Gamma and a Gaussian distribution, we computed the probability of a given vowel length under both Gamma and Gaussian distributions.
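
As promised, here is a minimal sketch of the $\Delta$ and $\mu$ computations on a single acoustic feature. The method names and the plain 2D-array representation of the signal are illustrative assumptions, not the exact helpers used in FeatureFunctionsVowelDuration.

        /** Illustrative sketches of the Delta and mu feature computations. */
        public class FeatureSketches {

            // Type 1: mean of `feature` after `location` minus its mean before,
            // each taken over a window of `windowSize` frames
            public static double delta(double[][] x, int location, int windowSize, int feature) {
                return mean(x, location, Math.min(x.length, location + windowSize), feature)
                     - mean(x, Math.max(0, location - windowSize), location, feature);
            }

            // Type 2: mean of `feature` inside [start, end) minus its mean in the
            // window just before `start` (the window after `end` is handled analogously)
            public static double mu(double[][] x, int start, int end, int windowSize, int feature) {
                return mean(x, start, end, feature)
                     - mean(x, Math.max(0, start - windowSize), start, feature);
            }

            // mean of one feature over the frame range [from, to)
            private static double mean(double[][] x, int from, int to, int feature) {
                double sum = 0.0;
                for (int t = from; t < to; t++) sum += x[t][feature];
                return to > from ? sum / (to - from) : 0.0;
            }
        }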

Running The Code


Now we can create the StructEDModel with the interfaces we have just implemented; all that is left to do is to choose the model (learning algorithm). Here is a snippet of such code (the complete code can be found in the package repository under the tutorials-code folder):

              
        /* VOWEL DURATION DATA */
        Logger.info("Vowel Duration data example.");
        // <the path to the vowel duration train data>
        String trainPath = "data/vowel/train.vowel"; 
        // <the path to the vowel duration test data>
        String testPath = "data/vowel/test.results"; 

        // parameters
        int epochNum = 1;
        int readerType = 2;
        int isAvg = 1;
        int numExamples2Display = 1;
        Reader reader = getReader(readerType);

        // load the data
        InstancesContainer vowelTrainInstances = reader.readData(trainPath, Consts.SPACE, 
              Consts.COLON_SPLITTER);
        InstancesContainer vowelTestInstances = reader.readData(testPath, Consts.SPACE, 
              Consts.COLON_SPLITTER);
        if (vowelTrainInstances.getSize() == 0) return;

        /* DIRECT LOSS MODEL*/
        // init the first weight vector
        Vector W = new Vector() {{put(0, 0.0);}}; 
        // model parameters
        ArrayList<Double> arguments = new ArrayList<Double>() {{add(0.1);add(-1.51);}}; 
        // task loss parameters
        ArrayList<Double> task_loss_params = new ArrayList<Double>(){{add(1.0);add(2.0);}}; 

        // build the model
        StructEDModel vowel_model = new StructEDModel(W, new DirectLoss(), new TaskLossVowelDuration(),
                new InferenceVowelDurationData(), null, new FeatureFunctionsVowelDuration(), arguments); 
        // train
        vowel_model.train(vowelTrainInstances, task_loss_params, null, epochNum, isAvg);
        // predict 
        vowel_model.predict(vowelTestInstances, task_loss_params, numExamples2Display); 
              


In this case we used a different reader (a lazy reader), and we stored our data in a data structure called LazyInstanceContainer. This container gets the paths to all the data files and reads the actual data on demand. It inherits from the original instances container, hence we can use it with only minor changes. Moreover, since we store the raw data here as a 2D array, we created a new data structure called Example2D which supports this kind of data.