CS224N_2019_Assignment3: Dependency Parsing (Solution)

前言

A3作業讓你學會建立neural dependency parser的同時也能熟悉Pytorch的用法。
Written part是關於Adam和Dropout的解答與思考，這部分教授在課上解釋的比較少，但屬於neural network的重點之一，建議閱讀相關文獻加深這部分的理解。
Coding part是關於運用wrriten part的optimizer trick建立一個完整的simple neural net，並進行模型訓練。

題目詳情

– Written Part –

#1. Machine Learning & Neural Networks (8 points)

Answer：

( a )
i. Using m updates the gradient by multiplying it by α(1-β) times, reducing the gradient even further than SGD.

ii. v will get larger updates since its calculation contains the power of the gradients. If v is larger than 1, the updated v will be larger; if v is smaller than 1, the updated v will become smaller. This can help with learning by avoiding the learning rate being too large(exploding) or too small(vanishing) through the calculation of the division (√v).

( b )
i. $\frac{1}{1-p_{drop}}$ .
Since
$h_{drop} = γd⊙h$
∵ $h_{drop}=γ(1-p_{drop})⊙h=h$
∴ $γ(1-p_{drop})=1$
and we get $\frac{1}{1-p_{drop}}$ .

ii. Because dropout is a regularization form and used to prevent overfitting in order to achieve better generalization of the model.

#2. Neural Transition-Based Dependency Parsing (42 points)

( a )

Stack	Buffer	New Dependency	Transition
[ROOT,parsed,this]	[sentence,correctly]		SHIFT
[ROOT,parsed,this,sentence]	[correctly]		SHIFT
[ROOT,parsed,sentence]	[correctly]	sentence->this	LEFT ARC
[ROOT,parsed]	[correctly]	parse->sentence	RIGHT ARC
[ROOT,parsed,correctly]			SHIFT
[ROOT,parsed]		parsed->correctly	RIGHT ARC
[ROOT]		ROOT->paresed	RIGHT ARC

( b )
2n steps. For each dependency, there are two steps of transitions(SHIFT and LEFT ARC or RIGHT ARC)

– Coding Part –

( c )寫parser的transition機制。這裡需要注意的是sentence不允許有改動，sentence是list，所以要用到slicing/list()/copy function去copy list。關於不同的copy方式對比的參考：How to clone or copy a list? - stackoverflow

class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.

        @param sentence (list of str): The sentence to be parsed as a list of words.
                                        Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do not alter it in your code.
        self.sentence = sentence

        ### YOUR CODE HERE (3 Lines)

        self.stack = ['ROOT']
        self.buffer = sentence[:]  #Use slicing to copy lists. With self.buffer = sentence, the assignment just copies the reference to the list, not the actual list.
        self.dependencies = []

        ### END YOUR CODE


    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        @param transition (str): A string that equals "S", "LA", or "RA" representing the shift,
                                left-arc, and right-arc transitions. You can assume the provided
                                transition is a legal transition.
        """
        ### YOUR CODE HERE (~7-10 Lines)

        if transition == 'S':
            self.stack.append(self.buffer[0])
            self.buffer.remove(self.buffer[0])
        elif transition == 'LA':
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.remove(self.stack[-2])
        elif transition == 'RA':
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.remove(self.stack[-1])

        ### END YOUR CODE

檢驗結果: 在这里插入图片描述 ( d ) 根據PDF的algorithm1架構，寫一個parse sentence in minibatches的function。

def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    @param sentences (list of list of str): A list of sentences to be parsed
                                            (each sentence is a list of words and each word is of type string)
    @param model (ParserModel): The model that makes parsing decisions. It is assumed to have a function
                                model.predict(partial_parses) that takes in a list of PartialParses as input and
                                returns a list of transitions predicted for each parse. That is, after calling
                                    transitions = model.predict(partial_parses)
                                transitions[i] will be the next transition to apply to partial_parses[i].
    @param batch_size (int): The number of PartialParses to include in each minibatch


    @return dependencies (list of dependency lists): A list where each element is the dependencies
                                                    list for a parsed sentence. Ordering should be the
                                                    same as in sentences (i.e., dependencies[i] should
                                                    contain the parse for sentences[i]).
    """
    dependencies = []

    ### YOUR CODE HERE (~8-10 Lines)

    partial_parses = [PartialParse(s) for s in sentences]  # for each sentence in sentences, apply PartialParse() to get partial_parses
    unfinished_parses = partial_parses.copy()

    while len(unfinished_parses) != 0:
        minibatch = unfinished_parses[0:batch_size]  # extract mini-batch
        transitions = model.predict(minibatch)
        for i in range(len(minibatch)):
            minibatch[i].parse_step(transitions[i])  # for each single unit in minibatch, apply parse_step() to its transition
            if len(minibatch[i].buffer) == 0 and len(minibatch[i].stack) == 1:
                unfinished_parses.remove(minibatch[i])  # remove the unit in mini batch after completion

    dependencies.extend(p.dependencies for p in partial_parses)  # use dependencies of partial_parses directly because the minibatch is the shallow copy of partial_parses and both contains reference of the same objects.

    ### END YOUR CODE

    return dependencies

檢驗結果：
在这里插入图片描述 ( e ) 搭nn的不同Layer，照著document做沒什麼難度。

需要注意的是input size，根據PDF的解釋，先有parser提取出m個由a list of integer token構成的features，然後再找到每個詞對應的embedding，連接成一個 input vector： $x = [E_{w1},...,E_{wm}]$ ，所以 $x$ 的shape等於number of features(which is $m$ )*embedding size(which is the size of each $E_{w}$ )。

先寫好parser model.py：

class ParserModel(nn.Module):
    """ Feedforward neural network with an embedding layer and single hidden layer.
    The ParserModel will predict which transition should be applied to a
    given partial parse configuration.

    PyTorch Notes:
        - Note that "ParserModel" is a subclass of the "nn.Module" class. In PyTorch all neural networks
            are a subclass of this "nn.Module".
        - The "__init__" method is where you define all the layers and their respective parameters
            (embedding layers, linear layers, dropout layers, etc.).
        - "__init__" gets automatically called when you create a new instance of your class, e.g.
            when you write "m = ParserModel()".
        - Other methods of ParserModel can access variables that have "self." prefix. Thus,
            you should add the "self." prefix layers, values, etc. that you want to utilize
            in other ParserModel methods.
        - For further documentation on "nn.Module" please see https://pytorch.org/docs/stable/nn.html.
    """
    def __init__(self, embeddings, n_features=36,
        hidden_size=200, n_classes=3, dropout_prob=0.5):
        """ Initialize the parser model.

        @param embeddings (Tensor): word embeddings (num_words, embedding_size)
        @param n_features (int): number of input features
        @param hidden_size (int): number of hidden units
        @param n_classes (int): number of output classes
        @param dropout_prob (float): dropout probability
        """
        super(ParserModel, self).__init__()
        self.n_features = n_features
        self.n_classes = n_classes
        self.dropout_prob = dropout_prob
        self.embed_size = embeddings.shape[1]
        self.hidden_size = hidden_size
        self.pretrained_embeddings = nn.Embedding(embeddings.shape[0], self.embed_size)
        self.pretrained_embeddings.weight = nn.Parameter(torch.tensor(embeddings))

        ### YOUR CODE HERE (~5 Lines)

        self.embed_to_hidden = nn.Linear(self.n_features*self.embed_size, self.hidden_size) # According to PDF, input shape equals number of features multiplies by embedding size.
        nn.init.xavier_uniform_(self.embed_to_hidden.weight, gain=1)

        self.dropout = nn.Dropout(self.dropout_prob)

        self.hidden_to_logits = nn.Linear(self.hidden_size, self.n_classes)
        nn.init.xavier_uniform_(self.hidden_to_logits.weight, gain=1)

        ### END YOUR CODE

    def embedding_lookup(self, t):
        """ Utilize `self.pretrained_embeddings` to map input `t` from input tokens (integers)
            to embedding vectors.

            PyTorch Notes:
                - `self.pretrained_embeddings` is a torch.nn.Embedding object that we defined in __init__
                - Here `t` is a tensor where each row represents a list of features. Each feature is represented by an integer (input token).
                - In PyTorch the Embedding object, e.g. `self.pretrained_embeddings`, allows you to
                    go from an index to embedding. Please see the documentation (https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding)
                    to learn how to use `self.pretrained_embeddings` to extract the embeddings for your tensor `t`.

            @param t (Tensor): input tensor of tokens (batch_size, n_features)

            @return x (Tensor): tensor of embeddings for words represented in t
                                (batch_size, n_features * embed_size)
        """
        ### YOUR CODE HERE (~1-3 Lines)

        x = self.pretrained_embeddings(t)
        x = x.view(x.size(0), -1) # resize x into 2 dimensions.

        ### END YOUR CODE
        return x


    def forward(self, t):
        """ Run the model forward.

            Note that we will not apply the softmax function here because it is included in the loss function nn.CrossEntropyLoss

            PyTorch Notes:
                - Every nn.Module object (PyTorch model) has a `forward` function.
                - When you apply your nn.Module to an input tensor `t` this function is applied to the tensor.
                    For example, if you created an instance of your ParserModel and applied it to some `t` as follows,
                    the `forward` function would called on `t` and the result would be stored in the `output` variable:
                        model = ParserModel()
                        output = model(t) # this calls the forward function
                - For more details checkout: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward

        @param t (Tensor): input tensor of tokens (batch_size, n_features)

        @return logits (Tensor): tensor of predictions (output after applying the layers of the network)
                                 without applying softmax (batch_size, n_classes)
        """
        ###  YOUR CODE HERE (~3-5 lines)

        e = self.embedding_lookup(t)
        x = self.embed_to_hidden(e)
        h = F.relu(x)
        d = self.dropout(h)
        logits = self.hidden_to_logits(d)

        ### END YOUR CODE
        return logits

再寫好run.py：

def train(parser, train_data, dev_data, output_path, batch_size=1024, n_epochs=10, lr=0.0005):
    """ Train the neural dependency parser.

    @param parser (Parser): Neural Dependency Parser
    @param train_data ():
    @param dev_data ():
    @param output_path (str): Path to which model weights and results are written.
    @param batch_size (int): Number of examples in a single batch
    @param n_epochs (int): Number of training epochs
    @param lr (float): Learning rate
    """
    best_dev_UAS = 0


    ### YOUR CODE HERE (~2-7 lines)

    optimizer = optim.Adam(parser.model.parameters(), lr)
    loss_func = nn.CrossEntropyLoss()

    ### END YOUR CODE

def train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, batch_size):
    """ Train the neural dependency parser for single epoch.

    Note: In PyTorch we can signify train versus test and automatically have
    the Dropout Layer applied and removed, accordingly, by specifying
    whether we are training, `model.train()`, or evaluating, `model.eval()`

    @param parser (Parser): Neural Dependency Parser
    @param train_data ():
    @param dev_data ():
    @param optimizer (nn.Optimizer): Adam Optimizer
    @param loss_func (nn.CrossEntropyLoss): Cross Entropy Loss Function
    @param batch_size (int): batch size
    @param lr (float): learning rate

    @return dev_UAS (float): Unlabeled Attachment Score (UAS) for dev data
    """
    parser.model.train() # Places model in "train" mode, i.e. apply dropout layer
    n_minibatches = math.ceil(len(train_data) / batch_size)
    loss_meter = AverageMeter()

    with tqdm(total=(n_minibatches)) as prog:
        for i, (train_x, train_y) in enumerate(minibatches(train_data, batch_size)):
            optimizer.zero_grad()   # remove any baggage in the optimizer
            loss = 0. # store loss for this batch here
            train_x = torch.from_numpy(train_x).long()
            train_y = torch.from_numpy(train_y.nonzero()[1]).long()

            ### YOUR CODE HERE (~5-10 lines)
            
            logits = parser.model.forward(train_x)
            loss += loss_func(logits, train_y)
            loss.backward()
            optimizer.step()

            ### END YOUR CODE
            prog.update(1)
            loss_meter.update(loss.item())

用debug=True檢驗結果：averge loss=0.14, UAS=72.5
在这里插入图片描述

– END –