# Part I: Word embeddings [Supplementary material]


## The Skip-Gram Model
Let's implement the Skip-Gram model using PyTorch and then explore the resulting word representations.

**NOTE:** this notebook requires PyTorch to be installed: https://anaconda.org/pytorch/pytorch

![skip-gram.png](attachment:skip-gram.png)

In [None]:
#! pip install torch

Let's start with a simple corpus:

In [2]:
corpus = [
    'he is a king',
    'she is a queen',
    'she is mad',
    'she is in love',
    'a mountain falls',
    'paris is france capital',
]

In [3]:
def tokenize_corpus(corpus):
    tokens = [x.split() for x in corpus]
    return tokens

In [4]:
tokenized_corpus = tokenize_corpus(corpus)
vocabulary = {word for doc in tokenized_corpus for word in doc}
word2idx = {w:idx for (idx, w) in enumerate(vocabulary)}

In [5]:
word2idx

{'is': 0,
 'paris': 1,
 'capital': 2,
 'she': 3,
 'falls': 4,
 'in': 5,
 'love': 6,
 'france': 7,
 'he': 8,
 'queen': 9,
 'mad': 10,
 'mountain': 11,
 'king': 12,
 'a': 13}
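
Because `vocabulary` is a Python set and string hashing is randomized per process by default, the iteration order, and therefore the indices in `word2idx`, can differ from one session to the next. The sketch below is one possible way to get a reproducible mapping by sorting the vocabulary first; it yields different indices from the ones printed above, and `idx2word` is a hypothetical helper name that the rest of the notebook does not use.

```python
corpus = [
    'he is a king',
    'she is a queen',
    'she is mad',
    'she is in love',
    'a mountain falls',
    'paris is france capital',
]

tokenized_corpus = [sentence.split() for sentence in corpus]

# Sorting gives every word a stable index, independent of set ordering.
vocabulary = sorted({word for doc in tokenized_corpus for word in doc})
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

# Hypothetical reverse mapping, handy for decoding indices back to words.
idx2word = {idx: word for word, idx in word2idx.items()}

print(len(vocabulary))
print(word2idx['a'], word2idx['she'])
```

With sorted words, `'a'` always receives index 0 and `'she'` the last index, regardless of the Python session.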

We want to build pairs of words that appear inside the same context.
![samples.png](attachment:samples.png)

In [6]:
import numpy as np

def build_training(tokenized_corpus, word2idx, window_size=2):
    idx_pairs = []

    # for each sentence
    for sentence in tokenized_corpus:
        indices = [word2idx[word] for word in sentence]
        # for each word, treated as the center word
        for center_word_pos in range(len(indices)):
            # for each position inside the window
            for w in range(-window_size, window_size + 1):
                context_word_pos = center_word_pos + w
                # skip positions outside the sentence and the center word itself
                if context_word_pos < 0 or \
                   context_word_pos >= len(indices) or \
                   center_word_pos == context_word_pos:
                    continue
                context_word_idx = indices[context_word_pos]
                idx_pairs.append((indices[center_word_pos], context_word_idx))
    return np.array(idx_pairs)

In [7]:
training_pairs = build_training(tokenized_corpus, word2idx)

In [8]:
training_pairs

array([[ 8, 0],
 [ 8, 13],
 [ 0, 8],
 [ 0, 13],
 [ 0, 12],
 [13, 8],
 [13, 0],
 [13, 12],
 [12, 0],
 [12, 13],
 [ 3, 0],
 [ 3, 13],
 [ 0, 3],
 [ 0, 13],
 [ 0, 9],
 [13, 3],
 [13, 0],
 [13, 9],
 [ 9, 0],
 [ 9, 13],
 [ 3, 0],
 [ 3, 10],
 [ 0, 3],
 [ 0, 10],
 [10, 3],
 [10, 0],
 [ 3, 0],
 [ 3, 5],
 [ 0, 3],
 [ 0, 5],
 [ 0, 6],
 [ 5, 3],
 [ 5, 0],
 [ 5, 6],
 [ 6, 0],
 [ 6, 5],
 [13, 11],
 [13, 4],
 [11, 13],
 [11, 4],
 [ 4, 13],
 [ 4, 11],
 [ 1, 0],
 [ 1, 7],
 [ 0, 1],
 [ 0, 7],
 [ 0, 2],
 [ 7, 1],
 [ 7, 0],
 [ 7, 2],
 [ 2, 0],
 [ 2, 7]])
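
The index pairs are easier to check once they are mapped back to words. The sketch below decodes the first ten rows of the array above, which come from the sentence 'he is a king'; the mapping is copied from the `word2idx` output shown earlier, and `idx2word`, `first_pairs` and `decoded` are hypothetical names introduced only for this illustration.

```python
# Index mapping copied from the `word2idx` output shown earlier.
word2idx = {'is': 0, 'paris': 1, 'capital': 2, 'she': 3, 'falls': 4,
            'in': 5, 'love': 6, 'france': 7, 'he': 8, 'queen': 9,
            'mad': 10, 'mountain': 11, 'king': 12, 'a': 13}
idx2word = {idx: word for word, idx in word2idx.items()}

# First ten rows of `training_pairs`: (center word, context word).
first_pairs = [(8, 0), (8, 13), (0, 8), (0, 13), (0, 12),
               (13, 8), (13, 0), (13, 12), (12, 0), (12, 13)]

decoded = [(idx2word[center], idx2word[context]) for center, context in first_pairs]
for center, context in decoded:
    print(f"{center:>5} -> {context}")
```

Each center word is paired with every other word at most two positions away, so 'he' pairs with 'is' and 'a' but not with 'king'.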

In [9]:
from tqdm.notebook import tqdm
from torch.autograd import Variable
import torch.nn.functional as F
import torch

def get_onehot_vector(word_idx, vocabulary):
    x = torch.zeros(len(vocabulary)).float()
    x[word_idx] = 1.0
    return x

def Skip_Gram(training_pairs, vocabulary, embedding_dims=5, learning_rate=0.001, epochs=10):
    torch.manual_seed(3)
    W1 = Variable(torch.randn(embedding_dims, len(vocabulary)).float(), requires_grad=True)
    W2 = Variable(torch.randn(len(vocabulary), embedding_dims).float(), requires_grad=True)
    losses = []
    for epo in tqdm(range(epochs)):
        loss_val = 0
        for input_word, target in training_pairs:
            x = Variable(get_onehot_vector(input_word, vocabulary)).float()
            y_true = Variable(torch.from_numpy(np.array([target])).long())

            # Matrix multiplication to obtain the input word embedding
            z1 = torch.matmul(W1, x)

            # Matrix multiplication to obtain the score of each vocabulary word
            z2 = torch.matmul(W2, z1)

            # Apply the log-softmax function
            log_softmax = F.log_softmax(z2, dim=0)
            # Compute the negative log-likelihood loss
            loss = F.nll_loss(log_softmax.view(1, -1), y_true)
            loss_val += loss.item()

            # Backpropagate to compute the gradients of the loss
            loss.backward()

            # Update the embeddings with one gradient descent step
            W1.data -= learning_rate * W1.grad.data
            W2.data -= learning_rate * W2.grad.data

            # Reset the gradients before the next pair
            W1.grad.data.zero_()
            W2.grad.data.zero_()

        # Average loss over all training pairs for this epoch
        losses.append(loss_val / len(training_pairs))

    return W1, W2, losses
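
Two details of the forward pass in `Skip_Gram` can be checked in isolation. The sketch below, which uses random weights and illustrative indices rather than the trained model, suggests that multiplying `W1` by a one-hot vector simply selects one column of `W1`, and that `log_softmax` followed by `nll_loss` gives the same value as `F.cross_entropy` applied to the raw scores. The seed, the chosen indices and the names `same_embedding` and `same_loss` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dims = 14, 5
W1 = torch.randn(embedding_dims, vocab_size)
W2 = torch.randn(vocab_size, embedding_dims)

center_idx, context_idx = 3, 10  # illustrative word indices

# One-hot input: multiplying by it just picks one column of W1.
x = torch.zeros(vocab_size)
x[center_idx] = 1.0
z1 = torch.matmul(W1, x)
same_embedding = torch.allclose(z1, W1[:, center_idx])

# Scores over the vocabulary, then the loss computed in two ways.
z2 = torch.matmul(W2, z1)
target = torch.tensor([context_idx])
loss_nll = F.nll_loss(F.log_softmax(z2, dim=0).view(1, -1), target)
loss_ce = F.cross_entropy(z2.view(1, -1), target)
same_loss = torch.allclose(loss_nll, loss_ce)

print(same_embedding, same_loss)
```

The first check is why the columns of `W1` can be read as word embeddings: the one-hot product is only a lookup.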

In [10]:
# 1000 epochs as in the original run; the remaining hyperparameters are assumed to be the defaults
W1, W2, losses = Skip_Gram(training_pairs, vocabulary, epochs=1000)
plot_loss(losses)
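
`plot_loss` is called above but is not defined anywhere in this notebook, so it would need to be defined in a cell that runs before the call. The sketch below is one possible version, assuming matplotlib is installed; the axis labels, the title and the returned `Axes` object are choices made for this example, not the original helper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt

def plot_loss(losses):
    """Plot the average loss per epoch and return the Axes for further tweaks."""
    fig, ax = plt.subplots()
    ax.plot(range(1, len(losses) + 1), losses)
    ax.set_xlabel("epoch")
    ax.set_ylabel("average NLL loss")
    ax.set_title("Skip-Gram training loss")
    return ax

# Made-up loss values, used only to exercise the helper.
example_losses = [2.9, 2.5, 2.2, 2.0, 1.9]
ax = plot_loss(example_losses)
```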

## Final Embedding Matrix
According to various sources, the Skip-Gram model can be implemented in two different ways:

- With shared parameters, meaning that W1 and W2 are the same matrix (W2 is the transpose of W1).
- Without shared parameters across layers, meaning that we end up with two different weight matrices. The final matrix W is the average of the two.

In [13]:
W = W1 + torch.t(W2)
W = (torch.t(W)/2).clone().detach()

In [14]:
W[word2idx["she"]], W[word2idx["mad"]]

(tensor([-1.5032, 0.5722, -0.2601, 0.4751, 0.0714]),
 tensor([-1.0277, 0.3649, -0.1252, 0.6154, 0.3178]))

In [15]:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances([W[word2idx["she"]].numpy()], [W[word2idx["falls"]].numpy()])

array([[2.9443595]], dtype=float32)

In [16]:
euclidean_distances([W[word2idx["she"]].numpy()], [W[word2idx["mad"]].numpy()])

array([[0.6063882]], dtype=float32)

As the previous example shows, the vector representing "she" is closer to the vector representing "mad" than to the vector representing "falls". This happens because "she" and "falls" never appear together inside the same context window, whereas "she" and "mad" do.
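
The distance between "she" and "mad" can be reproduced by hand from the two vectors printed earlier. The sketch below uses the rounded values copied from that output, so the result agrees with the `euclidean_distances` value only to about three decimal places.

```python
import numpy as np

# Embeddings copied from the output of W[word2idx["she"]], W[word2idx["mad"]] above.
she = np.array([-1.5032, 0.5722, -0.2601, 0.4751, 0.0714])
mad = np.array([-1.0277, 0.3649, -0.1252, 0.6154, 0.3178])

# Euclidean distance: square root of the summed squared differences.
distance = np.sqrt(np.sum((she - mad) ** 2))
print(float(distance))
```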

### Exercise:
Go back to the Skip-Gram function and change it so that it uses only one weight matrix instead of two.
Run the training again and comment on the results.



In [17]:
def Skip_Gram(training_pairs, vocabulary, embedding_dims=5, learning_rate=0.001, epochs=10):
    torch.manual_seed(3)
    # Single weight matrix, shared between the input and output layers
    W1 = Variable(torch.randn(embedding_dims, len(vocabulary)).float(), requires_grad=True)
    losses = []
    for epo in tqdm(range(epochs)):
        loss_val = 0
        for input_word, target in training_pairs:
            x = Variable(get_onehot_vector(input_word, vocabulary)).float()
            y_true = Variable(torch.from_numpy(np.array([target])).long())

            # Matrix multiplication to obtain the input word embedding
            z1 = torch.matmul(W1, x)

            # Scores for each vocabulary word, reusing W1 (transposed) as the output matrix
            z2 = torch.matmul(torch.transpose(W1, 0, 1), z1)

            # Apply the log-softmax function
            log_softmax = F.log_softmax(z2, dim=0)
            # Compute the negative log-likelihood loss
            loss = F.nll_loss(log_softmax.view(1, -1), y_true)
            loss_val += loss.item()

            # Backpropagate to compute the gradients of the loss
            loss.backward()

            # Update the embeddings; only W1 exists in this version
            W1.data -= learning_rate * W1.grad.data

            # Reset the gradient before the next pair
            W1.grad.data.zero_()

        # Average loss over all training pairs for this epoch
        losses.append(loss_val / len(training_pairs))

    return W1, losses

In [18]:
# 1000 epochs as in the original run; the remaining hyperparameters are assumed to be the defaults
W1, losses = Skip_Gram(training_pairs, vocabulary, epochs=1000)
plot_loss(losses)

In [None]:
# W1.T[word2idx["she"]], W1.T[word2idx["mad"]]

In [None]:
# euclidean_distances([W1.T[word2idx["she"]].detach().numpy()], [W1.T[word2idx["falls"]].detach().numpy()])

In [None]:
# euclidean_distances([W1.T[word2idx["she"]].detach().numpy()], [W1.T[word2idx["mad"]].detach().numpy()])