# Practical 12 

This practical has two parts (Jupyter notebooks), **Part I: Word embeddings** and **Part II: Transformers**.

# Part I: Word embeddings

This Jupyter Notebook consists of the following parts:
1. [Word Embeddings](#word_embeddings) 
 1. [Word2Vec](#word2vec) 
 2. [Word2Vec Architectures](#word2vec_arch) 
 3. [Word Embeddings Visualisation](#visual)
2. [Exploring Word Vectors with GloVe](#glove)
 1. [Loading Word Vectors](#loading)
 2. [Finding Closest Vectors](#finding)
 3. [Word Analogies with Vector Arithmetic](#analogies)
3. [Motivation for Part II: Transformers](#transformers)


If you are interested in a Skip-Gram implementation, check out the supplementary Jupyter notebook for Part I.

---

# Word embeddings


## Word2Vec

There are two classes of vector models: count-based (TF-IDF, Bag-of-Words) and neural-based. In this practical we will be focusing on neural word embeddings, i.e. word embeddings learned by a neural network.

**Main idea:** use neural architectures that predict (rather than count) the next word or the context of a given word.

One of the best-known such models is **Word2Vec**, introduced by Mikolov et al. (2013). It is based on a neural network that predicts the probability of a word given its context. The main papers on the topic are:

* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

These vectors are usually referred to as **_distributed representations of words_** or **_word embeddings_**.

_As word embeddings are a key building block of deep learning models for NLP, word2vec is often assumed to belong to the same group. Technically however, word2vec is not be considered to be part of deep learning, as its architecture is neither deep nor uses non-linearities_ 

*a quote from [Sebastian Ruder's blog](https://ruder.io/word-embeddings-1/) 



## Word2Vec Architectures

There are two Word2Vec architectures: Skip-Gram and CBOW.

**Skip-Gram** predicts context words given the central word. Skip-Gram with negative sampling is the most popular approach.

**CBOW (Continuous Bag-of-Words)** predicts the central word from the sum of context vectors. This simple sum of word vectors is called "bag of words", which gives the name for the model.




*the image is taken from [Lena Voita's blog](https://lena-voita.github.io/nlp_course/word_embeddings.html#main_content) 
*if interested, please check it out; it is very informative and well illustrated
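
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one would see. It is not taken from the Word2Vec code: the function names and the `window` parameter are illustrative assumptions, and real implementations add subsampling, negative sampling and other details.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: Skip-Gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


sentence = "the cat chased the mouse".split()
print(skipgram_pairs(sentence, window=1)[:4])
print(cbow_examples(sentence, window=1)[1])
```

Skip-Gram turns every (center, neighbour) combination into its own training pair, while CBOW produces one example per position, with the whole context window as input.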

### How does it work?

Word2vec takes a large text corpus as input and maps each word to a vector, producing word coordinates as output. It first builds a vocabulary from the input text and then learns a vector representation for each word in it. The representation is learned from contextual proximity: words that occur in the text next to the same words (and therefore, according to the distributional hypothesis, have a similar meaning) will have nearby coordinates in the vector space. 
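
The vocabulary-building step can be sketched roughly as follows. This is an assumption-laden toy version, not the original Word2Vec code: `build_vocab`, `min_count`, `word_to_index` and `index_to_word` are hypothetical names chosen for illustration.

```python
from collections import Counter


def build_vocab(tokens, min_count=1):
    """Map each sufficiently frequent word to an integer index."""
    counts = Counter(tokens)
    # Most frequent words first; ties are broken alphabetically for determinism
    index_to_word = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return word_to_index, index_to_word


corpus = "the cat sat on the mat the end".split()
word_to_index, index_to_word = build_vocab(corpus)
print(index_to_word)
print(word_to_index["the"])
```

Each index then selects one row of the embedding matrix that the network learns.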

To measure how close two words are, the cosine similarity or the Euclidean distance between their vectors is usually used.
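
As a small illustration of the two measures, here is a plain-Python sketch on made-up three-dimensional vectors (the vectors and function names are assumptions for this example only):

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice as long
w = [0.0, 0.0, 3.0]  # orthogonal to u

print(euclidean_distance(u, v), cosine_similarity(u, v))
print(euclidean_distance(u, w), cosine_similarity(u, w))
```

Note that `u` and `v` point in the same direction, so their cosine similarity is (up to rounding) 1 even though their Euclidean distance is not 0: cosine similarity ignores vector length, Euclidean distance does not.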

Using distributed representations you can build semantic proportions (also known as analogies) and solve examples like:

*king : man = queen : woman*
 $\Rightarrow$
*king - man + woman ≈ queen*


## Word Embeddings Visualization

Go to https://projector.tensorflow.org/ and visualize Word2Vec embeddings. 

Original Word2Vec repository: https://code.google.com/archive/p/word2vec/

---

# Exploring Word Vectors with GloVe

As we have seen, word2vec algorithms (such as Skip-Gram) predict words in a context (e.g. what is the most likely word to appear in "the cat ? the mouse"), while GloVe vectors are based on global co-occurrence counts across the corpus. See [How is GloVe different from word2vec?](https://www.quora.com/How-is-GloVe-different-from-word2vec) on Quora for more detailed explanations.

The best feature of GloVe is that multiple sets of pre-trained vectors are easily available for [download](https://nlp.stanford.edu/projects/glove/), so that's what we'll use here.

This section of the notebook is taken from [practical-pytorch tutorials](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb).

## Installing torchtext

In [1]:
#! pip install torchtext


## Loading Word Vectors
Torchtext includes functions to download GloVe (and other) embeddings.

In [1]:
import torch
from torchtext.vocab import GloVe

In [2]:
glove = GloVe(name='6B', dim=50)
print('Loaded {} words'.format(len(glove.itos)))

.vector_cache/glove.6B.zip: 862MB [02:42, 5.32MB/s] 
100%|█████████▉| 399999/400000 [00:07<00:00, 55651.76it/s]


Loaded 400000 words


The returned GloVe object includes the following attributes:

- `stoi` (string-to-index): a dictionary mapping words to indexes
- `itos` (index-to-string): a list of words ordered by index
- `vectors`: the actual vectors. To get a word's vector, look up its index in `stoi` and use it to index into `vectors`:

In [3]:
def get_word(word):
    return glove.vectors[glove.stoi[word]]


## Finding Closest Vectors

Going from word → vector is easy enough, but to go from vector → word takes more work. Here I'm (naively) calculating the distance for each word in the vocabulary, and sorting based on that distance:

Anyone with a suggestion for optimizing this, please let me know!

In [4]:
from tqdm.notebook import tqdm

def closest(vec, n=10):
    """
    Find the closest words for a given vector
    """
    # Euclidean distance from vec to every word vector in the vocabulary
    all_dists = [(w, torch.dist(vec, get_word(w))) for w in tqdm(glove.itos)]
    # Sort by distance and keep the n nearest words
    return sorted(all_dists, key=lambda t: t[1])[:n]

This will return a list of (word, distance) tuples. Here's a helper function to print that list:

In [5]:
def print_tuples(tuples):
    # Each entry is a (word, distance) pair
    for word, dist in tuples:
        print('(%.4f) %s' % (dist, word))

Now using a known word vector we can see which other vectors are closest:

In [8]:
print_tuples(closest(get_word('neuron')))


(0.0000) neuron
(3.6448) neuronal
(3.7174) presynaptic
(3.7488) synapse
(3.7969) excitability
(3.8072) neurons
(3.8454) synapses
(3.9334) pre-synaptic
(3.9602) axon
(4.0266) axons
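
One possible way to speed up the naive loop in `closest` is to compute all distances in a single tensor operation. The sketch below is an assumption rather than part of the original tutorial: `closest_vectorized` is a hypothetical helper that takes the vector matrix and the word list as arguments, and the toy tensors stand in for `glove.vectors` and `glove.itos`.

```python
import torch


def closest_vectorized(vec, vectors, itos, n=10):
    """Return the n (word, distance) pairs nearest to vec, using one batched computation."""
    # Euclidean distance from vec to every row of vectors at once (broadcasting)
    dists = torch.norm(vectors - vec, dim=1)
    # largest=False selects the n smallest distances, sorted in ascending order
    top = torch.topk(dists, k=min(n, len(itos)), largest=False)
    return [(itos[i], d) for d, i in zip(top.values.tolist(), top.indices.tolist())]


# Toy stand-ins for glove.vectors and glove.itos (made-up values)
toy_itos = ['cat', 'dog', 'car']
toy_vectors = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

print(closest_vectorized(toy_vectors[0], toy_vectors, toy_itos, n=2))
```

With the real embeddings, the call would look like `closest_vectorized(get_word('neuron'), glove.vectors, glove.itos)`. The distances come back as plain floats, so the result can still be passed to `print_tuples`.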



## Word Analogies with Vector Arithmetic
The most interesting feature of a well-trained word vector space is that certain semantic relationships (beyond just closeness of words) can be captured with regular vector arithmetic.



(image borrowed from [a slide from Omer Levy and Yoav Goldberg](https://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/))

In [12]:
# In the form w1 : w2 :: w3 : ?
def analogy(w1, w2, w3, n=10, filter_given=True):
    # w2 - w1 + w3 = w4
    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3), n=n)
    # Print the arithmetic in the order it is actually computed
    print('\n[%s - %s + %s = ?]' % (w2, w1, w3))
    # Optionally filter out given words (this can leave fewer than n results)
    if filter_given:
        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]

    print_tuples(closest_words[:n])

In [13]:
analogy('king', 'man', 'queen')


[man - king + queen = ?]
(2.8391) woman
(3.3545) girl
(3.9518) boy
(4.0233) her
(4.0554) herself
(4.1365) she
(4.3874) blind
(4.4540) mother
(4.4820) lover


**Comment:** One application of pre-trained word embeddings is to initialise the embedding layer of your model with them, instead of starting from randomly initialised vectors that are learned during training. 

These pre-trained word embeddings (from Word2Vec, GloVe, etc.) can either be kept frozen or fine-tuned during training.
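
In PyTorch, both options can be sketched with `nn.Embedding.from_pretrained`. The example below is only an illustration: the random `pretrained` tensor is a made-up stand-in for a real matrix such as `glove.vectors`, and the variable names are assumptions.

```python
import torch
import torch.nn as nn

# Made-up stand-in for a pre-trained matrix: 5 words, 4 dimensions
pretrained = torch.randn(5, 4)

# freeze=True keeps the vectors fixed; freeze=False lets training update them
static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
tuned_emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

token_ids = torch.tensor([0, 3, 3])
print(static_emb(token_ids).shape)       # one 4-dimensional vector per token id
print(static_emb.weight.requires_grad)   # frozen layer: no gradient updates
print(tuned_emb.weight.requires_grad)    # trainable layer: updated by the optimiser
```

Freezing is often a reasonable default when the training set is small, while fine-tuning lets the vectors adapt to the task at the cost of more trainable parameters.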


---

# Motivation for Part II: Transformers

The word embeddings discussed above have one major limitation: they are **static**. Each word has a single vector that does not change across contexts.

**Contextualized** word embeddings, in contrast, give a word different vectors depending on the meaning it carries in the context of the sentence.

_"If we’re using this GloVe representation, then the word “stick” would be represented by this vector no-matter what the context was. “Wait a minute” said a number of NLP researchers (Peters et. al., 2017, McCann et. al., 2017, and yet again Peters et. al., 2018 in the ELMo paper), “stick” has multiple meanings depending on where it’s used. Why not give it an embedding based on the context it’s used in – to both capture the word meaning in that context as well as other contextual information?”. And so, contextualized word-embeddings were born..."_



*the image and the quote are taken from [Jay Alammar's blogpost](https://jalammar.github.io/illustrated-bert/) about BERT, ELMo and co.

Contextualized embeddings can be obtained from Transformer-based models such as BERT, which are usually pre-trained to predict randomly masked words. 
See **Part II** of this practical to learn more about Transformers!