8 minute read

1. Introduction to Transformers


In mid-2017, the paper "Attention Is All You Need" was published by the Google Brain and Google Research teams. It introduced the Transformer, a novel neural network architecture based entirely on a self-attention mechanism. The paper reported that the Transformer outperforms both recurrent and convolutional models on academic English-to-German and English-to-French translation benchmarks.

Transformers drew even more attention in late 2018, when the Google AI Language team pre-trained a Transformer network on a huge corpus of raw text (English Wikipedia and BookCorpus) and called the result BERT. The BERT paper reported state-of-the-art (SOTA) performance on the GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks, and on the SQuAD v1.1 question-answering benchmark.

After this, it was clear that transformers deserved more focus and were worth trying in other domains such as computer vision, recommendation systems, time series, etc. Today, researchers use transformer models in most of these domains.

In this blog, we will focus on transformer basics: self-attention, multi-head attention, tokenization, fine-tuning, etc.

We will use BART, a transformer-based model. Its architecture is essentially the same as the vanilla transformer (2017), except that it replaces the ReLU activation in the feed-forward layers with GeLU and the sine/cosine positional embeddings with learned positional embeddings. (If this doesn't make sense yet, don't worry; it will by the end of this blog.)

from transformers.models.bart.modeling_bart import *
from transformers import BartTokenizer
from tokenizers import ByteLevelBPETokenizer
import glob
import torch

2. Tokenization


Tokenization is the process of splitting a sentence into tokens. A token is a single unit of information, such as a word or sub-word in a sentence.

To build a tokenizer, we need to define the vocabulary size and a few other parameters.

Learn more about Tokenization

2.1 Training

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# files
input_files = glob.glob("raw_data/*.txt")
# Customize training
tokenizer.train(files=input_files, vocab_size=1000, min_frequency=2, special_tokens=[
    "<s>", #start of sentence
    "<pad>", #padding token
    "</s>", #end of sentence
    "<unk>", #unknown words will be assigned
    "<mask>", #used in self-training i.e. model pretraining
])
!mkdir sample_tokenizer
tokenizer.save_model("./sample_tokenizer")
['./sample_tokenizer/vocab.json', './sample_tokenizer/merges.txt']

2.2 Loading

# load tokenizer trained model
tokenizer = BartTokenizer.from_pretrained("sample_tokenizer/")
tokenizer
PreTrainedTokenizer(name_or_path='sample_tokenizer/', vocab_size=1000, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})

2.3 Tokenize

text = "Hi, I love NLP models."
tokenizer.tokenize(text)
['H', 'i', ',', 'ĠI', 'Ġl', 'o', 've', 'ĠNLP', 'Ġmodels', '.']

Ġ marks a preceding space, so that the original input text can be reconstructed from the tokens.
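
For example, the tokenizer's convert_tokens_to_string helper joins the tokens back into the original text:

tokenizer.convert_tokens_to_string(tokenizer.tokenize(text))
'Hi, I love NLP models.'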

2.4 Encode Tokens into ids

encoded_ids = tokenizer.encode(text) # add_special_tokens=False
encoded_ids
[0, 44, 77, 16, 319, 330, 83, 374, 947, 854, 18, 2]
tokenizer.decode(encoded_ids)
'<s>Hi, I love NLP models.</s>'

<s> is the start-of-sentence token and </s> is the end-of-sentence token; encode adds these special tokens automatically.
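
To encode without them, pass add_special_tokens=False; the result is the same id sequence as above minus the leading 0 (<s>) and trailing 2 (</s>):

tokenizer.encode(text, add_special_tokens=False)
[44, 77, 16, 319, 330, 83, 374, 947, 854, 18]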

3. Modeling

[Figure: the vanilla transformer (encoder-decoder) architecture, which BART closely follows]

3.1 Model Configuration

A transformer is made up of multiple blocks and sub-blocks, e.g. the encoder, the decoder, multi-head attention, etc.

In the config, we define the values of the parameters for these blocks and sub-blocks.

# see default settings
#BartConfig()
config = BartConfig(encoder_layers=1, decoder_layers=1, vocab_size=1000)
config
BartConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 1,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 1,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "num_hidden_layers": 1,
  "pad_token_id": 1,
  "scale_embedding": false,
  "transformers_version": "4.23.1",
  "use_cache": true,
  "vocab_size": 1000
}

3.2 Build model from config

Let's build our first transformer model based on the above configuration.

model = BartModel(config=config)
model
BartModel(
  (shared): Embedding(1000, 1024, padding_idx=1)
  (encoder): BartEncoder(
    (embed_tokens): Embedding(1000, 1024, padding_idx=1)
    (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
    (layers): ModuleList(
      (0): BartEncoderLayer(
        (self_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): BartDecoder(
    (embed_tokens): Embedding(1000, 1024, padding_idx=1)
    (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
    (layers): ModuleList(
      (0): BartDecoderLayer(
        (self_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (activation_fn): GELUActivation()
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
)
# encoded_ids are the token ids of the text (see the tokenization section above)
# add a batch dimension: batch_size=1
encoded_ids = torch.tensor([encoded_ids])
encoded_ids.shape
torch.Size([1, 12])
text_feature_embeddings = model(encoded_ids).last_hidden_state
text_feature_embeddings.shape
torch.Size([1, 12, 1024])

text_feature_embeddings is the feature representation of the given text input. We can use this embedding representation for downstream tasks, e.g. classification, generation, clustering, etc.
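
For example, one common (though not the only) way to turn the per-token features into a single sentence vector is mean pooling; a minimal sketch:

# average over the 12 token positions -> one 1024-d vector for the sentence
sentence_embedding = text_feature_embeddings.mean(dim=1)
sentence_embedding.shape
torch.Size([1, 1024])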

4. Understand the Model Layer by Layer


Let's try to decode what is happening inside the transformer network, one block at a time.

4.1 Input Pre-processing > Word Embeddings

The first step is to build the word embedding module, which represents tokens as embedding vectors. We defined vocab_size=1000 both when training the tokenizer and in the config.

The word embedding block takes the encoded ids as input and returns, for each id, an embedding vector of size 1024 (the default d_model; it can be changed via the config).

# model.shared
model.encoder.embed_tokens
Embedding(1000, 1024, padding_idx=1)
word_embeddings = model.encoder.embed_tokens(encoded_ids)
word_embeddings.shape
torch.Size([1, 12, 1024])
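
The encoder, the decoder, and model.shared all point to the same weight matrix; a quick check:

model.encoder.embed_tokens.weight is model.shared.weight # the token embedding is shared
True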

4.2 Input Pre-processing > Position embeddings

Unlike RNN/LSTM/GRU models, a transformer processes all input tokens simultaneously. To capture the position of each token, there is a positional embedding block that produces a position embedding vector for every token.

There are multiple techniques for positional embeddings, e.g. fixed sine/cosine embeddings, learned positional embeddings, etc.

The vanilla transformer uses fixed sine/cosine positional embeddings, whereas BART uses learned positional embeddings.

model.encoder.embed_positions
BartLearnedPositionalEmbedding(1026, 1024)
pos_embeddings = model.encoder.embed_positions(encoded_ids)
pos_embeddings.shape
torch.Size([1, 12, 1024])
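
For comparison, here is a minimal sketch of the fixed sine/cosine encoding from the original transformer paper (BART does not use this; it is shown only to illustrate the alternative):

import math

def sinusoidal_positions(seq_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

sinusoidal_positions(12, 1024).shape
torch.Size([12, 1024])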

4.3 Input Pre-processing > Final Inputs

# sum of word and position embeddings: this is what goes into the encoder layers
# (the full BART encoder also applies layernorm_embedding and dropout on top of this sum)
input_embeds = word_embeddings + pos_embeddings
input_embeds.shape
torch.Size([1, 12, 1024])

4.4 Encoder

# query projection from input_embeds
model.encoder.layers[0].self_attn.q_proj(input_embeds).shape
torch.Size([1, 12, 1024])
# key projection from input_embeds
model.encoder.layers[0].self_attn.k_proj(input_embeds).shape
torch.Size([1, 12, 1024])
# value projection from input_embeds
model.encoder.layers[0].self_attn.v_proj(input_embeds).shape
torch.Size([1, 12, 1024])
# self-attention: returns (attn_output, attn_weights, past_key_value); we keep the attention output
encoder_self_attn = model.encoder.layers[0].self_attn(input_embeds)[0]
print(encoder_self_attn.shape)
encoder_self_attn
torch.Size([1, 12, 1024])

tensor([[[-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023],
         [-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023],
         [-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023],
         ...,
         [-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023],
         [-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023],
         [-0.0034,  0.0026,  0.0081,  ...,  0.0003,  0.0034, -0.0023]]],
       grad_fn=<ViewBackward0>)
# Add (residual) + Norm + feed-forward, following BartEncoderLayer's forward pass
residual = input_embeds
hidden_states = model.encoder.layers[0].self_attn_layer_norm(residual + encoder_self_attn)
residual = hidden_states
hidden_states = model.encoder.layers[0].activation_fn(model.encoder.layers[0].fc1(hidden_states))
hidden_states = model.encoder.layers[0].fc2(hidden_states)
encoder_hidden_states = model.encoder.layers[0].final_layer_norm(residual + hidden_states)
encoder_hidden_states.shape
torch.Size([1, 12, 1024])
model.encoder.layers[0]
BartEncoderLayer(
  (self_attn): BartAttention(
    (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (activation_fn): GELUActivation()
  (fc1): Linear(in_features=1024, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=1024, bias=True)
  (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
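
To peek inside the self-attention block itself, here is a minimal sketch of the scaled dot-product attention that BartAttention computes (a single-head view that ignores the multi-head reshaping, masking and dropout, so the numbers will not exactly match the module's output):

attn = model.encoder.layers[0].self_attn
q = attn.q_proj(input_embeds) * attn.scaling  # queries, scaled by 1/sqrt(head_dim)
k = attn.k_proj(input_embeds)                 # keys
v = attn.v_proj(input_embeds)                 # values
attn_weights = torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # [1, 12, 12] attention matrix
attn_output = attn.out_proj(attn_weights @ v)                  # weighted sum of values, projected back
attn_output.shape
torch.Size([1, 12, 1024])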

4.5 Output Pre-processing

For the decoder input, we shift the input tokens one position to the right (prepending the decoder start token) and pass them through the word embedding and positional embedding blocks.

encoded_ids
tensor([[  0,  44,  77,  16, 319, 330,  83, 374, 947, 854,  18,   2]])
decoder_inputs = shift_tokens_right(encoded_ids, pad_token_id=1, decoder_start_token_id=2)
decoder_inputs
tensor([[  2,   0,  44,  77,  16, 319, 330,  83, 374, 947, 854,  18]])
# model.shared
decoder_word_embeddings = model.decoder.embed_tokens(decoder_inputs)
decoder_word_embeddings.shape
torch.Size([1, 12, 1024])
decoder_pos_embeddings = model.decoder.embed_positions(decoder_inputs)
decoder_pos_embeddings.shape
torch.Size([1, 12, 1024])
decoder_input_embeds = decoder_word_embeddings + decoder_pos_embeddings
decoder_input_embeds.shape
torch.Size([1, 12, 1024])

4.6 Decoder

model.decoder.layers[0]
BartDecoderLayer(
  (self_attn): BartAttention(
    (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (activation_fn): GELUActivation()
  (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (encoder_attn): BartAttention(
    (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (fc1): Linear(in_features=1024, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=1024, bias=True)
  (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
# decoder self-attention over the shifted inputs (the full decoder also applies a causal mask here)
decoder_self_attn = model.decoder.layers[0].self_attn(decoder_input_embeds)[0]
decoder_self_attn.shape
torch.Size([1, 12, 1024])
# encoder-decoder attention i.e. cross attention
decoder_encoder_attn = model.decoder.layers[0].encoder_attn(hidden_states=decoder_self_attn,
                key_value_states=encoder_hidden_states)[0]
decoder_encoder_attn.shape
torch.Size([1, 12, 1024])
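
One detail the layer-by-layer calls above skip: in the full decoder forward pass, the self-attention receives a causal attention mask so each position can only attend to itself and earlier positions. A minimal sketch of such an additive mask (an illustration, not the library's internal helper):

seq_len = decoder_inputs.shape[1]
# -inf above the diagonal blocks attention to future positions; 0 elsewhere leaves scores unchanged
causal_mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)
causal_mask.shape
torch.Size([12, 12])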

4.7 Output Post-Processing

# feature representation of a given input sentence
model_output_features = model(encoded_ids)
model_output_features.keys()
odict_keys(['last_hidden_state', 'past_key_values', 'encoder_last_hidden_state'])
# final feature representation
model_output_features.last_hidden_state
tensor([[[ 1.3585,  0.3638,  0.3431,  ...,  1.1773, -1.6471,  0.6394],
         [-0.5528,  1.0345,  0.1257,  ..., -0.3872, -1.4160,  0.4665],
         [-0.6704,  0.1079,  0.5481,  ..., -0.1823, -0.0264, -1.1273],
         ...,
         [ 1.3628,  1.1421,  0.8492,  ..., -0.4157,  1.0859, -1.2649],
         [-0.0734,  0.8632, -0.9869,  ..., -0.0866, -1.4402, -0.8505],
         [ 0.0052,  0.4603,  0.6000,  ..., -0.5838, -0.2336, -0.3519]]],
       grad_fn=<NativeLayerNormBackward0>)
# encoder last layer feature representation
model_output_features.encoder_last_hidden_state
tensor([[[ 1.0709, -1.0376, -1.3312,  ...,  0.7923, -0.5834, -0.0582],
         [-1.5943, -0.5630,  0.1977,  ...,  0.5616, -0.6423,  1.4250],
         [ 1.1451, -0.1073,  0.0637,  ..., -0.7085, -0.3052,  0.7960],
         ...,
         [-1.3538, -0.9423, -0.1416,  ...,  0.9372, -0.6864, -0.1389],
         [ 0.9959, -0.4869,  0.2692,  ...,  0.4765,  0.1438,  1.7434],
         [ 0.7082, -0.0759, -0.5594,  ...,  0.4646,  0.1940,  1.9280]]],
       grad_fn=<NativeLayerNormBackward0>)

4.7.1 Generative Language Model Task

Now, if you want to build a generative language model using transformers, you can take the feature representation from the last decoder layer and pass it through a linear layer that projects onto the vocabulary (vocab_size).

config.d_model, model.shared.num_embeddings
(1024, 1000)
# Language Model head
lm_head = torch.nn.Linear(config.d_model, model.shared.num_embeddings, bias=False)
lm_head
Linear(in_features=1024, out_features=1000, bias=False)
# [batch_size, max_token_len, d_model]
lm_logits = lm_head(model_output_features[0])
lm_logits.shape
torch.Size([1, 12, 1000])
# loss function
loss_fct = torch.nn.CrossEntropyLoss()
loss_fct
CrossEntropyLoss()
# target labels: the original (unshifted) token ids, i.e. the next-token targets for the shifted decoder inputs
labels = encoded_ids
labels.shape
torch.Size([1, 12])
generative_lm_loss = loss_fct(lm_logits.view(-1, config.vocab_size), labels.view(-1))
generative_lm_loss
tensor(7.0211, grad_fn=<NllLossBackward0>)
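
As a quick illustration of how these logits would be used at inference time, here is a minimal greedy-decoding step (the model is untrained, so the prediction is meaningless):

# pick the most likely token for the last position in the sequence
next_token_logits = lm_logits[:, -1, :]            # [1, 1000]
next_token_id = next_token_logits.argmax(dim=-1)   # [1]
tokenizer.decode(next_token_id)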

4.7.2 Sequence Classification Task

Using the model's output features, we can build heads for several sequence-level tasks:

  • Single-label classification, i.e. multi-class classification
  • Multi-label classification
  • Regression
model_output_features.encoder_last_hidden_state.shape
torch.Size([1, 12, 1024])
# multi class classification
input_dim = 1024
inner_dim = 512
pooler_dropout=0.2
num_classes = 3

# target label
labels = torch.tensor([2])

# sentence representation: the encoder's final hidden state at the last token (</s>) position
# (note: BartForSequenceClassification instead pools the decoder's hidden state at the </s> token)
sentence_representation = model_output_features.encoder_last_hidden_state[:, -1, :]
# pooling layer
dense = torch.nn.Linear(input_dim, inner_dim)
dropout = torch.nn.Dropout(p=pooler_dropout)
#classification head
out_proj = torch.nn.Linear(inner_dim, num_classes)

sentence_representation = dense(sentence_representation)
sentence_representation = dropout(sentence_representation)
logits = out_proj(sentence_representation)

# loss function
loss_fct = torch.nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, num_classes), labels.view(-1))
loss
tensor(1.6008, grad_fn=<NllLossBackward0>)
# multi label
input_dim = 1024
inner_dim = 512
pooler_dropout=0.2
num_classes = 3

# target label
labels = torch.tensor([[0,1,0]], dtype=torch.float32)

# sentence representation: the encoder's final hidden state at the last token (</s>) position
sentence_representation = model_output_features.encoder_last_hidden_state[:, -1, :]
# pooling layer
dense = torch.nn.Linear(input_dim, inner_dim)
dropout = torch.nn.Dropout(p=pooler_dropout)
# classification head
out_proj = torch.nn.Linear(inner_dim, num_classes)

sentence_representation = dense(sentence_representation)
sentence_representation = dropout(sentence_representation)
logits = out_proj(sentence_representation)

#loss function
loss_fct = torch.nn.BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
loss
tensor(0.8175, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
# regression
input_dim = 1024
inner_dim = 512
pooler_dropout=0.2

# target label
labels = torch.tensor([0.2], dtype=torch.float32)

# sentence representation: the encoder's final hidden state at the last token (</s>) position
sentence_representation = model_output_features.encoder_last_hidden_state[:, -1, :]
# pooling layer
dense = torch.nn.Linear(input_dim, inner_dim)
dropout = torch.nn.Dropout(p=pooler_dropout)
# regression head
out_proj = torch.nn.Linear(inner_dim, 1)

sentence_representation = dense(sentence_representation)
sentence_representation = dropout(sentence_representation)
logits = out_proj(sentence_representation)

# loss function
loss_fct = torch.nn.MSELoss()
loss = loss_fct(logits.squeeze(), labels.squeeze())
loss
tensor(0.0389, grad_fn=<MseLossBackward0>)
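
In practice you rarely need to wire these heads by hand; the transformers library ships a ready-made classification model that bundles BART with a pooled classification head. A minimal sketch using the same config (num_labels defaults to 3 here, matching id2label in the config above):

from transformers import BartForSequenceClassification

clf_model = BartForSequenceClassification(config)
clf_outputs = clf_model(encoded_ids, labels=torch.tensor([2]))
# clf_outputs.loss is a scalar (its value depends on the random initialization); logits have shape [1, 3]
clf_outputs.loss, clf_outputs.logits.shape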

5. References

  1. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
  2. https://jalammar.github.io/illustrated-transformer/
  3. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
  4. https://github.com/huggingface/transformers

6. Cited as

@article{kumar2022decodetransformers,
  title   = "Decode the transformers network",
  author  = "Kumar, Ankur",
  journal = "ankur3107.github.io",
  year    = "2022",
  url     = "https://ankur3107.github.io/blogs/decode-the-transformers-network/"
}
