Create The Transformer With Tensorflow 2.0

Reading Time: 11 minutes

Hello everyone. It is now the greatest time of the year, and here we are today, ready to be amazed by Deep Learning.

Last time, we went through a neural machine translation project using the renowned Sequence-to-Sequence model empowered with Luong attention. As we saw, introducing an attention mechanism helped improve the Seq2Seq model’s performance to a significant extent. For those who haven’t seen my last post, I recommend that you have a skim here.

However, given that we are NLP practitioners or researchers, we have probably heard this all over the place:

Attention Is All You Need

Okay guys, let me present to you the one that we have all been longing for: the Transformer!

Overview

Speaking of the Transformer, I don’t mean the fancy, bulky Optimus Prime, but this:

Figure 1: The Transformer (from paper)

Ew, that looks scary. And believe it or not, today we are going to create the Transformer entirely from scratch. I know that seems impossible at first. But as you will see in a moment, with the help of Tensorflow 2.0 (and Keras at its core), building such a complicated model is no different from stacking up Lego pieces.

Concretely, today we will go through the steps below on the journey to create our own Transformer:

  • Create a simple and straightforward version to understand how the Transformer works
  • Update the simple version to enhance speed and optimize GPU memory

Now, let’s take a look at a couple of things that are worth checking out in advance.

Prerequisites

As usual, to get the most out of this blog post, I highly recommend doing the following first:

Have you done all of the above? Then we are ready to go. Let’s get started!

Note that by introducing the Transformer, I don’t mean that the Sequence-to-Sequence model sucks. For tasks like machine translation, the sequential nature of recurrent layers, with the help of an attention mechanism, is still capable of delivering great results.

Create the Transformer – the simple version

This time, I felt the urge to show you the quick-and-dirty version, which is what I actually wrote in the beginning. I strongly believe that will help you see how the paper was interpreted and follow along with ease.

With that being said, it’s time to talk business.

Just like the Seq2Seq model, the Transformer has two separate parts: the Encoder and the Decoder to deal with source sequences (English) and target sequences, respectively. Let’s take a closer look at the Encoder:

Figure 4: the Encoder (cut from paper)

As we can see from that sketch, the Encoder is made of four components:

  • Embedding
  • Positional Encoding
  • Multi-Head Attention
  • Position-wise Feed-Forward Network

That might sound exhausting. How can we create them all?

Although the Encoder consists of those four components, it turns out that we only need to create two of them ourselves: the Positional Encoding and the Multi-Head Attention.

Positional Encoding

We know that using RNN units is not efficient because of their sequential nature, so we got rid of them and, sadly, of their ability to treat input data as sequences too (i.e. A->B->C: C comes after B and B comes after A).

So, what do we do now? How about explicitly providing the absolute position information of each token within the sequence?

That kind of information, which the model can learn from, is usually called a feature. Yep, positional encoding is simply a feature! Here is how we can compute it:
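For position pos and dimension index i (out of d_model dimensions), the paper defines:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

In other words, even dimensions get a sine and odd dimensions get a cosine, each with a different wavelength.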

It may seem a little bit scary, but it is extremely easy to implement as-is in Python.
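Here is a minimal NumPy sketch of that formula. The function name positional_encoding and the (max_len, model_size) output layout are my own choices and may differ from the snippet that originally appeared here:

```python
import numpy as np

def positional_encoding(max_len, model_size):
    """Return a (max_len, model_size) matrix of sine/cosine positional features."""
    PE = np.zeros((max_len, model_size), dtype=np.float32)
    for pos in range(max_len):
        for i in range(0, model_size, 2):
            angle = pos / np.power(10000, i / model_size)
            PE[pos, i] = np.sin(angle)              # even dimensions: sine
            if i + 1 < model_size:
                PE[pos, i + 1] = np.cos(angle)      # odd dimensions: cosine
    return PE
```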

That is our positional encoding (features).

The Multi-Head Attention

And here we are at the core of the Transformer: the Multi-Head Attention. In fact, if you know how the Luong attention mechanism works (which you should by now), this will be very easy to implement because they are pretty similar.

Typically, as far as we knew, the decoder output would draw attention to the encoder output to decide where to put more weight.

Figure 6: Attention drawn to Encoder Output from Decoder Output

But that is no longer the case with the Transformer since it does not have the sequential constraint like the Sequence-to-Sequence architecture. Specifically, we can have three patterns like below:

  • Source sequence pays attention to itself (Encoder’s self attention)
  • Target sequence pays attention to itself (Decoder’s self attention)
  • Target sequence pays attention to source sequence (same as Seq2Seq)

which is why the authors introduced three new terms: query, key and value.

  • Query: the one which pays attention
  • (Key, Value): the one to which attention is drawn. Key and value are exactly the same within this post.

So now we are ready to take a look at the Multi-Head Attention:

Figure 7: Multi-Head Attention (cut from paper)

Again, it looks pretty complicated, but in fact the idea is simple. The Scaled Dot-Product Attention is basically similar to Luong attention (with the dot score function), and we need to compute not one but many of them simultaneously. Sounds complicated, doesn’t it?
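For reference, the Scaled Dot-Product Attention from the paper is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

which is just the familiar dot-score attention, divided by the square root of the key size to keep the scores in a reasonable range.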

Well, I always like to think that creating Deep Learning models is no different than assembling Lego pieces. Always keeping track of the shapes is key. Let’s take a closer look inside the Multi-Head Attention:

Figure 8: Insight into Multi-Head Attention

Everything is crystal clear now. Let’s code!

Firstly, we will create a class named MultiHeadAttention. The number of attention heads is controlled by h and remember that we must create separate Dense (Linear) layers for each head.
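Below is a sketch of what that class could look like. The argument names (model_size, h) and the choice key_size = model_size // h follow the paper, but the exact code may differ from the original snippet:

```python
import tensorflow as tf

class MultiHeadAttention(tf.keras.Model):
    def __init__(self, model_size, h):
        super(MultiHeadAttention, self).__init__()
        self.key_size = model_size // h    # each head works in a smaller subspace
        self.h = h
        # separate Dense (Linear) layers for query, key and value, one per head
        self.wq = [tf.keras.layers.Dense(self.key_size) for _ in range(h)]
        self.wk = [tf.keras.layers.Dense(self.key_size) for _ in range(h)]
        self.wv = [tf.keras.layers.Dense(self.key_size) for _ in range(h)]
        # final projection back to model_size after the heads are concatenated
        self.wo = tf.keras.layers.Dense(model_size)
```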

Next, we will implement the logic within the Multi-Head Attention. Let’s take a look at the special case first: One-Head (which is similar to Luong dot-score attention):
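As a sketch (with inputs that have already been projected by the per-head Dense layers above), one head boils down to a scaled dot product followed by a softmax:

```python
import tensorflow as tf

def one_head_attention(query, key, value, key_size):
    """query: (batch, query_len, key_size); key, value: (batch, value_len, key_size)."""
    score = tf.matmul(query, key, transpose_b=True)        # (batch, query_len, value_len)
    score /= tf.math.sqrt(tf.cast(key_size, tf.float32))   # scale by sqrt(d_k)
    alignment = tf.nn.softmax(score, axis=-1)              # attention weights
    return tf.matmul(alignment, value)                     # (batch, query_len, key_size)
```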

We need to compute h attention heads simultaneously this time, and the most straightforward way is to use a for loop. Don’t forget the additional Dense layers as illustrated above. Here is the Multi-Head version:
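A sketch of the forward pass, continuing the MultiHeadAttention class defined above (query comes from the side that pays attention, value from the side being attended to):

```python
    # (inside the MultiHeadAttention class defined above)
    def call(self, query, value):
        # query: (batch, query_len, model_size), value: (batch, value_len, model_size)
        heads = []
        for i in range(self.h):
            score = tf.matmul(self.wq[i](query), self.wk[i](value), transpose_b=True)
            score /= tf.math.sqrt(tf.cast(self.key_size, tf.float32))
            alignment = tf.nn.softmax(score, axis=2)             # (batch, query_len, value_len)
            heads.append(tf.matmul(alignment, self.wv[i](value)))

        heads = tf.concat(heads, axis=2)     # concatenate back to (batch, query_len, model_size)
        return self.wo(heads)
```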

And the Multi-Head Attention is done. Now we have all the ingredients we need. Let’s go ahead and create the Encoder!

The Encoder

We are now able to create our Encoder. Let’s visualize the data flow inside the Encoder first:

Figure 9: Data Shapes inside the Encoder

Specifically, the Encoder consists of:

  • One Embedding layer
  • One or more layers, each of which contains:
    • One Multi-Head Attention block
    • One Normalization layer for Attention block
    • One Feed-Forward Network block
    • One Normalization layer for FFN block

With that, let’s go ahead and define the Encoder class:
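Here is a sketch of the constructor. The hidden size of the Feed-Forward Network (4 * model_size, as in the paper) and the attribute names are my assumptions:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, model_size, num_layers, h):
        super(Encoder, self).__init__()
        self.model_size = model_size
        self.num_layers = num_layers
        self.h = h
        self.embedding = tf.keras.layers.Embedding(vocab_size, model_size)
        # per layer: one attention block, one FFN block and their normalization layers
        self.attention = [MultiHeadAttention(model_size, h) for _ in range(num_layers)]
        self.attention_norm = [tf.keras.layers.BatchNormalization() for _ in range(num_layers)]
        self.dense_1 = [tf.keras.layers.Dense(model_size * 4, activation='relu') for _ in range(num_layers)]
        self.dense_2 = [tf.keras.layers.Dense(model_size) for _ in range(num_layers)]
        self.ffn_norm = [tf.keras.layers.BatchNormalization() for _ in range(num_layers)]
```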

Here you might be wondering: Shouldn’t we be using the LayerNorm thing? Well, when I started this project, LayerNorm hadn’t been implemented yet. Rather than creating one on my own, I decided to go with BatchNormalization and it worked just fine. You will see in a second.

Now it’s time for the forward pass. Starting with the Embedding layer, we will then add the positional encoding to its output (position-wise):

Next, we will implement the layers of Multi-Head Attention and Feed-Forward Network. For the Multi-Head Attention, we will loop over the positions of the input sequence and compute each position’s context vector against the full-length sequence:

Then we have a residual connection, followed by a Normalization layer (out = LayerNorm(out + in)):

And we are done with the Multi-Head Attention. The Feed-Forward Network coming next is much more straightforward since it contains only Dense layers. Let’s do the same: compute the output, add the residual connection and normalize the result:

Below is the complete forward pass of the Encoder:
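Putting those pieces together, the whole forward pass of the simple Encoder could look like the sketch below. It assumes a global pes tensor holding the positional-encoding matrix from earlier:

```python
    # (inside the Encoder class defined above)
    def call(self, sequence):
        # embedding + positional encoding, added one position at a time (simple version)
        embed_out = []
        for i in range(sequence.shape[1]):
            embed = self.embedding(tf.expand_dims(sequence[:, i], axis=1))
            embed_out.append(embed + pes[i, :])
        sub_in = tf.concat(embed_out, axis=1)          # (batch, length, model_size)

        for i in range(self.num_layers):
            # self-attention, computed one query position at a time (simple version)
            sub_out = []
            for j in range(sub_in.shape[1]):
                attention = self.attention[i](
                    tf.expand_dims(sub_in[:, j, :], axis=1), sub_in)
                sub_out.append(attention)
            sub_out = tf.concat(sub_out, axis=1)
            sub_out = self.attention_norm[i](sub_in + sub_out)     # residual + normalization

            # position-wise Feed-Forward Network
            ffn_out = self.dense_2[i](self.dense_1[i](sub_out))
            ffn_out = self.ffn_norm[i](sub_out + ffn_out)          # residual + normalization
            sub_in = ffn_out                                       # input to the next layer

        return ffn_out
```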

The Decoder

So, we have created the Encoder. We’ll go ahead and tackle the Decoder, which is very similar to the Encoder, except that … Well, let’s first take a look:

Figure 10: The Decoder

So basically, there is nothing that we haven’t covered yet. However, there are some differences to notice:

  • There are two Multi-Head Attention blocks in each layer: one for the target sequence and one for the Encoder’s output
  • The bottom Multi-Head Attention is masked

And as always, let’s visualize the data’s shapes inside the Decoder so that we can get ready to implement:

Figure 11: Data Shapes inside the Decoder

We can now create the Decoder class and define all the material we need:
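A sketch of the constructor, mirroring the Encoder (again, the attribute names and the 4 * model_size FFN size are assumptions):

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, model_size, num_layers, h):
        super(Decoder, self).__init__()
        self.model_size = model_size
        self.num_layers = num_layers
        self.h = h
        self.embedding = tf.keras.layers.Embedding(vocab_size, model_size)
        # bottom (masked) self-attention block per layer
        self.attention_bot = [MultiHeadAttention(model_size, h) for _ in range(num_layers)]
        self.attention_bot_norm = [tf.keras.layers.BatchNormalization() for _ in range(num_layers)]
        # middle attention block over the Encoder output per layer
        self.attention_mid = [MultiHeadAttention(model_size, h) for _ in range(num_layers)]
        self.attention_mid_norm = [tf.keras.layers.BatchNormalization() for _ in range(num_layers)]
        # position-wise Feed-Forward Network per layer
        self.dense_1 = [tf.keras.layers.Dense(model_size * 4, activation='relu') for _ in range(num_layers)]
        self.dense_2 = [tf.keras.layers.Dense(model_size) for _ in range(num_layers)]
        self.ffn_norm = [tf.keras.layers.BatchNormalization() for _ in range(num_layers)]
        # final projection to vocabulary logits
        self.dense = tf.keras.layers.Dense(vocab_size)
```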

Next, we will dive into the forward pass. The first step is similar to what we did in the Encoder: pass the sequences through the Embedding layer and add up the Positional Encoding information:

Then, we will have a for loop to create a bunch of layers as illustrated in Fig.11 above. In each layer, the first block is the Multi-Head Attention in which the target sequence draws attention to itself (self-attention). And as I mentioned above, this block needs to be masked. What does that mean?

Unlike the source sequence on the Encoder’s side, each token in the target sequence must not be trained to depend on its neighbors to the right. Think of the inference phase: we begin with the <start> token and predict word after word, right? There is going to be no hint on the right!

Implementing that masking mechanism is pretty easy. We just need to modify the code used in the Encoder a little bit:

Coming next is another Multi-Head Attention layer, in which the query is the output of the Multi-Head Attention above and the value is the output of the Encoder. This is what we normally do with the Seq2Seq architecture:

The last piece is the FFN layer, which is no different from the Encoder:

Oh, and don’t forget to use the last Dense layer to compute the final output:

Here is the full code of the Decoder’s forward pass:
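Here is a sketch of the whole thing, continuing the Decoder class above. The masking is realized by letting position j attend only to target positions 0..j:

```python
    # (inside the Decoder class defined above)
    def call(self, sequence, encoder_output):
        # embedding + positional encoding, same as in the Encoder
        embed_out = []
        for i in range(sequence.shape[1]):
            embed = self.embedding(tf.expand_dims(sequence[:, i], axis=1))
            embed_out.append(embed + pes[i, :])
        bot_sub_in = tf.concat(embed_out, axis=1)

        for i in range(self.num_layers):
            # masked self-attention: position j only sees target tokens 0..j
            bot_sub_out = []
            for j in range(bot_sub_in.shape[1]):
                values = bot_sub_in[:, :j + 1, :]
                attention = self.attention_bot[i](
                    tf.expand_dims(bot_sub_in[:, j, :], axis=1), values)
                bot_sub_out.append(attention)
            bot_sub_out = tf.concat(bot_sub_out, axis=1)
            bot_sub_out = self.attention_bot_norm[i](bot_sub_in + bot_sub_out)

            # attention over the Encoder output (query: target, value: source)
            mid_sub_in = bot_sub_out
            mid_sub_out = []
            for j in range(mid_sub_in.shape[1]):
                attention = self.attention_mid[i](
                    tf.expand_dims(mid_sub_in[:, j, :], axis=1), encoder_output)
                mid_sub_out.append(attention)
            mid_sub_out = tf.concat(mid_sub_out, axis=1)
            mid_sub_out = self.attention_mid_norm[i](mid_sub_in + mid_sub_out)

            # position-wise Feed-Forward Network
            ffn_out = self.dense_2[i](self.dense_1[i](mid_sub_out))
            ffn_out = self.ffn_norm[i](mid_sub_out + ffn_out)
            bot_sub_in = ffn_out                       # input to the next layer

        return self.dense(ffn_out)                     # logits over the target vocabulary
```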

And the two pieces of the Transformer are ready. Let’s test them out:
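A quick smoke test could look like the following. All hyperparameters and vocabulary sizes here are made-up values for this sanity check, except h=2 which is discussed below:

```python
import numpy as np
import tensorflow as tf

H = 2
NUM_LAYERS = 2
MODEL_SIZE = 128
MAX_LENGTH = 20          # only used for the positional-encoding table in this test

pes = tf.constant(positional_encoding(MAX_LENGTH, MODEL_SIZE))

encoder = Encoder(vocab_size=100, model_size=MODEL_SIZE, num_layers=NUM_LAYERS, h=H)
decoder = Decoder(vocab_size=120, model_size=MODEL_SIZE, num_layers=NUM_LAYERS, h=H)

source = tf.constant(np.random.randint(0, 100, size=(5, 10)))   # (batch, source_len)
target = tf.constant(np.random.randint(0, 120, size=(5, 12)))   # (batch, target_len)

encoder_output = encoder(source)
decoder_output = decoder(target, encoder_output)
print(encoder_output.shape)      # expected: (5, 10, 128)
print(decoder_output.shape)      # expected: (5, 12, 120)
```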

You can see that I’m setting the number of attention heads h=2. The reason is to test out the multi-head mechanism (h=1 would be sufficient for this tiny experiment). I also keep the number of layers low, as we are going to train on a tiny dataset (20 English-French pairs).

The code above should print out something like below:

Okay, the output shapes look good. Let’s now add the remaining pieces needed to conduct the experiment: overfitting the 20 pairs of English-French sentences.

Data Preparation

We will start with the data preparation:
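As a sketch, the whole preparation boils down to tokenizing, padding, and building a tf.data pipeline. The two toy pairs below are hypothetical stand-ins for the 20 sentences; I also rebuild the models and the positional-encoding table here so that they match the real vocabulary sizes:

```python
import tensorflow as tf

# hypothetical toy pairs standing in for the 20 English-French sentences
raw_data_en = ['<start> how are you ? <end>', '<start> i am fine . <end>']
raw_data_fr = ['<start> comment vas tu ? <end>', '<start> je vais bien . <end>']

en_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
en_tokenizer.fit_on_texts(raw_data_en)
data_en = en_tokenizer.texts_to_sequences(raw_data_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en, padding='post')

fr_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
fr_tokenizer.fit_on_texts(raw_data_fr)
data_fr = fr_tokenizer.texts_to_sequences(raw_data_fr)
data_fr = tf.keras.preprocessing.sequence.pad_sequences(data_fr, padding='post')

data_fr_in = data_fr[:, :-1]     # decoder input: everything but the last token
data_fr_out = data_fr[:, 1:]     # decoder target: everything but <start>

dataset = tf.data.Dataset.from_tensor_slices(
    (data_en, data_fr_in, data_fr_out)).shuffle(20).batch(5)

# rebuild the models and the positional-encoding table for the real vocabulary sizes
max_len = int(max(data_en.shape[1], data_fr.shape[1]))
pes = tf.constant(positional_encoding(max_len, MODEL_SIZE))
encoder = Encoder(len(en_tokenizer.word_index) + 1, MODEL_SIZE, NUM_LAYERS, H)
decoder = Decoder(len(fr_tokenizer.word_index) + 1, MODEL_SIZE, NUM_LAYERS, H)
```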

For step-by-step instructions, please refer to my previous blog post on NMT and the Luong attention mechanism.

Loss Function & Optimizer

The loss function also requires no modification from what we used before: SparseCategoricalCrossentropy with a mask to filter out padded tokens. And we are using the Adam optimizer with default settings:
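A sketch of both pieces (the masked-mean formulation is my choice; the key points are from_logits=True and reduction='none', so that we can mask the per-token losses ourselves):

```python
import tensorflow as tf

crossentropy = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')     # keep per-token losses so we can mask them

def loss_func(targets, logits):
    # ignore padded positions (token id 0) when averaging the loss
    mask = tf.cast(tf.math.not_equal(targets, 0), tf.float32)
    loss = crossentropy(targets, logits) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

optimizer = tf.keras.optimizers.Adam()
```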

Train Function

Again, the train_step function is pretty simple and similar to what we implemented before:
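Here is a sketch of it, assuming the shifted target_seq_in / target_seq_out split from the data-preparation step:

```python
def train_step(source_seq, target_seq_in, target_seq_out):
    with tf.GradientTape() as tape:
        encoder_output = encoder(source_seq)
        decoder_output = decoder(target_seq_in, encoder_output)
        loss = loss_func(target_seq_out, decoder_output)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss
```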

Predict Function

If you are getting bored and require something new, here it is. We have to make a small change to the predict function. What we used to do is as follows:

  1. Feed the <start> token into the model
  2. Take the last output as the predicted word
  3. Append the predicted word to the result
  4. Feed the predicted word and its associated state to the model and repeat step 2

We cannot do the same with the Transformer. Why? Because we lost the sequential mechanism, i.e. the state. Instead, here is how we are going to do it:

  1. Feed the <start> token into the model
  2. Take the last output as the predicted word
  3. Append the predicted word to the result
  4. Feed the entire result into the model and repeat step 2

Let’s see an example:

  1. Feed “<start>”
  2. The last word: “I”
  3. Feed “<start> I”
  4. The last word: “am”
  5. Feed “<start> I am”
  6. etc.
  7. Final result: “<start> I am Trung Tran . <end>”

And here is the code:
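A greedy-decoding sketch; the tokenizer names and the length cap are carried over from my earlier snippets and may differ from the original code:

```python
def predict(test_source_text=None):
    if test_source_text is None:
        test_source_text = raw_data_en[0]
    source_seq = tf.constant(en_tokenizer.texts_to_sequences([test_source_text]))

    encoder_output = encoder(source_seq)

    # start with <start> and keep feeding the whole partial result back in
    target_seq = tf.constant([[fr_tokenizer.word_index['<start>']]])
    out_words = []
    while True:
        decoder_output = decoder(target_seq, encoder_output)
        new_word = tf.argmax(decoder_output[:, -1, :], axis=-1, output_type=tf.int32)
        out_words.append(fr_tokenizer.index_word[new_word.numpy()[0]])

        # stop at <end> or when the result reaches the padded target length
        if out_words[-1] == '<end>' or len(out_words) >= data_fr.shape[1]:
            break
        target_seq = tf.concat([target_seq, tf.expand_dims(new_word, axis=0)], axis=-1)

    print(' '.join(out_words))
```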

Training loop

The final thing to do is to create the training loop as follows. We will train the model for 100 epochs and periodically print out the loss value as well as some translation results for monitoring purposes.
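A sketch of the loop (the print format and the every-10-epochs check are arbitrary choices of mine):

```python
import time

NUM_EPOCHS = 100
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    for source_seq, target_seq_in, target_seq_out in dataset:
        loss = train_step(source_seq, target_seq_in, target_seq_out)

    print('Epoch {} Loss {:.4f} Elapsed {:.2f}s'.format(
        epoch + 1, loss.numpy(), time.time() - start_time))

    if (epoch + 1) % 10 == 0:
        predict()        # translate one training sentence to monitor progress
```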

Looks like we have everything we need. Let’s start training!

In the early stages, the model could only print out meaningless phrases, which is absolutely normal.

But we don’t have to wait so long. Something cool started to appear after about 50 epochs:

The model kept learning until the 100th epoch (we can tell by the loss value). Let’s have the model translate all 20 training sentence pairs:

The Transformer worked as expected and it only took ~ 80 epochs to overfit the tiny dataset, whereas the vanilla Seq2Seq needed ~ 250 epochs to do the same thing. Attention’s power is now confirmed! The model still made some weird translations and we will soon know the reason why.

So that was how we should interpret the paper and implement a quick-and-dirty version of the Transformer. In the next section, let’s see what we can do to improve the model’s performance.

Enhance the simple Transformer

RNNs (which the Seq2Seq model is made of) are known to be inefficient on GPUs because of their sequential nature. The Transformer, on the other hand, consists mainly of matrix multiplications, which means it is supposed to be (super) fast on GPUs.

At this point, we have already created a working Transformer which can overfit the tiny dataset. But it is slow, and we should not plug the full training dataset in yet. Let’s see what we can do to improve our model.

Improve the Encoder

Let’s start by taking a look at the Encoder. There are two bottlenecks within its forward pass that slow things down: the for loop that feeds the Embedding layer and adds the positional encoding one position at a time, and the for loop that computes attention for each position separately.

The first one is easy. The second one is a bit trickier.

In fact, that for loop was there to give us a good understanding of how attention is being computed and to make things easier to debug. With the current implementation of the Multi-Head Attention layer, we can just stuff the whole sequence in without affecting the result. We can rewrite the Encoder’s forward pass as follows:
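Here is a sketch of the vectorized forward pass. I already include a padding_mask argument (defaulting to None), which will only become meaningful after the masking change described in the next subsections:

```python
    # (inside the Encoder class, replacing the loop-based call)
    def call(self, sequence, padding_mask=None):
        # embed the whole batch at once and add the positional encoding in one shot
        embed_out = self.embedding(sequence)
        embed_out += pes[:sequence.shape[1], :]

        sub_in = embed_out
        for i in range(self.num_layers):
            # the whole sequence attends to itself in a single call
            sub_out = self.attention[i](sub_in, sub_in, padding_mask)
            sub_out = self.attention_norm[i](sub_in + sub_out)

            ffn_out = self.dense_2[i](self.dense_1[i](sub_out))
            ffn_out = self.ffn_norm[i](sub_out + ffn_out)
            sub_in = ffn_out

        return ffn_out
```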

Are we done with the Encoder yet? Well, there is one more change we should make regarding the padded zeros.

As you already know, we have to append zeros to make all sequences equal in length. Those padded tokens basically have no meaning, and having our model (accidentally) pay attention to them is not a good thing at all. We haven’t done anything about it so far, and that is what caused the weird translation results.

The solution is to use a mask, which is 0 at padded tokens and 1 anywhere else.
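Creating it is a one-liner; for a batch of source sequences padded with zeros, a sketch:

```python
# 1 where there is a real token, 0 where the sequence was padded
padding_mask = 1 - tf.cast(tf.equal(source_seq, 0), dtype=tf.float32)
# e.g. [[3, 7, 5, 0, 0]]  ->  [[1., 1., 1., 0., 0.]]
```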

Because we need to use the above mask in the Decoder, we will create it later inside the train_step function.

Now that we have a mask, let’s modify the Multi-Head Attention to adopt that change.

Add masking to Multi-Head Attention

The modification needed to apply the masking mechanism is pretty simple. Essentially, we want the masked positions to become 0 after applying softmax (i.e. zero attention). We can achieve that by assigning them an extremely large negative score before the softmax. Simple math, right?

The new forward pass of the Multi-Head Attention is as follows:
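A sketch of the updated call. The mask is expected to be broadcastable against the score tensor of shape (batch, query_len, value_len), and the additive -1e9 trick is one common way to realize the “large negative value” idea:

```python
    # (inside the MultiHeadAttention class, replacing the previous call)
    def call(self, query, value, mask=None):
        heads = []
        for i in range(self.h):
            score = tf.matmul(self.wq[i](query), self.wk[i](value), transpose_b=True)
            score /= tf.math.sqrt(tf.cast(self.key_size, tf.float32))

            if mask is not None:
                # masked (0) positions get a huge negative score,
                # so the softmax pushes their attention weights to ~0
                score += (1.0 - mask) * -1e9

            alignment = tf.nn.softmax(score, axis=2)
            heads.append(tf.matmul(alignment, self.wv[i](value)))

        heads = tf.concat(heads, axis=2)
        return self.wo(heads)
```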

Improve the Decoder

In order to speed up the Decoder, we will basically do the same thing as we did with the Encoder, which is to get rid of the inefficient for loops.

Before diving into the code, let us recap a little bit. Do you remember that the Decoder has two Multi-Head Attention blocks per layer, of which the bottom one does not allow any token to pay attention to its right-hand side? Since we now compute attention for the whole sequence at once, we are going to need a mask:

That mask is very easy to implement. You can do it in a Pythonic way like this:
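A sketch, with a toy sequence length just to illustrate (row j has ones only up to column j):

```python
seq_len = 5     # toy length, just to illustrate

look_left_only_mask = tf.constant(
    [[1.0] * (j + 1) + [0.0] * (seq_len - j - 1) for j in range(seq_len)])
```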

Or you can use the built-in TensorFlow function tf.linalg.band_part:
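For instance (same seq_len as above):

```python
# lower-triangular matrix of ones (diagonal included)
look_left_only_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
```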

Both will produce a mask like above. Feel free to use the one that you like. And here is the new forward pass of the Decoder:
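Again a sketch: here I build the look-ahead mask inside the call with band_part, while the source padding mask comes in from the train_step function:

```python
    # (inside the Decoder class, replacing the loop-based call)
    def call(self, sequence, encoder_output, padding_mask=None):
        embed_out = self.embedding(sequence)
        embed_out += pes[:sequence.shape[1], :]

        # (target_len, target_len) lower-triangular mask for the bottom attention
        look_left_only_mask = tf.linalg.band_part(
            tf.ones((sequence.shape[1], sequence.shape[1])), -1, 0)

        bot_sub_in = embed_out
        for i in range(self.num_layers):
            # masked self-attention over the whole target sequence at once
            bot_sub_out = self.attention_bot[i](bot_sub_in, bot_sub_in, look_left_only_mask)
            bot_sub_out = self.attention_bot_norm[i](bot_sub_in + bot_sub_out)

            # attention over the Encoder output, masked at padded source positions
            mid_sub_out = self.attention_mid[i](bot_sub_out, encoder_output, padding_mask)
            mid_sub_out = self.attention_mid_norm[i](bot_sub_out + mid_sub_out)

            ffn_out = self.dense_2[i](self.dense_1[i](mid_sub_out))
            ffn_out = self.ffn_norm[i](mid_sub_out + ffn_out)
            bot_sub_in = ffn_out

        return self.dense(ffn_out)
```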

Next, as mentioned above, we need to modify the train_step function to generate a padding mask for the source sequence:
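A sketch of the change. The mask itself stays (batch_size, source_len); I simply give it an extra query axis when handing it to the models so that it broadcasts against the (batch, query_len, value_len) scores:

```python
def train_step(source_seq, target_seq_in, target_seq_out):
    # 1 for real source tokens, 0 for padded positions: (batch_size, source_len)
    padding_mask = 1 - tf.cast(tf.equal(source_seq, 0), dtype=tf.float32)

    with tf.GradientTape() as tape:
        encoder_output = encoder(source_seq, tf.expand_dims(padding_mask, axis=1))
        decoder_output = decoder(target_seq_in, encoder_output,
                                 tf.expand_dims(padding_mask, axis=1))
        loss = loss_func(target_seq_out, decoder_output)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss
```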

Now we are ready to train. Don’t forget to re-initialize the model to apply the changes we made:
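For instance, reusing the tokenizers and hyperparameters from before:

```python
encoder = Encoder(len(en_tokenizer.word_index) + 1, MODEL_SIZE, NUM_LAYERS, H)
decoder = Decoder(len(fr_tokenizer.word_index) + 1, MODEL_SIZE, NUM_LAYERS, H)
```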

We can see that the new model took approximately 3.76s for one epoch, whereas the old model needed 4.31s. It did cut out approximately 13% of the training time (I did this experiment on Colab so the numbers are not always the same).

Improve the Multi-Head Attention

Can we push it further? The answer is YES. There is one for loop left and it lies within the MultiHeadAttention class.

However, this one may not be as simple as the ones we tackled above. The difference is:

  • Above: the for loop merely split one matrix multiplication, (B, L1, M) x (B, M, L2), into L1 smaller ones of shape (B, 1, M) x (B, M, L2)
  • This time: we genuinely have H separate matrix multiplications, each of shape (B, L1, M) x (B, M, L2)

In fact, it is simpler than it sounds. The tf.matmul function treats every leading axis as a batch axis, so it retains not only the 0th (batch) axis but also any axes that are not directly involved in the dot product.

What I mean by that is: if we manage to have matrices of shapes (B, H, L1, M) and (B, H, M, L2) respectively, calling tf.matmul only once will produce exactly the same result as the for loop. We do need to be extremely careful with the reshaping and transposing, though. Here is the code illustrating the idea:
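Below is one way the reworked layer could look. Note that in this sketch the per-head Dense layers are merged into single Dense layers of size model_size, so that all heads can be projected, reshaped and multiplied in one go; tf.shape is used instead of .shape so the batch dimension may stay unknown:

```python
import tensorflow as tf

class MultiHeadAttention(tf.keras.Model):
    def __init__(self, model_size, h):
        super(MultiHeadAttention, self).__init__()
        self.model_size = model_size
        self.key_size = model_size // h
        self.h = h
        # one big projection each for Q, K and V instead of h small ones
        self.wq = tf.keras.layers.Dense(model_size)
        self.wk = tf.keras.layers.Dense(model_size)
        self.wv = tf.keras.layers.Dense(model_size)
        self.wo = tf.keras.layers.Dense(model_size)

    def call(self, query, value, mask=None):
        batch_size = tf.shape(query)[0]

        def split_heads(x):
            # (batch, length, model_size) -> (batch, h, length, key_size)
            x = tf.reshape(x, (batch_size, -1, self.h, self.key_size))
            return tf.transpose(x, perm=[0, 2, 1, 3])

        q = split_heads(self.wq(query))     # (batch, h, query_len, key_size)
        k = split_heads(self.wk(value))     # (batch, h, value_len, key_size)
        v = split_heads(self.wv(value))     # (batch, h, value_len, key_size)

        # one matmul computes the scores of every head at once
        score = tf.matmul(q, k, transpose_b=True)             # (batch, h, query_len, value_len)
        score /= tf.math.sqrt(tf.cast(self.key_size, tf.float32))

        if mask is not None:
            # mask must be broadcastable to (batch, h, query_len, value_len)
            score += (1.0 - mask) * -1e9

        alignment = tf.nn.softmax(score, axis=-1)
        context = tf.matmul(alignment, v)                     # (batch, h, query_len, key_size)
        context = tf.transpose(context, perm=[0, 2, 1, 3])    # (batch, query_len, h, key_size)
        context = tf.reshape(context, (batch_size, -1, self.model_size))
        return self.wo(context)
```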

To apply the change above, we need to make a small modification to the train_step function. Essentially, the masks used in the Multi-Head Attention layer must be broadcastable to the score’s shape, which is (batch_size, H, query_len, value_len). The look_left_only_mask has the shape (query_len, value_len), so there is no problem at all, whereas the padding_mask’s shape is currently (batch_size, value_len), so we need to explicitly turn it into (batch_size, 1, 1, value_len):
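Here is a sketch of the adjusted train_step (the mask reshaping is the only difference from the previous version):

```python
def train_step(source_seq, target_seq_in, target_seq_out):
    padding_mask = 1 - tf.cast(tf.equal(source_seq, 0), dtype=tf.float32)  # (batch, source_len)
    # the scores are now (batch, h, query_len, value_len), so the padding mask
    # needs two extra axes: (batch, source_len) -> (batch, 1, 1, source_len)
    mask = padding_mask[:, tf.newaxis, tf.newaxis, :]

    with tf.GradientTape() as tape:
        encoder_output = encoder(source_seq, mask)
        decoder_output = decoder(target_seq_in, encoder_output, mask)
        loss = loss_func(target_seq_out, decoder_output)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss
```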

We are now ready to run the training process again and we shall see that everything is still working as before:

You might notice little to no improvement in the training speed. But believe me, once you plug in the real training data and use the full 8-head attention setup, the new MultiHeadAttention implementation will become a game changer!

Last but not least, let’s check out the translation result of all 20 training source sentences. A deep learning engineer should always be skeptical of everything:
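With the real 20-pair list in place of my toy raw_data_en, something as simple as this will do:

```python
for sentence in raw_data_en:
    predict(sentence)
```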

And that is that! Everything worked flawlessly!

Final words

So we have finished implementing our own Transformer entirely from scratch. We started off by creating a quick-and-dirty version while interpreting the paper, then moved on to improve the model’s performance so that it is ready for the full training data. If we are cautious and take one baby step at a time, we can tackle literally any paper we are interested in. Believe me.

You can find all the code related to this post below:

  • Colab notebook for this post: link
  • Colab notebook for full training, including attention heatmap visualization: link
  • Source code for full training data: link

Feel free to play with the code and give me some feedback; I would appreciate that. Thank you all for reading, and we’re gonna see each other again very shortly.

Reference

  • Attention is all you need: link
  • Stanford NLP group’s material on Transformer: link

Trung Tran is a Deep Learning Engineer working in the car industry. His main daily job is to build deep learning models for autonomous driving projects, which vary from 2D/3D object detection to road scene segmentation. After office hours, he works on his personal projects, which focus on Natural Language Processing and Reinforcement Learning. He loves to write technical blog posts, which help spread his knowledge/experience to those who are struggling. Less pain, more gain.

30 comments On Create The Transformer With Tensorflow 2.0

  • Thanks for a great post. But, I have a simple question which is confusing me being new to Dl. How to save the model, say in Google Colab as this method doesn’t use a Tensorflow session and probably follows along Eager execution. I guess that train_step() is handling the training process, and suppose predict() is to be called on new test data, is there a suitable way to save the model/something similar so that it doesn’t have to be trained again from scratch? If you can suggest a way, it would be very helpful as I have searched widely for the method to save without a Session, but couldn’t find a simple solution. Thanks!

    • Trung Tran

      Hi Ramesh,

      You can save the weights directly to Google Drive and restore from that for inference.

      Of course, you will need to mount your Drive into Colab first. But that’s pretty simple.

      You can see a concrete example here: https://stackoverflow.com/questions/49031798/when-i-use-google-colaboratory-how-to-save-image-weights-in-my-google-drive
      (See the answer from Tadej Magajna)

      Cheers,
      Trung

      • Hi Trung,
        Thanks for your reply. This is interesting, I could now mount the Drive like this:-
        from google.colab import drive
        drive.mount('/content/drive')

        However, the answer you mention by Tadej Magajna on SO unfortunately doesn’t go into details of saving the weights and restoring that from inference. Can you tell a simple way to do this, I mean save the weights, restore the latter for using predict() without requiring training from scratch? I regularly follow your posts like on Seq2Seq and this one on transformer etc., so would really appreciate a standard way of doing this for the models which do not use the sessions in Tensorflow. Thanks.

        • Trung Tran

          Hi Ramesh,

      Thank you for following my blog for a while. I really appreciate that 🙂

          Back to the Colab & Drive thing, I don’t see any problems here (or maybe I don’t understand your question correctly).

          I think you can treat the mounted drive as a normal directory, and tell the model to save weights to that.
          For example: you can do something like model.save_weights('/gdrive/My Drive/your_model_name') during training. So your weights are kept permanently on your Drive.
          For inference, you can either download the weights from where you stored and do it locally, or you can infer directly on Colab (the Drive is still mounted and you know exactly where the weights are, right?)

          Hope this helps.
          Trung

          • I see, this is how it works. In the Transformer post you have provided the way to save weights for the encoder-decoder so i was looking for something similar for the Seq2Seq model. Also, one of my observations is that the Transformer is taking more time to train than Seq2Seq for my data. Is this possible? Cheers

          • Trung Tran

            Hi Ramesh,

            If you want a complete source base, take a look here: https://github.com/ChunML/NLP/tree/master/chatbot. My chatbot project is more polished (still work-in-progress) with separate training/test scripts.

            About the training time though, it depends on the network’s settings so I can barely say anything.

            Regards,
            Trung

  • Hi there,
    Thanks for a great post. Can you tell how to stop training to prevent overfitting in this case? I mean we are not using a validation loss estimate are we? I am confused on where to stop training as the loss becomes 0 if I train for more than a few hundred epochs. Can you give an example to introduce early stopping in this code? Thank you.

  • Hi I have Few questions
    1:- What is modelsize :- is it the embedding size for each word in text which is 512 in paper?
    2:- Why for keysize you are dividing modelsize by number of attention heads :- cant we use any keysize ?
    3:- the only constraint is keysize = querysizie so that their dot product is possible?
    4:- query_len and value_len .they should be same right? . as they represent the length of words in a sequence

  • Hey ,
    I wanna say thanks . your kernel helped me a lot in understanding mathematics of transformer model. now i am able to write the complete model by myself

  • Hi,

    first of all, thank you for this article, it helped me a lot in understanding how transformer works and, specifically, how self-attention is implemented.

    Just one comment, I think I found a small error: In the transformer_31.py snippet, first row is going to be all zeros and I think that’s not the idea, right? At least I tried that code and it wasn’t working properly until I changed it slightly:

    look_left_only_mask = tf.constant([[1] * (i + 1) + [0] * (seq_len - i - 1) for i in range(seq_len)], dtype=tf.float32)

    Many thanks again!

    • Trung Tran

      Hi,

      Thank you for reading.

      Great question. Personally, I think it is unnecessary to include the current word in key & value tensors, since we can still have access to that piece of information thanks to the residual connection later on.

      In fact, I have tried both of them and gained slightly better results if the current word is masked out too.

  • Also, one question: does it make sense to multiply the look_left_only_mask and the padding mask in the decoder?

  • There is an error to your code on the train step when I try to run it says this:
    “TypeError: Failed to convert object of type to Tensor. Contents: [Dimension(None), -1, 8, 16]. Consider casting elements to a supported type.”

    Do you know how to fix this?

  • Hi dude,
    Thank you for the great post!

    I’m struggling to save the transformer model weights. Is there any way to save it just like you saved the encoder and decoder attentions weights and loaded back for training process in your previous post?
    I’m trying to use the saved weights to use it on another dataset.

    Many thanks.

  • Hi! Great post!

    I was able to run the entire project here, but realized that the prediction implementation takes a long time to execute if MAX_LENGTH is large and has several batches to run. (really long time)

    Do you have any suggestions to optimize, or if anyone has already done so?

    • Trung Tran

      Hi Arthur,

      As far as I know, it’s the trade off of having no states among steps, which results in the long inference time. You may want to have a look at newest papers on transformers (I’m not pretty much keeping up with the trends lately).

      • Hi again!

        I had some other questions about loss and a possible accuracy function.

        What is the “from_logit” parameter? is necessary? I saw it in other posts, but if I used it, the value of the loss will always be high and unique (maybe I’m missing something)… Another parameter is the reduction=”none” .. Can you explain their use?

        In the accuracy function, a reshape is used first.. but maybe needs a mask like in the loss function?

        y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH))
        accuracy = tf.metrics.SparseCategoricalAccuracy()(y_true, y_pred)
        return accuracy

        Thanks a lot!

        • Trung Tran

          HI Arthur,

          1. from_logit
          Basically, the raw output from a network is called logit, without being applied softmax or sigmoid.
          Why does it matter? Because the loss function I used will not apply softmax on the input by default, so if you pass in the raw output of the network, you must tell the loss function that.

          2. reduction=”none”
          You can think of logit as a multidimensional vector, say (batch, step, depth), which means that the loss function will compute a loss of shape (batch, step). By default, the loss function will add it up and return you the just the sum (a scalar). That’s not what we wanted, which is why we have to tell it not to do that.

          3. accuracy
          I don’t think using merely the default accuracy metric is a good idea. Translation task is evaluated considering not only syntactic but also semantic aspect. You should read about some well-known metrics such as BLEU score.

          • Great explanation!
            In 3. i just try to make a simple text corrector, but ok, got it!

            Thank you again!

  • maybe some mistake in the function “positional_embedding” in this post.
    line 5 and 7
    PE[:, :, i] —> PE[:, i]
    the code in your github repo is PE[:, i].
    🙂

  • Hi,
    I really like this transformer implementation, its clean and simple to navigate.

    I think it would be really useful, if function def predict(test_source_text=None) could take a batch of inputs. For example function for training [def train_step] takes batches of inputs, so its quite fast, it would be amazing if you did the same for prediction.

    Because as any transformer its quite slow, and the execution time is the same for prediction 100 sentences or 1 sentence on GPU.

    Anyways, great work.

    • Trung Tran

      Hi Robert,

      Sorry I’m late. Thank you for your words.
      However, I don’t know whether it’s a good idea to infer 100 sentences at once.
      Of course, we can stuff them in one matrix, but since each sentence will have different length eventually (when they hit their end token), we would need to loop the final result again to clean each sentence.

      Anyway, it’s just my thought. Any feedback on that is welcome 😀

  • Hi! I liked everything very much. But why the SoftMax function is missing in the decoder implementation?

    • Trung Tran

      Hi Sergey,

      Nothing is missing my friend. The loss function will take care of applying softmax. The network only needs to compute the raw output, i.e. logits.

  • Hi Trung Tran

    Beautiful blog, this has greatly helped my work as a developing ML engineer. I have a cheeky question… How would one go about incorporating the predictions at each t-step with a beam search decoder?

    Kind regards,
    Dean
