Understanding Attention Mechanisms – Part 2: Comparing Encoder and Decoder Outputs

Source: DEV Community
In the previous article, we explored the main idea of attention and the modifications it requires in an encoder–decoder model. Now we will explore that idea further.

An encoder–decoder model can be as simple as an embedding layer attached to a single LSTM; if we want a more advanced encoder, we can stack additional LSTM cells. We initialize the long-term and short-term memories of the encoder's LSTMs with zeros. If the input sentence we want to translate into Spanish is "Let's go", we feed a 1 for "Let's" into the embedding layer, unroll the network, and then feed a 1 for "go" into the embedding layer. This process creates the context vector, which we use to initialize a separate set of LSTM cells in the decoder.

All of the input is compressed into the context vector. But the idea of attention is that each step in the decoder should have direct access to the inputs. So let's understand how attention connects the inputs to each step of the decoder. In this example, t
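The encoder described above can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the dimensions are tiny, the weights are random, and names like `lstm_step` and the token ids for "Let's" and "go" are my own choices for the example. The key points from the text are all here: memories start at zero, each token passes through the embedding layer, and the final `(h, c)` pair is the context vector handed to the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One unrolled LSTM step: update long-term (c) and short-term (h) memory."""
    d = h.shape[0]
    z = W @ x + U @ h + b                      # all four gate pre-activations at once
    i = sigmoid(z[:d])                         # input gate
    f = sigmoid(z[d:2*d])                      # forget gate
    o = sigmoid(z[2*d:3*d])                    # output gate
    g = np.tanh(z[3*d:])                       # candidate memory
    c = f * c + i * g                          # long-term memory
    h = o * np.tanh(c)                         # short-term memory
    return h, c

vocab, emb_dim, hid = 2, 4, 3                  # toy sizes: "Let's" -> id 0, "go" -> id 1
embedding = rng.normal(size=(vocab, emb_dim))  # random, untrained embedding layer
W = rng.normal(size=(4*hid, emb_dim))
U = rng.normal(size=(4*hid, hid))
b = np.zeros(4*hid)

h, c = np.zeros(hid), np.zeros(hid)            # memories initialized with zeros
encoder_outputs = []
for token_id in [0, 1]:                        # feed "Let's", then unroll and feed "go"
    h, c = lstm_step(embedding[token_id], h, c, W, U, b)
    encoder_outputs.append(h)

context = (h, c)                               # context vector: initializes the decoder's LSTMs
```

Note that we also keep `encoder_outputs`, the hidden state after every input token; without attention the decoder only ever sees `context`, which is exactly the compression problem the text describes.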
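To make "direct access to the inputs" concrete, here is a minimal sketch of one common form of attention, dot-product attention, with made-up numbers standing in for the encoder's per-token hidden states. The decoder's current hidden state scores each encoder output for similarity, the scores pass through a softmax, and the weights form a weighted sum of the encoder outputs that the decoder can use at that step.

```python
import numpy as np

def attend(decoder_h, encoder_outputs):
    """Dot-product attention: score every encoder output against the decoder state."""
    scores = encoder_outputs @ decoder_h          # one similarity score per input token
    weights = np.exp(scores - scores.max())       # softmax, numerically stable
    weights /= weights.sum()                      # attention weights sum to 1
    context = weights @ encoder_outputs           # weighted sum of encoder outputs
    return context, weights

encoder_outputs = np.array([[0.2, -0.5,  0.1],    # hidden state after "Let's" (illustrative)
                            [0.7,  0.3, -0.2]])   # hidden state after "go"   (illustrative)
decoder_h = np.array([0.5, 0.1, -0.3])            # current decoder hidden state

context, weights = attend(decoder_h, encoder_outputs)
```

Because the weights are recomputed from the decoder state at every step, each decoder step gets its own view of the inputs instead of relying on a single fixed context vector.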