What is Soft vs Hard Attention Model in Computer Vision?
Before beginning this article on soft attention and hard attention, the author commends the readers' resolve: having worked to make their models more robust and better-performing, they are now looking at a technique that we humans use every day and that has since been replicated for machines. Attention models help a machine focus on the important things (as the name suggests, being attentive to the most crucial parts of an image) and then extract features to deliver the final predictions. Read through this article to understand the intuition and theory behind attention modeling, the difference between soft and hard attention models, and how to implement them.
Introduction
In computer vision, a significant challenge is working with large images. The difficulty lies not in the file size in bytes but in the spatial extent of the image. To get the most out of a convolutional neural network, information must be extracted from the image as effectively as possible.
To tackle this problem, a huge image can be broken down into parts so that, instead of analyzing the image as a whole, features can be examined region by region, letting the model predict the label more accurately.
In the famous image-captioning example, a convolutional neural network extracts the features of the image, which are then fed into a recurrent neural network that predicts the best possible caption. The recurrent neural network here looks at the image as a whole, and this works if the image has visibly distinct subjects (for example, a water bottle on a table). However, in a busy scene, such as a shot of the popular Indian Kumbh Mela, it is hard to recognize every part of the image at a single glance or by looking at it as a whole.
In the recurrent neural network model, the decoder is initialized with a hidden state. At each step, a distribution over the vocabulary is calculated from the previous step's hidden state and is used to predict the next word. The idea now is to not only use what was learned at the previous step but also to look at a different portion of the image at every step.
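To make this concrete, here is a minimal sketch of the plain, attention-free decoder step described above, assuming PyTorch; the class name `PlainCaptionDecoder` and the parameter sizes are purely illustrative and not from the original article.

```python
import torch
import torch.nn as nn

class PlainCaptionDecoder(nn.Module):
    """Attention-free decoder: one global image feature initializes the hidden state."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)    # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn_cell = nn.GRUCell(embed_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def init_hidden(self, image_feature):
        # image_feature: (batch, feat_dim) single global CNN feature of the whole image
        return torch.tanh(self.init_h(image_feature))

    def step(self, prev_word, h):
        # Use only the previous word and the previous hidden state to score the next word.
        h = self.rnn_cell(self.embed(prev_word), h)
        logits = self.to_vocab(h)                        # distribution over the vocabulary
        return logits, h
```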
Soft Attention Model with RNN
Consider an RGB image with height H and width W. When a CNN alone is applied to this image before feeding the features to the RNN, spatial information is lost and the whole image is represented by a single 1-D feature vector. In the attention model this changes: the image is represented by a 2-D grid of feature vectors, one per spatial location, and this grid is fed to the RNN. Earlier, our RNN received only the previous hidden state and that single image vector. Now we feed in the spatial feature grid, and at every step the model computes a distribution over the grid locations, i.e., over localized portions of the image.
This distribution is combined with the features of each localized portion to produce weighted features, which, together with the previous hidden state, become the basis for the next hidden state. The distribution is computed over every grid cell, and its values sum to 1. The weighted sum (dot product) of the feature grid with this distribution gives the soft attention output "z."
Because z is a smooth, weighted combination of the features, the derivative dz/dp is well defined and forms the basis of the gradient descent that our model will use.
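A minimal sketch of this computation, assuming PyTorch and a CNN feature grid flattened to L locations with D channels each (the names `features` and `scores` are illustrative, not from the original):

```python
import torch
import torch.nn.functional as F

def soft_attention(features, scores):
    """
    features: (L, D) grid of local CNN features (L = H*W locations).
    scores:   (L,)  unnormalized relevance of each location, computed from the previous hidden state.
    Returns the attention distribution p (sums to 1) and the weighted feature z.
    """
    p = F.softmax(scores, dim=0)                  # distribution over grid locations, sums to 1
    z = (p.unsqueeze(1) * features).sum(dim=0)    # weighted sum of the feature grid = soft attention z
    return p, z

# z is a smooth function of p, so dz/dp exists and ordinary back-propagation applies.
```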
Hard Attention Model with RNN
Continuing the image-captioning example above, in hard attention a single location is selected as the basis of the model. Because this selection is a discrete, stochastic choice, the derivative dz/dp is zero (or undefined) almost everywhere, and ordinary gradient descent is rendered ineffective. For this case, reinforcement learning is more suitable.
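For contrast, here is a minimal sketch of the hard-attention choice under the same assumptions as the soft-attention sketch above: exactly one grid cell is sampled, so the selection step blocks ordinary back-propagation and a REINFORCE-style (reinforcement learning) gradient estimate is needed instead.

```python
import torch

def hard_attention(features, scores):
    """
    features: (L, D) grid of local CNN features.
    scores:   (L,)  unnormalized relevance of each location.
    Picks exactly one location; the sampling step is not differentiable.
    """
    p = torch.softmax(scores, dim=0)
    idx = torch.multinomial(p, num_samples=1)   # stochastic choice of a single grid cell
    z = features[idx.item()]                    # only that cell's feature is used
    log_prob = torch.log(p[idx])                # kept for a REINFORCE-style gradient estimate
    return z, log_prob
```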
Soft Attention vs. Hard Attention
| Soft Attention | Hard Attention |
|---|---|
| The analysis is performed over different sub-regions, each weighted by the attention distribution. | The analysis is performed on a single selected sub-region. |
| The soft attention model is deterministic and differentiable. | The hard attention model is stochastic: the sub-region is sampled at random. |
| Soft attention uses gradient descent and back-propagation, making it easier to implement. | Hard attention relies on stochastic methods such as Monte Carlo sampling and reinforcement learning, making it less popular. |
Use Cases of Attention Models
Attention models are used wherever it is crucial to assess the environment/image/text and single out the essential portion of the information that will be used to recognize the image, text, or speech.
- Text processing: Attention can be used in models that work with text. Some examples include translation, summary generation, building a QnA model from textbooks, etc.
- Image recognition: As in the example used in this article, attention can be applied to image captioning, and also to real-time visual recognition such as CCTV footage and self-driving cars.
- Speech recognition: Attention models can also be applied to speech recognition, focusing on specific portions of the audio to obtain a thorough and accurate analysis of the words being spoken. Virtual assistants can be made more precise with attention models.
Implementing Attention Models
Attention Model = RNN + Encoder-Decoder Sequence-to-Sequence Model
The attention mechanism weights the inputs so that only the features relevant to the classification/prediction output contribute strongly.
After loading, cleaning, and preprocessing the dataset, initialize the model. Model initialization requires setting its parameters; it is assumed that readers are familiar with this process, and it is highly recommended to read up on the importance of model parameters during initialization if you are not. Extract features from the image and pass them through an encoder; these encoded features will then be decoded into a caption.
The decoder for this problem can use pure sampling, top-k sampling, greedy search, or beam search.
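As an illustration, a greedy-search decoding loop might look like the sketch below, assuming PyTorch; `encoder`, `decoder`, `decoder.init_hidden`, and `decoder.step` are hypothetical stand-ins for whatever the reader's own implementation provides.

```python
import torch

def greedy_caption(encoder, decoder, image, start_id, end_id, max_len=20):
    """Greedy search: at every step keep only the single most likely word."""
    features = encoder(image)                  # assumed: (L, D) grid of encoded image features
    h = decoder.init_hidden(features)          # assumed helper: initial hidden state from the features
    word = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        logits, h = decoder.step(word, h, features)  # attention over `features` happens inside step
        word = logits.argmax(dim=-1)                 # greedy choice; top-k or beam search would branch here
        if word.item() == end_id:
            break
        caption.append(word.item())
    return caption
```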
The evaluation metric for this model will be BLEU (Bilingual Evaluation Understudy). The BLEU score assesses how closely a generated text matches one or more reference texts; we use it in our image-captioning model to judge the quality of a generated caption. It is relatively easy to use and yields a score between 0 and 1.
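For instance, assuming NLTK is installed, the sentence-level BLEU score of a generated caption can be computed as follows (the example captions are made up purely for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu

# Reference captions (tokenized) against which the generated caption is scored.
references = [["a", "man", "rides", "a", "horse"],
              ["a", "person", "is", "riding", "a", "horse"]]
candidate = ["a", "man", "is", "riding", "a", "horse"]

score = sentence_bleu(references, candidate)   # value between 0 and 1; higher is better
print(f"BLEU: {score:.3f}")
```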
Conclusion
Drawing the contrast between how humans understand their environment and how our attention models do, we see that a better description of an image can be given by focusing on the critical details it presents. Attention models help recognize those crucial features, enabling the model to predict accurate labels for the image and produce better captions. In this article, we covered soft and hard attention as used with convolutional neural networks, their use cases, and their key differences.
Readers are encouraged to try implementing the attention models and to compare the accuracy of a plain RNN model with that of an RNN model with attention. For further reading, see the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. This paper is a benchmark for much of the work undertaken in image captioning, object detection, and attention models.
Happy Learning!