1. What is the main building block of a Transformer model?

2. Who introduced the Transformer architecture?

3. The original Transformer paper is titled:

4. Transformers replace RNNs primarily because:

5. What is “self-attention” in Transformers?

6. Multi-head attention allows the model to:

7. Positional encoding is needed in Transformers because:

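For reference, the original Transformer injects token-order information by adding fixed sinusoidal positional encodings to the embeddings, since self-attention by itself is permutation-invariant. A minimal NumPy sketch of that formula (function name and array sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need".

    Returns an array of shape (seq_len, d_model) that is added to the token
    embeddings so the model can use token order.
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(4, 8).round(3))
```
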
8. The encoder in a Transformer outputs:

9. The decoder in a Transformer is primarily used for:

10. BERT is a Transformer model trained using:

11. GPT models are primarily:

12. In multi-head attention, queries, keys, and values are:

13. The Transformer model uses which activation function in feed-forward layers?

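For reference, the feed-forward sub-layer in the original Transformer is two linear layers with a ReLU in between, applied to each position independently (many later variants use GELU instead). A minimal NumPy sketch with illustrative names and dimensions:

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, applied to each position independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU activation, as in the original paper
    return hidden @ w2 + b2

# Illustrative dimensions: d_model = 8, d_ff = 32, sequence of 4 tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = position_wise_ffn(x,
                        rng.normal(size=(8, 32)), np.zeros(32),
                        rng.normal(size=(32, 8)), np.zeros(8))
print(out.shape)  # (4, 8)
```
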
14. Layer normalization in Transformers is applied:

15. Transformer models are highly parallelizable because:

16. Vision Transformers (ViT) treat images as:

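For reference, a Vision Transformer splits the image into fixed-size patches (16×16 in the original ViT), flattens each patch, and feeds the resulting sequence to a standard Transformer encoder. A minimal NumPy sketch of that reshaping, with illustrative sizes:

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. a sequence of "tokens" a Transformer encoder can attend over.
    """
    h, w, c = image.shape
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    patches = grid.transpose(0, 2, 1, 3, 4)                   # group patch rows and columns
    return patches.reshape(-1, patch_size * patch_size * c)

# Illustrative: a 32x32 RGB image with 16x16 patches -> 4 patches of length 768.
img = np.zeros((32, 32, 3))
print(image_to_patches(img, 16).shape)  # (4, 768)
```
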
17. In Transformers, residual connections are used to:

18. The attention score in self-attention is computed using:

19. The softmax function in attention ensures:

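For reference, self-attention computes scores as Q·Kᵀ / √d_k between query and key projections of the same input, and a row-wise softmax turns each row of scores into non-negative weights that sum to 1. A minimal single-head NumPy sketch with illustrative names and dimensions:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head scaled dot-product attention.

    scores = Q @ K^T / sqrt(d_k); the row-wise softmax makes each row a
    probability distribution (non-negative, summing to 1) over the keys.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

# Illustrative: 4 tokens, d_model = 8; Q, K, V are linear projections of the same input.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```
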
20. Transformers have replaced RNNs in NLP mainly because:

21. Encoder-decoder Transformers are typically used for:

22. BERT uses which pretraining objective?

23. GPT models are trained using:

24. Cross-attention in the decoder allows:
