1. What is the main building block of a Transformer model?

2. Who introduced the Transformer architecture?

3. The original Transformer paper is titled:

4. Transformers replace RNNs primarily because:

5. What is “self-attention” in Transformers?

6. Multi-head attention allows the model to:

7. Positional encoding is needed in Transformers because:

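For reference, the original Transformer injects token-order information by adding fixed sinusoidal positional encodings to the embeddings, since self-attention by itself is permutation-invariant. A minimal NumPy sketch of that formula (function name and array sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need".

    Returns an array of shape (seq_len, d_model) that is added to the token
    embeddings so the model can use token order.
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(4, 8).round(3))
```
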
8. The encoder in a Transformer outputs:

9. The decoder in a Transformer is primarily used for:

10. BERT is a Transformer model trained using:

11. GPT models are primarily:

12. In multi-head attention, queries, keys, and values are:

13. The Transformer model uses which activation function in feed-forward layers?

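For reference, the feed-forward sub-layer in the original Transformer is two linear layers with a ReLU in between, applied to each position independently (many later variants use GELU instead). A minimal NumPy sketch with illustrative names and dimensions:

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, applied to each position independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU activation, as in the original paper
    return hidden @ w2 + b2

# Illustrative dimensions: d_model = 8, d_ff = 32, sequence of 4 tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = position_wise_ffn(x,
                        rng.normal(size=(8, 32)), np.zeros(32),
                        rng.normal(size=(32, 8)), np.zeros(8))
print(out.shape)  # (4, 8)
```
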
14. Layer normalization in Transformers is applied:

15. Transformer models are highly parallelizable because:

16. Vision Transformers (ViT) treat images as:

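For reference, a Vision Transformer splits the image into fixed-size patches (16×16 in the original ViT), flattens each patch, and feeds the resulting sequence to a standard Transformer encoder. A minimal NumPy sketch of that reshaping, with illustrative sizes:

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. a sequence of "tokens" a Transformer encoder can attend over.
    """
    h, w, c = image.shape
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    patches = grid.transpose(0, 2, 1, 3, 4)                   # group patch rows and columns
    return patches.reshape(-1, patch_size * patch_size * c)

# Illustrative: a 32x32 RGB image with 16x16 patches -> 4 patches of length 768.
img = np.zeros((32, 32, 3))
print(image_to_patches(img, 16).shape)  # (4, 768)
```
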
17. In Transformers, residual connections are used to:

18. The attention score in self-attention is computed using:

19. The softmax function in attention ensures:

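For reference, self-attention computes scores as Q·Kᵀ / √d_k between query and key projections of the same input, and a row-wise softmax turns each row of scores into non-negative weights that sum to 1. A minimal single-head NumPy sketch with illustrative names and dimensions:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head scaled dot-product attention.

    scores = Q @ K^T / sqrt(d_k); the row-wise softmax makes each row a
    probability distribution (non-negative, summing to 1) over the keys.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

# Illustrative: 4 tokens, d_model = 8; Q, K, V are linear projections of the same input.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```
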
20. Transformers have replaced RNNs in NLP mainly because:

21. Encoder-decoder Transformers are typically used for:

22. BERT uses which pretraining objective?

23. GPT models are trained using:

24. Cross-attention in the decoder allows:
