A deep-dive into RoPE, and why it matters

What is positional encoding and why does it matter? When training any Transformer-based large language model, each input token sequence produces a $\text{seq\_len} \times \text{seq\_len}$ attention matrix in which positional information between tokens is not preserved natively; the matrix simply holds the attention scores between each pair of tokens (normalized by $\sqrt{\text{dim\_len}}$). To preserve positional information, so that the network learns "Dog attacks the Cat" differently from "Cat attacks the Dog", we add a deterministic encoding (one that remains the same throughout the network) to the embedding at each position. ...
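As a point of reference before the RoPE discussion, here is a minimal sketch of the additive deterministic encoding described above, using the sinusoidal formulation from the original Transformer paper; `seq_len` and `dim_len` are illustrative names, not the post's actual code.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, dim_len):
    """Deterministic encoding: the same values for a given position,
    independent of which tokens appear in the sequence."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(dim_len)[None, :]             # (1, dim_len)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / dim_len)
    angles = positions * angle_rates               # (seq_len, dim_len)
    pe = np.zeros((seq_len, dim_len))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

# Adding the encoding to the token embeddings injects position information
# before the attention scores are computed:
# token_embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, dim_len)
```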

July 13, 2025 · 6 min

Decoding Karpathy's min-char-rnn (character-level Recurrent Neural Network)

Recurrent Neural Networks (RNNs) have been around for a long time at this point, and RNNs without an attention mechanism (the plain, simple RNN architecture) are no longer the hottest thing either. Still, the RNN represents one of the first steps toward understanding training on sequential input, where the context of previous inputs is crucial for predicting the next output. Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks, along with the attached character-level RNN, is among the best resources for getting started with RNNs. However, as I worked my way through the implementation of min-char-rnn, I realized that while the blog post suffices for an intuitive understanding of RNNs, and the code walks through the implementation from scratch, a lot of the heavy lifting, namely the manual backpropagation and the flow of gradients that updates the model's weights and parameters during training, is left for readers to understand on their own. ...
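For context, the recurrence at the heart of min-char-rnn is a single hidden-state update per character. Below is a minimal sketch of one forward step (the sizes are illustrative, the parameter names follow the usual min-char-rnn conventions, and the backward pass the post focuses on is omitted).

```python
import numpy as np

# Illustrative sizes; in min-char-rnn these come from the training text and config.
vocab_size, hidden_size = 65, 100

# Parameters (conventional names: Wxh, Whh, Why, bh, by).
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def step(x_index, h_prev):
    """One forward step: consume one character, carry the hidden state forward."""
    x = np.zeros((vocab_size, 1))
    x[x_index] = 1                               # one-hot encode the input character
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)     # hidden state mixes input and history
    y = Why @ h + by                             # unnormalized scores for the next character
    p = np.exp(y) / np.sum(np.exp(y))            # softmax over the vocabulary
    return p, h

# Usage: start from a zero hidden state and feed characters one at a time.
h = np.zeros((hidden_size, 1))
p, h = step(x_index=0, h_prev=h)                 # p is a distribution over the next character
```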

October 5, 2024 · 7 min