AI & Data Science · 2026 · MSc Data Science @ UTS

Image Captioning with CNN + LSTM

Two PyTorch image captioning architectures trained on the VizWiz dataset, comparing a lightweight MobileNetV3 + LSTM baseline against a GoogLeNet + LSTM with Luong spatial attention, evaluated with BLEU-1/2/3/4 and ROUGE-L metrics.

GitHub Repo

Built for the Deep Learning subject at UTS, this project required designing two image captioning architectures and evaluating them on the VizWiz validation set: 7,750 images taken by people who are blind, many of them blurry, poorly framed, or poorly lit.

Model 1 uses MobileNetV3-Small as the CNN encoder with a single-layer LSTM decoder, following the Show and Tell baseline (Vinyals et al., 2015). Encoder features are global-average-pooled to 576 dimensions and projected to 256. Partial unfreezing (features[9:12] trainable) lets the higher-level blocks adapt to the VizWiz domain. Beam search with k=3 is used at inference.

Model 2 replaces the encoder with GoogLeNet (Inception v1), exposing a full 7x7 spatial grid of 49 patches, each projected to 512 dimensions, and adds Luong general attention over those patches so the decoder can focus on relevant image regions at each decoding step.
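To make the attention step concrete, here is a minimal PyTorch sketch of Luong "general" attention over the 49 projected patches. It is an illustration, not the project's code: the class name, the 512-dimensional decoder hidden state, and the random tensors standing in for GoogLeNet features are all assumptions. Only the 49 x 512 patch shape comes from the description above.

```python
import torch
import torch.nn as nn


class LuongGeneralAttention(nn.Module):
    """Luong 'general' attention: score(h_t, f_i) = h_t^T W_a f_i."""

    def __init__(self, hidden_dim: int = 512, feature_dim: int = 512):
        super().__init__()
        # W_a maps each patch feature into the decoder's hidden space.
        self.W_a = nn.Linear(feature_dim, hidden_dim, bias=False)
        # W_c merges the context vector with the decoder state.
        self.W_c = nn.Linear(feature_dim + hidden_dim, hidden_dim, bias=False)

    def forward(self, hidden, features):
        # hidden:   (batch, hidden_dim), the decoder state at one step
        # features: (batch, num_patches, feature_dim)
        keys = self.W_a(features)                                  # (batch, patches, hidden)
        scores = torch.bmm(keys, hidden.unsqueeze(2)).squeeze(2)   # (batch, patches)
        weights = torch.softmax(scores, dim=1)                     # one weight per patch
        context = torch.bmm(weights.unsqueeze(1), features).squeeze(1)  # (batch, feature_dim)
        # Attentional hidden state, as in Luong et al.: tanh(W_c [c_t; h_t])
        attn_hidden = torch.tanh(self.W_c(torch.cat([context, hidden], dim=1)))
        return attn_hidden, weights


torch.manual_seed(0)
attention = LuongGeneralAttention()
patches = torch.randn(2, 49, 512)  # stand-in for the projected 7x7 grid (assumed input)
h_t = torch.randn(2, 512)          # stand-in for the LSTM hidden state (assumed size)
attn_hidden, weights = attention(h_t, patches)
print(attn_hidden.shape, weights.shape)
```

The weights form a distribution over the 49 patches that is recomputed at every decoding step, which is how the decoder can attend to different image regions for different words.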

Despite Model 2's stronger architecture, Model 1 outperformed it across all metrics due to a better-conditioned optimisation problem: Model 2 had 3.2x more trainable encoder parameters and overfitted by epoch 6 within the fixed 10-epoch budget. The performance gap narrowed consistently at higher n-gram orders (BLEU-4 gap: 0.006), suggesting the attention mechanism was working as intended but needed more training time to realise its full benefit.
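For readers unfamiliar with how BLEU-n depends on n-gram order, the sketch below computes an unsmoothed, sentence-level BLEU in plain Python. It is a simplified illustration, not the project's evaluation code: the function names and example captions are hypothetical, and reported scores are normally computed at corpus level with an established library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram is credited at most as often
    as it appears in the single reference that contains it most."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Unsmoothed BLEU-max_n with uniform weights and a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Use the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, for illustration only.
candidate = "a bottle of water on a table".split()
references = ["a bottle of water on the table".split(),
              "a water bottle sitting on a table".split()]
scores = [bleu(candidate, references, n) for n in (1, 2, 3, 4)]
print(scores)
```

Higher orders only reward longer runs of correctly ordered words, so BLEU-4 is more sensitive to phrase-level fluency than BLEU-1, which only checks word choice.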

Tech stack

Python · PyTorch · CNN · LSTM · BLEU