Experiments
Aggregated experiments and results
Table of Contents
Autoencoding
Vanilla Autoencoding
VQ-AE-GAN
Geometric Scene Similarity
Simple Shape Counting
tl;dr: Perhaps unsurprisingly, the model learns to count via token-number association when its vocabulary is restricted to \(N\) tokens and the number of objects on screen lies in \([1, N]\).
All experiments use a language defined by {seq_len=1, vocab_size=10}; a dataset defined by 64x64x3 images, outline and rotation enabled, and {min_shapes=1, max_shapes=10}; and a model trained for 1000 batches of 256 samples with a DLSM architecture (see specifics in the linked full results).
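For reference, the shared setup can be written down as a small configuration object. This is only a sketch: the names seq_len, vocab_size, min_shapes, and max_shapes come from the notes above, while the class name and the remaining field names are hypothetical and not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountingExperimentConfig:
    """Shared settings for the counting variations (field names partly hypothetical)."""

    # Language
    seq_len: int = 1
    vocab_size: int = 10
    # Dataset
    image_shape: tuple = (64, 64, 3)  # height, width, channels
    outline: bool = True
    rotation: bool = True
    min_shapes: int = 1
    max_shapes: int = 10
    # Training
    num_batches: int = 1000
    batch_size: int = 256

    def total_samples(self) -> int:
        """Number of samples seen during training."""
        return self.num_batches * self.batch_size

    def counts_fit_vocabulary(self) -> bool:
        """True if every possible object count can get its own message."""
        num_counts = self.max_shapes - self.min_shapes + 1
        return num_counts <= self.vocab_size ** self.seq_len


config = CountingExperimentConfig()
print(config.total_samples(), config.counts_fit_vocabulary())
```

With these defaults there are exactly as many single-token messages as possible object counts, which is the condition the tl;dr refers to.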
Variation 1: Single Shape Counting. There is only one shape (square) and one color (red). Reaches 0.106625 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10.
Full results here.
Variation 2: Varied Shape Counting. There are three shapes (circle, square, triangle) and one color (red). Reaches 0.267393 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10. As expected, there is some error for larger numbers of objects due to overlap.
Full results here.
Variation 3: General Object Counting. There are three shapes (circle, square, triangle) and three colors (red, green, blue). Reaches 0.340383 BCE. Each of the 10 tokens becomes somewhat reliably associated with a certain number of objects from 1 to 10. As in Variation 2, exact counting is less accurate for large numbers of shapes, but here counting is also imperfect for some smaller object counts.
Full results here.
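One way to check the token-number association claimed in the three variations is to tabulate, for each emitted token, which object counts it was used for. The sketch below is an assumption about how such a check could look, not the analysis used for the linked results; the function names and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def association_table(tokens, counts):
    """Map each token to a Counter of the object counts it was emitted for."""
    table = defaultdict(Counter)
    for token, num_objects in zip(tokens, counts):
        table[token][num_objects] += 1
    return table


def token_purity(table):
    """For each token, return (most common object count, share of its uses)."""
    purity = {}
    for token, counter in table.items():
        num_objects, freq = counter.most_common(1)[0]
        purity[token] = (num_objects, freq / sum(counter.values()))
    return purity


# Made-up example: token ids emitted by the speaker and true object counts.
tokens = [3, 3, 7, 7, 7, 1]
counts = [1, 1, 2, 2, 5, 3]
purity = token_purity(association_table(tokens, counts))
for token, (num_objects, share) in sorted(purity.items()):
    print(f"token {token}: mostly {num_objects} objects ({share:.0%} of uses)")
```

A share close to 100% for every token corresponds to the "reliably associated" case; lower shares for tokens tied to large counts would match the overlap errors noted in Variations 2 and 3.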
Pushing the Limits of Language
What is the relationship between a combination of permitted {vocabulary size, sequence length} and the performance?
Preliminary findings:
- A vocabulary of 4 tokens seems to be the minimum for decent performance (with the sequence length held fixed).
- Increasing the sequence length can actually hurt performance.
- The model generally performs well when the vocabulary size is large and the sequence length is small.
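When comparing {vocabulary size, sequence length} combinations, it helps to keep in mind that a fixed-length language can express at most vocab_size ** seq_len distinct messages. The sketch below builds a sweep grid and groups configurations with equal capacity, so that a short-sequence and a long-sequence setup can be compared at the same expressive power. The grid values and function names are illustrative assumptions, not the settings actually run.

```python
from collections import defaultdict
from itertools import product


def num_messages(vocab_size, seq_len):
    """Number of distinct fixed-length messages the language can express."""
    return vocab_size ** seq_len


def group_by_capacity(vocab_sizes, seq_lens):
    """Group (vocab_size, seq_len) pairs by their message capacity."""
    groups = defaultdict(list)
    for vocab_size, seq_len in product(vocab_sizes, seq_lens):
        groups[num_messages(vocab_size, seq_len)].append((vocab_size, seq_len))
    return dict(groups)


# Illustrative grid only.
groups = group_by_capacity(vocab_sizes=[4, 10, 16], seq_lens=[1, 2])
for capacity in sorted(groups):
    print(capacity, groups[capacity])
```

Here {vocab_size=4, seq_len=2} and {vocab_size=16, seq_len=1} both allow 16 messages, which makes them a natural pair for testing whether longer sequences hurt at equal capacity.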
Progressive Language Expansion
Idea: begin with a very simple setup (e.g. just blue squares), then slowly introduce new attributes (e.g. blue squares, triangles, and circles; then all combinations of {blue, red, green} and {squares, triangles, circles}) and observe whether the language is retained and how it adapts to new environmental stimuli.
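The staged setup could be expressed as a simple curriculum, where each stage lists the allowed colors and shapes. This is a hypothetical sketch: the stage layout and function names are assumptions, and the third color is assumed to be green, as in Variation 3 of the counting experiments.

```python
from itertools import product

# Hypothetical curriculum; the third color (green) is an assumption.
STAGES = [
    {"colors": ["blue"], "shapes": ["square"]},
    {"colors": ["blue"], "shapes": ["square", "triangle", "circle"]},
    {"colors": ["blue", "red", "green"], "shapes": ["square", "triangle", "circle"]},
]


def stage_objects(stage):
    """All (color, shape) combinations available in a stage."""
    return list(product(stage["colors"], stage["shapes"]))


def new_objects(stages, index):
    """Combinations that first appear at the given stage."""
    current = set(stage_objects(stages[index]))
    previous = set(stage_objects(stages[index - 1])) if index > 0 else set()
    return sorted(current - previous)


for i, stage in enumerate(STAGES):
    print(f"stage {i}: {len(stage_objects(stage))} objects, "
          f"{len(new_objects(STAGES, i))} new")
```

Tracking which combinations are new at each stage makes it possible to ask separately whether messages for old objects are retained and what messages the new objects receive.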
Alec Mode
Vanilla Alec Mode
Strong Alec Mode
Variable Length Sequences
Out of Distribution Prediction
- The network seems to be capable of generating novel tokens.
Complicating the Generation Unit
- LSTMs vastly outperform GRUs. To use LSTMs, set the initial cell state to the generated image latent vector \(z\) (output of the visual unit) and the initial hidden state to a random vector (‘sampling’ speech).
- Bidirectional LSTMs yield slightly better performance. The impact on the quality of generated words still needs to be tested.
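The LSTM initialization described in the first bullet above can be illustrated with a minimal, dependency-free LSTM step. This is a sketch under stated assumptions, not the project's implementation: the weights are random, the step input is a zero "start" vector, the random hidden state is drawn from a standard normal, and all function names are hypothetical. Only the state initialization (cell state from \(z\), hidden state random) comes from the notes.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_size, hidden_size, rng):
    """Random weights for the input (i), forget (f), candidate (g), output (o) gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: {"W": matrix(hidden_size, input_size + hidden_size),
                   "b": [0.0] * hidden_size} for gate in "ifgo"}


def lstm_step(params, x, h, c):
    """One standard LSTM step on plain Python lists."""
    xh = x + h  # concatenate input and previous hidden state

    def affine(gate):
        W, b = params[gate]["W"], params[gate]["b"]
        return [sum(w * v for w, v in zip(row, xh)) + bias for row, bias in zip(W, b)]

    i = [sigmoid(a) for a in affine("i")]
    f = [sigmoid(a) for a in affine("f")]
    g = [math.tanh(a) for a in affine("g")]
    o = [sigmoid(a) for a in affine("o")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def initial_state(z, rng):
    """Cell state is the image latent z; hidden state is random ('sampling' speech)."""
    c0 = list(z)
    h0 = [rng.gauss(0.0, 1.0) for _ in z]  # assumed distribution
    return h0, c0


rng = random.Random(0)
z = [0.5, -1.0, 0.25, 0.0]              # stand-in for the visual unit's output
params = make_lstm_params(input_size=3, hidden_size=len(z), rng=rng)
h0, c0 = initial_state(z, rng)
h1, c1 = lstm_step(params, [0.0, 0.0, 0.0], h0, c0)  # zero start input (assumed)
print(h1)
```

Because the hidden state is resampled on each call, the same latent \(z\) can produce different outputs, which is one reading of the ‘sampling’ remark.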
Complicating the Visual Unit
- Increasing the number of convolutional layers helps improve performance.
- Capsule network: too complicated and not worth it, since it takes over the entire task that the language is supposed to perform. It may still be a good benchmark.
Softmax-Argmax Quantizer
- The Softmax-Argmax sampler with a double-LSTM speaker and listener performs as well as the VQ-VAE-style quantizer, and produces a slightly larger number of unique generated sequences. Language quality still needs to be tested.
- The Softmax-Argmax quantizer reaches excellent performance on a 3-shape-type, 3-color, 3-object Alec mode task.
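The notes do not spell out the Softmax-Argmax quantizer, so the following is one plausible reading rather than the project's code: the forward pass turns a vector of logits over the vocabulary into a probability distribution with a softmax and then picks the most likely token as a one-hot vector. In a trained model the backward pass would typically use the softmax probabilities in place of the non-differentiable argmax (a straight-through estimator); that part is an assumption and is not shown here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_argmax(logits):
    """Forward pass only: return (token index, one-hot vector, probabilities)."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [1.0 if i == index else 0.0 for i in range(len(probs))]
    return index, one_hot, probs


index, one_hot, probs = softmax_argmax([0.1, 2.0, -1.0])
print(index, one_hot)
```

Unlike a VQ-VAE-style quantizer, this variant needs no codebook lookup: the token is read directly off the logits.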
Random Sampler
Gumbel-Softmax Sampler
Success!
Sparsity Restraint on Visual Unit
Output must be somewhat sparse.