# Experiments

Aggregated experiments and results.

## Geometric Scene Similarity

### Simple Shape Counting

tl;dr: Perhaps somewhat unsurprisingly, the model learns to count via token-number association if we restrict its vocabulary to $$N$$ tokens and the number of objects on screen to $$[1, N]$$.

All experiments use a language defined by {seq_len=1, vocab_size=10}; a dataset of 64x64x3 images with outline and rotation enabled and {min_shapes=1, max_shapes=10}; and a model trained for 1000 batches of 256 samples with a DLSM architecture (see specifics in the linked full results).
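The shared setup can be summarized as a config sketch. The key names below are illustrative stand-ins, not the actual keys used by the codebase:

```python
# Hypothetical configuration mirroring the shared experimental setup.
# Field names are illustrative; the real codebase may use different ones.
language_config = {"seq_len": 1, "vocab_size": 10}

dataset_config = {
    "image_size": (64, 64, 3),  # H x W x C
    "outline": True,
    "rotation": True,
    "min_shapes": 1,
    "max_shapes": 10,
}

training_config = {
    "num_batches": 1000,
    "batch_size": 256,
    "architecture": "DLSM",
}
```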

Variation 1: Single Shape Counting. There is only one shape (square) and one color (red). Reaches 0.106625 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10.

Full results here.

Variation 2: Varied Shape Counting. There are three shapes (circle, square, triangle) and one color (red). Reaches 0.267393 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10. As expected, there is some error for larger numbers of objects due to overlap.

Full results here.

Variation 3: General Object Counting. There are three shapes (circle, square, triangle) and three colors (red, green, blue). Reaches 0.340383 BCE. Each of the 10 tokens becomes somewhat reliably associated with a certain number of objects from 1 to 10. As in Variation 2, there is more error in exact counting for large object counts, but counting is also imperfect for some smaller object counts.

Full results here.

### Pushing the Limits of Language

What is the relationship between a combination of permitted {vocabulary size, sequence length} and the performance?

Preliminary findings:

• A vocabulary of 4 tokens seems to be the minimum for decent performance (with sequence length held fixed).
• Increasing the sequence length can actually have deleterious effects.
• The model generally performs well when the vocabulary size is large and the sequence length is small.
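A sweep over {vocab_size, seq_len} combinations could be organized along these lines. The grid values and the `train_and_eval` call are placeholders for the actual training loop, not the experiment's real settings:

```python
from itertools import product

# Hypothetical sweep values; the actual experiments may use different ranges.
vocab_sizes = [2, 4, 8, 16, 32]
seq_lens = [1, 2, 4, 8]

def train_and_eval(vocab_size, seq_len):
    # Placeholder: train the DLSM model with this language configuration
    # and return its final BCE on held-out scenes.
    raise NotImplementedError

grid = list(product(vocab_sizes, seq_lens))
# results = {(v, s): train_and_eval(v, s) for v, s in grid}
```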

### Progressive Language Expansion

Idea: begin with a very simple setup (e.g. just blue squares), then slowly introduce new attributes (e.g. blue squares, triangles, circles; then all combinations of {blue, red, green} and {squares, triangles, circles}) and observe whether the language is retained and how it adapts to new environmental stimuli.
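One way to encode the proposed curriculum is as an ordered list of stages, each widening the set of attribute values the dataset may sample from. This is a sketch of the idea, not existing code:

```python
# Hypothetical curriculum for progressive language expansion.
# Each stage lists the attribute values the dataset may sample from.
curriculum = [
    {"shapes": ["square"], "colors": ["blue"]},
    {"shapes": ["square", "triangle", "circle"], "colors": ["blue"]},
    {"shapes": ["square", "triangle", "circle"],
     "colors": ["blue", "red", "green"]},
]

def num_attribute_combinations(stage):
    """Number of distinct (shape, color) pairs available in a stage."""
    return len(stage["shapes"]) * len(stage["colors"])
```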

### Alec Mode

Vanilla Alec Mode

Strong Alec Mode

### Out of Distribution Prediction

• The network seems to be capable of generating novel tokens.

### Complicating the Generation Unit

• Using LSTMs vastly outperforms using GRUs. To use LSTMs, set the initial cell state to the image latent vector $$z$$ (the output of the visual unit) and the initial hidden state to a random vector (‘sampling’ speech).
• Using Bidirectional LSTMs yields slightly better performance. Still need to test impact on quality of generated words.
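The state initialization described above can be sketched with a single NumPy LSTM step. The weight shapes, gate ordering, and dimensions are illustrative; a real implementation would use a framework LSTM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step with gates ordered [input, forget, output, candidate]."""
    H = h.shape[0]
    gates = W @ x + U @ h + b          # shape (4H,)
    i = sigmoid(gates[:H])
    f = sigmoid(gates[H:2 * H])
    o = sigmoid(gates[2 * H:3 * H])
    g = np.tanh(gates[3 * H:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
H, X = 16, 8                     # hidden size, token-embedding size (illustrative)
z = rng.normal(size=H)           # image latent from the visual unit
c0 = z                           # cell state carries the scene information
h0 = rng.normal(size=H)         # random hidden state: 'sampling' speech
W = rng.normal(size=(4 * H, X)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)

x0 = rng.normal(size=X)          # e.g. a start-of-sequence embedding
h1, c1 = lstm_step(x0, h0, c0, W, U, b)
```

Because the cell state is the image latent while the hidden state is random, repeated sampling can yield different utterances for the same scene.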

### Complicating the Visual Unit

• Increasing the number of convolutional layers helps improve performance.
• Capsule networks are too complicated and not worth it, as the capsule network takes over the task that the language is supposed to perform. Maybe a good benchmark, though.

### Softmax-Argmax Quantizer

• Using the Softmax-Argmax sampler with a double-LSTM speaker and listener performs as well as using the VQ-VAE-style quantizer. There is a slightly larger number of unique generated sequences. Still need to test language quality.
• The softmax-argmax quantizer reaches near-perfect performance on a 3-shape-type, 3-color, 3-object Alec mode task.
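The forward pass of such a quantizer can be sketched in NumPy: softmax gives a relaxed distribution over tokens, and argmax snaps each position to a one-hot code. In training the gradient would typically be passed straight through the softmax; that backward path is framework-specific and omitted here:

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_argmax_quantize(logits):
    """Map per-position logits of shape (seq_len, vocab_size) to one-hot tokens."""
    probs = softmax(logits)
    tokens = probs.argmax(axis=-1)
    one_hot = np.eye(logits.shape[-1])[tokens]
    # Straight-through version in a framework: one_hot + probs - stop_grad(probs),
    # so the forward pass is discrete but gradients flow through the softmax.
    return tokens, one_hot

logits = np.array([[0.1, 2.0, -1.0],
                   [3.0, 0.0, 0.5]])
tokens, one_hot = softmax_argmax_quantize(logits)
```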

Success!

### Sparsity Restraint on Visual Unit

Constrain the visual unit's output to be somewhat sparse.
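One simple way to impose such a restraint is an L1 penalty on the visual unit's latent, added to the training loss. The penalty weight below is an illustrative hyperparameter, not a value from the experiments:

```python
import numpy as np

def l1_sparsity_penalty(z, weight=1e-3):
    """Mean absolute activation of the latent, scaled by a penalty weight."""
    return weight * np.abs(z).mean()

# Toy latent: half the units are exactly zero, the rest are small.
z = np.array([0.0, 0.5, -0.5, 0.0])
penalty = l1_sparsity_penalty(z, weight=0.1)
```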
