# Experiments

Aggregated experiments and results.

## Geometric Scene Similarity

### Simple Shape Counting

tl;dr: Perhaps somewhat unsurprisingly, the model learns to count via token-number association if we restrict its vocabulary to $$N$$ tokens and the number of objects on screen to $$[1, N]$$.

All experiments use a language defined by {seq_len=1, vocab_size=10}; a dataset of 64x64x3 images with outline and rotation enabled and {min_shapes=1, max_shapes=10}; and a model trained for 1000 batches of 256 samples with a DLSM architecture (see specifics in the linked full results).
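The shared setup can be summarized as a config sketch. The key names below are illustrative stand-ins, not the actual keys used by the codebase:

```python
# Hypothetical configuration mirroring the shared experimental setup.
# Field names are illustrative; the real codebase may use different ones.
language_config = {"seq_len": 1, "vocab_size": 10}

dataset_config = {
    "image_size": (64, 64, 3),  # H x W x C
    "outline": True,
    "rotation": True,
    "min_shapes": 1,
    "max_shapes": 10,
}

training_config = {
    "num_batches": 1000,
    "batch_size": 256,
    "architecture": "DLSM",
}
```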

Variation 1: Single Shape Counting. There is only one shape (square) and one color (red). Reaches 0.106625 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10.

Full results here.

Variation 2: Varied Shape Counting. There are three shapes (circle, square, triangle) and one color (red). Reaches 0.267393 BCE. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10. As expected, there is some error for larger numbers of objects due to overlap.

Full results here.

Variation 3: General Object Counting. There are three shapes (circle, square, triangle) and three colors (red, green, blue). Reaches 0.340383 BCE. Each of the 10 tokens becomes somewhat reliably associated with a certain number of objects from 1 to 10. As in Variation 2, there is more error in exact counting for large object counts, but counting is also imperfect for some smaller object counts.

Full results here.

### Pushing the Limits of Language

What is the relationship between a combination of permitted {vocabulary size, sequence length} and the performance?

Preliminary findings:

• A vocabulary of 4 tokens seems to be the minimum for decent performance (with sequence length held fixed).
• Increasing the sequence length can actually have deleterious effects.
• The model generally performs well when the vocabulary size is large and the sequence length is small.
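A sweep over {vocab_size, seq_len} combinations could be organized along these lines. The grid values and the `train_and_eval` call are placeholders for the actual training loop, not the experiment's real settings:

```python
from itertools import product

# Hypothetical sweep values; the actual experiments may use different ranges.
vocab_sizes = [2, 4, 8, 16, 32]
seq_lens = [1, 2, 4, 8]

def train_and_eval(vocab_size, seq_len):
    # Placeholder: train the DLSM model with this language configuration
    # and return its final BCE on held-out scenes.
    raise NotImplementedError

grid = list(product(vocab_sizes, seq_lens))
# results = {(v, s): train_and_eval(v, s) for v, s in grid}
```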

### Progressive Language Expansion

Idea: begin with a very simple setup (e.g. just blue squares), then slowly introduce new attributes (e.g. blue squares, triangles, circles; then all combinations of {blue, red, green} and {squares, triangles, circles}) and observe whether the language is retained and how it adapts to new environmental stimuli.
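One way to encode the proposed curriculum is as an ordered list of stages, each widening the set of attribute values the dataset may sample from. This is a sketch of the idea, not existing code:

```python
# Hypothetical curriculum for progressive language expansion.
# Each stage lists the attribute values the dataset may sample from.
curriculum = [
    {"shapes": ["square"], "colors": ["blue"]},
    {"shapes": ["square", "triangle", "circle"], "colors": ["blue"]},
    {"shapes": ["square", "triangle", "circle"],
     "colors": ["blue", "red", "green"]},
]

def num_attribute_combinations(stage):
    """Number of distinct (shape, color) pairs available in a stage."""
    return len(stage["shapes"]) * len(stage["colors"])
```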

### Alec Mode

Vanilla Alec Mode

Strong Alec Mode

### Out of Distribution Prediction

• The network seems to be capable of generating novel tokens.

### Complicating the Generation Unit

• Using LSTMs vastly outperforms using GRUs. To use LSTMs, set the initial cell state to the image latent vector $$z$$ (the output of the visual unit) and the initial hidden state to a random vector (‘sampling’ speech).
• Using Bidirectional LSTMs yields slightly better performance. Still need to test impact on quality of generated words.
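The state initialization described above can be sketched with a single NumPy LSTM step. The weight shapes, gate ordering, and dimensions are illustrative; a real implementation would use a framework LSTM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step with gates ordered [input, forget, output, candidate]."""
    H = h.shape[0]
    gates = W @ x + U @ h + b          # shape (4H,)
    i = sigmoid(gates[:H])
    f = sigmoid(gates[H:2 * H])
    o = sigmoid(gates[2 * H:3 * H])
    g = np.tanh(gates[3 * H:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
H, X = 16, 8                     # hidden size, token-embedding size (illustrative)
z = rng.normal(size=H)           # image latent from the visual unit
c0 = z                           # cell state carries the scene information
h0 = rng.normal(size=H)         # random hidden state: 'sampling' speech
W = rng.normal(size=(4 * H, X)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)

x0 = rng.normal(size=X)          # e.g. a start-of-sequence embedding
h1, c1 = lstm_step(x0, h0, c0, W, U, b)
```

Because the cell state is the image latent while the hidden state is random, repeated sampling can yield different utterances for the same scene.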

### Complicating the Visual Unit

• Increasing the number of convolutional layers helps improve performance.
• Capsule networks are too complicated and not worth it, as the capsule network takes over the task that the language is supposed to perform. Maybe a good benchmark, though.

### Softmax-Argmax Quantizer

• Using the Softmax-Argmax sampler with a double-LSTM speaker and listener performs as well as using the VQ-VAE-style quantizer. There is a slightly larger number of unique generated sequences. Still need to test language quality.
• The softmax-argmax quantizer reaches near-perfect performance on a 3-shape-type, 3-color, 3-object Alec mode task.
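The forward pass of such a quantizer can be sketched in NumPy: softmax gives a relaxed distribution over tokens, and argmax snaps each position to a one-hot code. In training the gradient would typically be passed straight through the softmax; that backward path is framework-specific and omitted here:

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_argmax_quantize(logits):
    """Map per-position logits of shape (seq_len, vocab_size) to one-hot tokens."""
    probs = softmax(logits)
    tokens = probs.argmax(axis=-1)
    one_hot = np.eye(logits.shape[-1])[tokens]
    # Straight-through version in a framework: one_hot + probs - stop_grad(probs),
    # so the forward pass is discrete but gradients flow through the softmax.
    return tokens, one_hot

logits = np.array([[0.1, 2.0, -1.0],
                   [3.0, 0.0, 0.5]])
tokens, one_hot = softmax_argmax_quantize(logits)
```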

Success!

### Sparsity Restraint on Visual Unit

Constrain the visual unit's output to be somewhat sparse.
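One simple way to impose such a restraint is an L1 penalty on the visual unit's latent, added to the training loss. The penalty weight below is an illustrative hyperparameter, not a value from the experiments:

```python
import numpy as np

def l1_sparsity_penalty(z, weight=1e-3):
    """Mean absolute activation of the latent, scaled by a penalty weight."""
    return weight * np.abs(z).mean()

# Toy latent: half the units are exactly zero, the rest are small.
z = np.array([0.0, 0.5, -0.5, 0.0])
penalty = l1_sparsity_penalty(z, weight=0.1)
```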
