
Meeting Notes

Agendas, notes, and ideas from project meetings

Table of Contents

  1. Meeting 8, 5/26/22
  2. Meeting 7, 5/19/22
  3. Meeting 6, 5/11/22
  4. Meeting 5, 5/4/22
  5. Meeting 4, 4/27/22
  6. Meeting with Han, 4/21/22
  7. Meeting 3, 4/20/22
  8. Meeting 2, 4/13/22
  9. Meeting 1, 4/6/22

Meeting 8, 5/26/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 5:30 PM. Note in Discord if you do not have building access, since CSE2 locks at 5:00 PM; a member will give you access. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Get significant work done on writing paper, assign sections and deadlines


Approximate deadline to get the paper ready for submission: end of July.

Things we want to try

  • Disentanglement penalty term
  • Fix up PyTorch model to use sampling
  • Procedurally generated shapes (optional)
  • More rigorous analysis of sequences
  • Code up relational dataset - try that out.

Stuff to work on:

  • Yegor - getting the PyTorch model up to speed and incorporating various novel developments.
  • Andre - start writing section 1, Introduction and Contexts
  • Alec - start writing section 2, Benchmark Task

Meeting 7, 5/19/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 3:30 PM. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Work to complete poster for Discovering AI @ UW event.

Meeting 6, 5/11/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 5:30 PM. Note in Discord if you do not have building access, since CSE2 locks at 5:00 PM; a member will give you access. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Updates on significance of language and quantization (~20 min max)
  3. Updates on visual unit design (~20 min max)
  4. Updates on variable-length vocabulary size (~20 min max)
  5. Discussion on possible future tasks (ARC, Relational Scene Task) (~30 min max)


  • Language is valuable, discretization is valuable
  • Distribution analysis is important
  • Red squares and blue circles
  • If OOD Acc > ID Acc, something is sus and we should determine what’s going on.
  • Color spec OOD accuracy
  • ConvNeXt Tiny architecture as visual unit
  • Paper - describe what we’ve learned so far

Next steps and tasks

  • Two camera perspectives of a scene, multiple objects - maybe reduce to the same problems?
  • Relational-geometric shape dataset
  • Procedurally generated shape challenge
  • Yegor - try to wrap up and perfect PyTorch version of things, then move on to alternate shape dataset where shapes are procedurally generated.
  • Summer - target setting up an environment.
  • RL groundedness - a word can have a sense, what is the meaning of a word? Also has a reference. Instead of focusing on groundedness - words have meaning/sense or/and reference.

RL Ideas

  • Intelligent swarm search - each agent needs to go to a certain location, they can see a small region around them and communicate location information to other agents. Agents can solve the problem just via random walk/systematic walk, but can solve the problem a lot faster with communication. Add landmarks to the scene.
  • Predator-prey system - optimize for collective food count in a world with predators and spawning food.
  • Global vs direct messaging

Meeting 5, 5/4/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 5:30 PM. Note in Discord if you do not have building access, since CSE2 locks at 5:00 PM; a member will give you access. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Progress updates on tasks (~10 min max)
  3. Alec-OOD Task (~10 min max)
  4. Novel language updates: LSTMs, Bidirectionality (~30 min max)
  5. Gumbel-Softmax quantizer (~30 min max)
  6. Additional ideas (sparse visual unit, etc.) (Remainder of time)


  • Important to pass in quantized vectors back into the recurrent unit?
  • Note - make OOD different label really accessible to either OOD or ID.
  • Visual unit - possibly finding too explicit things?
  • Possibility - mapping multiple scenes to the same sequence.
  • Standardized task - weak Alec mode, three shapes, RGB; OOD - red squares (1 to 3 numbers)
  • Sparse visual unit, nonlinear visual unit
  • Variable length - impose a cost.
  • Number of unique sequences.
  • Shrink down language by penalizing both vocabulary size and variable-length sequences.
  • Shrinking down vocabulary usage might be totally worthless



  • Andre - sparse visual unit, nonlinear visual unit, just visual unit
  • Yegor - finish OOD unit, add modern developments
  • etc.

Meeting 4, 4/27/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 5:30 PM. Note in Discord if you do not have building access, since CSE2 locks at 5:00 PM; a member will give you access. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Brief progress updates on tasks (~10 min max)
  3. Refocus on group project, direction, goals, and metrics (~30 min max)
  4. Discussion about variable length methods (~30 min max)
  5. Additional idea discussion and brainstorming (Remainder of time)

Outlining Goals and Current Phase

End of quarter - implementation and explanation of a DLSM model.

  • Avoid ‘build a model to __________’.
  • Goal: to build DLSM model variations to do well on mediocre-strength Alec mode.
    • Capsule network for visual unit
    • Very large or small vocabularies
    • Benchmark - compare against a model without language, pretrain the visual unit and freeze
    • Transformers and attention, LSTMs, more complicated language generation
  • Philosophically - is this the best way to generate language.
  • Structuring experiment parameters more systematically.
  • Dataset that is complicated enough to the point where it can’t do that.
  • Hexagon and star are added - can add
  • Increase image resolution to 100 by 100.
    • Mediocre-strength Alec mode - three colors, five shapes, between one and five objects.
  • Trying not Alec mode with very large sizes.
  • Trying out OOD experiments.
  • Add attention layer to language output

Future Task

  • Relationships dataset, force development of language in which relationships are necessary.
  • Look more into attention
  • Use mediocre-strength Alec mode.

Philosophical Interpretation of Language

  • Based on empirical performance, this is not necessarily an invalid language generation method.
  • Designing OOD experiments
  • Increasing the number of colors
  • Is variable length important? When working with fixed length, develops poor encodings.

Variable-Length Sequence

  • Offloads processing to the listener that we want to be instead integrated into the language.
  • How to deal with this?

VQ Problems

  • The vector quantizer is problematic - it more or less just does a similarity search, and concepts are embedded in the same space.
  • Use discretization with softmax directly out of the language generation unit and directly convert to labels.
  • Concrete distribution vs Straight-Through Estimator
    • Continuous and Discrete combination
    • Interpolate between a discrete and continuous distributions.
  • Sample from softmax instead of taking argmax of softmax.
  • ‘Soft snap’ vector quantization - probabilistically scale with distance squared
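A minimal sketch of the difference between taking the argmax of a softmax, sampling from the softmax, and the Concrete (Gumbel-Softmax) relaxation discussed above. It uses only the Python standard library rather than the project's PyTorch code; the function names and the example logits are illustrative assumptions, and the straight-through estimator is not shown because it needs an autograd framework.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_token(logits):
    """Deterministic choice: always the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(logits, rng):
    """Stochastic choice: draw a token index from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def gumbel_softmax(logits, temperature, rng):
    """Concrete relaxation: add Gumbel(0, 1) noise, then apply a tempered softmax.

    Low temperatures give nearly one-hot vectors; high temperatures give
    smoother, more continuous vectors.
    """
    noisy = []
    for x in logits:
        u = max(rng.random(), 1e-12)  # clamp to avoid log(0)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, temperature)


rng = random.Random(0)
logits = [0.1, 2.0, 0.5]  # hypothetical scores for a 3-token vocabulary
print("argmax:", argmax_token(logits))
print("sampled:", [sample_token(logits, rng) for _ in range(10)])
print("relaxed:", gumbel_softmax(logits, 0.5, rng))
```

Lowering `temperature` moves the relaxed vector toward a discrete one-hot choice, which is one way to interpolate between the continuous and discrete cases.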

Language Generation

  • Sampling a low-probability word, then choose the next word - overall probability is higher than greedy sampling policy
  • Build beam-search style language searching within the network
  • Biggest current generation issue - deterministic
  • Softmax version - clearer/purer understanding.
  • Sampling is important - deterministic nature limits the model.
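A toy illustration of the first point above: picking a lower-probability first word can lead to a higher overall sequence probability than greedy decoding. The two-step distribution below is made up for the example, and `beam_search` is a small sketch for a two-token sequence, not the project's generator.

```python
# Hypothetical two-step language: probability of the first word,
# then probability of the second word given the first.
FIRST = {"a": 0.6, "b": 0.4}
SECOND = {
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def greedy():
    """Pick the most likely word at each step independently."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]


def beam_search(beam_width):
    """Keep the top `beam_width` first words, then score every continuation."""
    beams = sorted(FIRST.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
    candidates = []
    for first, p_first in beams:
        for second, p_second in SECOND[first].items():
            candidates.append(((first, second), p_first * p_second))
    return max(candidates, key=lambda c: c[1])


print("greedy:", greedy())
print("beam (width 2):", beam_search(2))
```

Here greedy decoding commits to "a" and ends at a joint probability of about 0.33, while a beam of width 2 also explores "b" and finds a sequence with probability about 0.36.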

Meeting with Han, 4/21/22

  • The input is a language itself - find structures within the image, then build connections in the image.
  • Desired properties: groundedness, compositionality, abstract representation.
  • How do you measure learnedness?
  • How does the model get the ability to distinguish between concepts?


  • Eventually after some training, the model develops language. What kind of data makes the model capture the similarity/dissimilarity?
  • How does the model realize that it is similar - what signals is the model using?
  • Accuracy and performance of a subset.
  • Influence functions - take out something, measure how influential a method is. Take one single test image, then see which group of training data is most important to the image. Perform analysis over relevance.
  • Map the importance of training samples to that of a test sample. Ignoring the sequences, we see which training samples are relevant.
  • The language should first be in the model - check if it is in the vector-quantized component either.
  • Ignore quantization to get a ‘purer’ representation to understand what is capable of being processed again. Debugging the visual unit.
  • SHAP - application to language tokens.
  • Data Shapley - Shapley methods on the data rather than necessarily the input.
  • TracIn method - tracing gradient descent explainability.

Variable-Length Architectures

  • Stopping may be easier in a reinforcement learning context.
  • External module controls which mode you want to get in.
  • Generate long sequence and then prune from it.
  • Minimum Description Length (MDL) - Wikipedia.
  • Generate a long sequence, then use a technique like MDL to compress it.
  • When you compress it too much, you’ll lose properties - but maybe you can converge to a balanced point. Obtain a naturally sized diagram.
  • Figure out the high-level project. Show something, have a reachable goal. Research question, try to solve the research question.
  • Set a solid research goal and work towards it.
  • Variable-length itself can be a piece of analysis.
  • Information and automatic encoding - conservative encoding of minimum information.


  • What are good tasks in which having a discrete latent space is better than a continuous explicit one?
  • Machine translation - try to build an abstracted version of language that can control the generation and the instantiated language, an abstract latent representation of the languages.

Meeting 3, 4/20/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 4:00 PM. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Brief progress updates on tasks (~10 min max)
  3. Plan out meeting with Ph.D. student (~20 min max)
  4. Go over website access info, how to put up resources, results, and agenda items (~20 min max)
  5. Discuss different variable-length sequence techniques (~30 min max)
  6. Additional experiment and brainstorming discussion (Remainder of time)


Agenda for Meeting w/ PhD Student

What we want:

  • To explain the idea
  • To get feedback on the ideas
    • General thoughts
    • Variable sequence length
    • Language itself
    • Better language analysis
  • Hear next directions and general steps
  1. Talk about the goal of the project - study the field of emergent language, getting a language to arise in an environment where communication is meaningful without explicitly conjuring language itself. Axioms/properties of language:
    • Symbolic/discrete - not continuous
    • Sequential - there has to be the dimension of time or intrinsic ordering
    • Variable length
    • Groundedness
    • Compositionality
  2. Explain the task (geometric binary scene similarity)
  3. Explain the current best-performing system - dual listener-speaker model
  4. Explain current analysis and results
  5. Other system ideas (if time)
  6. Collect thoughts and next directions - better visual units, better architecture, new problems, etc.

Variable-Length Sequences

  • Token confidence method of thresholding.
    • Results: average length of around 2, very unstable. If the max length is too large, experiences some sort of catastrophic forgetting - forgets everything and struggles, unlearns, is unreliable.
  • Pressure for shorter sequences: give a loss for every token it outputs. However, putting a loss on every token results in wild training curves. Performance was unstable and the model had difficulty learning, but it eventually arrived at a decent solution.
  • Alternative solution: remove the losses on the intermediate tokens, place the loss on the last token (i.e. \(\ge\) threshold, or hits max length without reaching the threshold). This yields a more stable training curve with discrete epochs in which the model discovers new discrete symbolic breakthroughs.
    • To pursue: generate the tokens vs prediction graph over training (GIF)
  • New idea: measure changes in between consecutive symbols. The addition of a new symbol must provide significant benefit.
  • Potential variant: use rolling window difference comparisons. Consider short-length importance.
  • Potential change in meaning: Yegor is horrible... + being an asshole.
    • Analog in geo scene sim task: putting specific meanings on tokens that have other arbitrary rules
  • Potential problem: difficulty producing a consistent syntax. Don’t know when the end is coming.
  • Discovered concrete evidence of a syntax in two tokens, some syntax in three tokens - but it’s not a very language like syntax.
  • Put a bidirectional recurrent unit in the listener or the generator (but not both)
  • Translate into TensorFlow, build simple bidirectional recurrent.
  • Work towards 3 colors, 3 shapes, min number of shapes 1, max number of shapes 3, Alec mode - all shapes in the scene are the same.
    • Try training with a 3 by 3 vocabulary size - start with colors, then shapes, then number. A syntax.
    • Attention seems more suitable, or use the one-hot trick - helps to facilitate the development of the specific syntax.
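A rough sketch of the token confidence thresholding idea from the list above: keep emitting tokens until the confidence reaches a threshold or the maximum length is hit, and treat only the final step as the place where the loss would be applied. The `step_fn` callback and the toy confidence schedule are hypothetical stand-ins for the real speaker and listener.

```python
def speak_until_confident(step_fn, threshold, max_length):
    """Emit tokens until confidence >= threshold or max_length is reached.

    step_fn(tokens_so_far) must return (next_token, confidence_after_token).
    Returns the emitted tokens, the final confidence, and the stop reason.
    In the 'loss on the last token only' variant, the final confidence is
    the only value that would feed the loss.
    """
    tokens = []
    confidence = 0.0
    reason = "max_length"
    for _ in range(max_length):
        token, confidence = step_fn(tokens)
        tokens.append(token)
        if confidence >= threshold:
            reason = "threshold"
            break
    return tokens, confidence, reason


def make_toy_step(confidences):
    """Toy step function: token i is just its position, with a fixed confidence."""
    def step(tokens_so_far):
        i = len(tokens_so_far)
        return i, confidences[min(i, len(confidences) - 1)]
    return step


step = make_toy_step([0.4, 0.7, 0.95])
print(speak_until_confident(step, threshold=0.9, max_length=5))
print(speak_until_confident(step, threshold=0.99, max_length=2))
```

With the toy schedule, the first call stops after three tokens because the threshold is reached, and the second stops after two tokens because it runs out of length.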






  • Intrinsic vocabulary size via loss penalization.
  • Allow for speaking nothing, silent tokens.


  • What counts as a ‘working model’?
  • One object - any color, any shape; reached > 90%. 1 sequence length, 10 vocabulary size.
  • Potential behavior: throws out terms and has other terms disambiguate. A lot of weird and interesting compositional behavior.
  • Try out things that we can get with the MNIST dataset.
  • Accuracy and measuring performance - you can get very high accuracy with very little understanding of the image.
  • Training Siamese network in Alec mode with 3 by 3 - reaches 75% accuracy. Looking at the encodings: a token that means green, green triangle, redundant tokens (e.g. ‘not green triangle’). Not very language-like.
  • Focuses on one attribute
  • Potential solution: only vary one attribute in different images. A pair of images are only differing by one attribute/along one dimension.
  • Supercharged Alec mode: image is not defined by a set of shape objects, but the number of things in it, the shape types, and the shape colors. Two paired images only differ in one axis. One-axis variation along Alec mode, strong Alec mode.
  • Language like - everything is encoded sufficiently, there is sufficiently clear evidence of patterns.

Progressive Language Learning

  • Progressively introduce new
  • Initial dataset: MNIST-type dataset with decent accuracy, then trained on more complex dataset.
  • Introduction of new tokens and token usage, observe the theoretically stable development of language.


  • min_shapes = 0 and developing a nothing token
  • Audio, multimodal inputs for language
  • GAN-type model
  • Cheating the Vector Quantizer - when you listen to yourself, don’t quantize: listen to what you mean, not what you say. Instead of actually quantizing, have a loss when shifted off: just quantize directly. (This is how it is done).
  • Maybe not use quantization at all, maybe add loss to encourage clustering/discrete behavior.

Meeting 2, 4/13/22

Agenda and Goals

We will meet in CSE2 (exact room to be announced in Discord before meeting starts) at 4:00 PM. The tentative agenda is as follows.

  1. Friendly and lively chatter about life (~10 min max)
  2. Brief progress updates on tasks (~10 min max)
  3. Evolution of the dataset - designing the dataset and further expansions (~30 min max)
  4. Deep learning and computational linguistics - measuring and interpreting generated languages (Most of the time)
  5. Modeling approaches (Remainder of time)



  • Yegor’s dataset work: Demo notebook, Script
  • Two scenes are considered different if they do not share the same objects (defined by the color and the shape attributes).
  • The dataset includes meta-data on shapes and colors.
  • Goal of the network - should be semantically meaningful, we want some sort of language to emerge. If we just use an autoencoder, we will end up with very specific information encoded. The best we do is a vector-quantized autoencoder, which is not quite language-like.
  • This task seems especially suited to describe what is in the scene.
  • Dataset includes arbitrary rotation.
  • Possible extensions:
    • Different sizes - needs to more fully understand the semantic properties of the object.
  • Potential usage of capsule networks
  • Potentially train a visual unit directly without any language - train a raw Siamese network on the dataset.
  • Video based dataset with motions
  • Dataset with arbitrary colors - helps to impose a discrete vocabulary onto a quasi-continuous color spectrum.
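A small sketch of the scene-difference rule noted above, where objects are identified only by color and shape. The scene representation (a list of dicts with `color` and `shape` keys) is an assumption for illustration and is not taken from the dataset script; comparing counts as well as object types is also an assumption.

```python
from collections import Counter


def scene_signature(scene):
    """Reduce a scene to a multiset of (color, shape) pairs.

    Position and rotation are ignored, since the notes define objects
    by color and shape only.
    """
    return Counter((obj["color"], obj["shape"]) for obj in scene)


def scenes_differ(scene_a, scene_b):
    """Two scenes differ if their (color, shape) multisets are not equal.

    Assumption: object counts matter. To ignore counts, compare
    set(scene_signature(...)) instead.
    """
    return scene_signature(scene_a) != scene_signature(scene_b)


# Hypothetical scenes with extra metadata that should not affect the label.
scene_1 = [{"color": "red", "shape": "square", "rotation": 30},
           {"color": "blue", "shape": "circle", "rotation": 0}]
scene_2 = [{"color": "blue", "shape": "circle", "rotation": 90},
           {"color": "red", "shape": "square", "rotation": 10}]
scene_3 = [{"color": "green", "shape": "triangle", "rotation": 45}]

print(scenes_differ(scene_1, scene_2))
print(scenes_differ(scene_1, scene_3))
```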

Linguistic Measurements

Paper notes - “Determining Compositionality of Word Expressions Using Various Word Space Models and Measures”

  • Working with a set of English expressions rather than artificially generated languages
  • 5 different word space models
  • Certain word space methods with the highest correlations of predicting compositionality are the ones that require the largest datasets of different documents
  • Maybe dubious to apply natural language synthesized languages to generate languages without being computationally grounded.
  • Maybe makes sense to roll our own analysis of how language is generated?

Current results seem compositional.

  • Vector quantization - take a vector that should theoretically be an abstract embedding, and snap the entire space to a discrete set of tokens.
  • Model - listening, speaking, and visual units.
  • The model is quite simple, but it seems to work well.
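A minimal sketch of the vector quantization step described above, in which a continuous embedding is snapped to the nearest entry of a discrete codebook. The codebook values are made up for the example, and this plain-Python version leaves out the gradient handling a trainable quantizer would need.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (token_index, codebook_vector) for the nearest codebook entry."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three tokens.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(quantize([0.9, 0.8], codebook))
print(quantize([0.1, 0.7], codebook))
```

Each embedding is replaced by a token index, so every point in the continuous space maps onto one of a fixed set of discrete symbols.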

Example demonstration of synonymous encodings.


Collected generated language information.


  • Activation maximization to find the optimal image that is described by some sentence.
  • A lot of rules can be intuitively grasped just via experimentation.
  • Argument/statement for compositionality:
  • Decreasing vocabulary size may help with compositionality - prevents the network from merely developing everything.
  • Compositionality: a symbol for one axis appears consistently irrespective of other axis dimensions.
  • Current model exhibits dumb compositionality. We see that a token exists for green (0), and 4 seems to distinguish ‘pointiness’.

Green square, triangle, circle (in that order).

  • Extrapolation by holding out on certain shapes: e.g. add a star, very pointy object, hold out on green triangle and see if the language still applies.
  • Weakness in meaningful extrapolation of object counting - any number of blue circles beyond 1 is the same sequence of tokens, which is entirely different from the token sequence used for one blue circle. It ascribes ‘one vs many’ to different tokens.
  • Potentially add boundaries to prevent training via color estimation.
  • Augment the dataset - produces different but similar scenes with a lot of objects.
  • The rules make sense, but they’re probably too complicated - decrease vocabulary size.
  • Set up Turing test - Andre.


  • How to make models variable length?
  • Cost per token/metabolism cost for speaking, introduce a ‘silence’/‘pause’ token.
  • Dually output the desired length of the sequence to cut off.
  • Recurr length generation and set threshold. “I will not stop speaking until what I have said is sufficient to describe what I mean.”
  • Zipf’s law - we see that the frequencies of tokens in the generated sequences more or less follow Zipf’s law (the \(n\)th most common token appears about \(\frac{1}{n}\) as often as the most common token).
  • Train a very simple model to answer simple questions based on the sequence input to statistically extract information.
  • Counting - generate a pathological dataset
  • Increase sophistication/power of the model - LSTMs, attention, etc.
  • Will help with larger image sizes.
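One way the Zipf’s law observation in the list above could be checked: count token frequencies across generated sequences, rank them, and compare each token's frequency relative to the top token against \(\frac{1}{n}\). The sequences below are invented for illustration and are not project results.

```python
from collections import Counter


def zipf_table(sequences):
    """Rank tokens by frequency and compare against the 1/rank prediction.

    Returns rows of (rank, token, count, observed_ratio, zipf_ratio), where
    observed_ratio is count / count_of_most_common_token.
    """
    counts = Counter(token for seq in sequences for token in seq)
    ranked = counts.most_common()
    top_count = ranked[0][1]
    return [
        (rank, token, count, count / top_count, 1 / rank)
        for rank, (token, count) in enumerate(ranked, start=1)
    ]


# Hypothetical generated sequences of integer tokens.
sequences = [[0, 0, 0, 0, 0, 0], [1, 1, 1], [2, 2]]

for row in zipf_table(sequences):
    print(row)
```

Rows where `observed_ratio` stays close to `zipf_ratio` are consistent with the \(\frac{1}{n}\) pattern; large gaps suggest the token distribution deviates from it.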


  • Alec - Work on Siamese architecture, work on porting over PyTorch to TF
  • Amelia - Once the pattern dataset is finished, work on statistical/programmatic analysis on pattern recognition
  • Andre - Turing test program for Friday, beef up DLSM, Siamese discrete-language architecture testing, work on porting over PyTorch to TF
  • Eric - Work on object similarity problem without language to develop better visual unit
  • Yegor - Clean up analysis code, export dataset into a more usable format (with image references included), fixing and debugging model. Ask NLP professor how to make explicitly variable-length representations internally.

Meeting 1, 4/6/22

Agenda and Goals

This is our first ‘formal’ project meeting together. This meeting has the following objectives:

  • Get to know the team members - interests, specialties, skills, ideas
  • Bring everyone on the same page on the project’s foundational motivations and vision
  • Introduce everyone to the work and research that has already been done by the team
  • Set out directions for team members to work on

Our room is reserved from 4:00 PM to 6:00 PM. We may meet anywhere from 1 to 2 hours, depending on how the discussion flows. We will adhere to the following tentative schedule:

  1. Round table introductions on interests, specialties, skills. Make sure all team members know each other as thinkers and people. Max 10 min.
  2. Present & discuss the project’s foundational motivation and vision, clarify + shape + expand upon the current vision with input from the team. Max 30 min.
  3. Present & discuss literature in the field of Emergent Language. Max 30 min.
  4. Present & discuss work and research that has already been done. Max 30 min.
  5. Discuss new directions and ideas, assign people to work on certain directions. Ideally most time during the meeting is spent on this.


  • Split in literature - autoencoder vs signal game.
  • Compositional - better defined in the context. You’re able to decompose and recompose language as parts, which can be stitched together however you like.
    • A sentence depends on its parts, and you can build things up from it.
  • Imposing language-like mathematical formulations on the language encoding.
  • Discrete properties emerge from even continuous representations.
  • Why does discreteness emerge in human language? To remove noise, categorical.
    • Potential idea: if we want a continuous output, add in noise.
    • VQ-VAE adapted for language
  • Maximizing the concept of understanding. Make the models share weights (?)
  • Alternate technique - add visual understanding to the language representation before decoding.
  • Optimize language based on language or optimize language based on image?

Crude drawing of a ‘no image signal’ architecture (top) vs a ‘yes image signal architecture’ (bottom) for the problem of image object similarity.



Alec and Yegor gave awesome mini-lectures on CNNs, RNNs, and backprop as an introduction to major components used in this project’s system design.

Convolutional Neural Networks

Recurrent Neural Networks



  • Alec - Create ‘no image signal’ architecture
  • Amelia - Research computational linguistics > language quantitative metrics
  • Andre - Create shape dataset
  • Eric - Research computational linguistics > language quantitative metrics
  • Yegor - Create ‘yes image signal’ architecture

I2 - Fusing neuroscience and AI to study intelligent computational systems. Contact us at