
Experiments

Aggregated experiments and results


Table of Contents

  1. Cartpole Comparisons
    1. Vanilla Advantage Actor Critic (A2C)
      1. Activation Function Experiments
      2. Results
    2. Spiked Learning Rate
      1. Justification
      2. Implementation
      3. Results
    3. A2C with Memory Replay
      1. Results
    4. Proximal Policy Optimization (PPO)
      1. Results
  2. Other Environments
    1. MuJoCo

Cartpole Comparisons

Vanilla Advantage Actor Critic (A2C)


Activation Function Experiments

We tested combinations of activation functions (ELU, Sigmoid, LeakyReLU, ReLU, Tanh, Hardswish, Hardsigmoid, Hardtanh, among others) on the actor and critic networks respectively.
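A rough sketch of how such a sweep can be set up in PyTorch (the network sizes, layer count, and helper names here are assumptions for illustration, not the repo's actual code):

```python
import itertools
import torch.nn as nn

# Candidate hidden-layer activations for the actor and critic.
ACTIVATIONS = {
    "ELU": nn.ELU, "Sigmoid": nn.Sigmoid, "LeakyReLU": nn.LeakyReLU,
    "ReLU": nn.ReLU, "Tanh": nn.Tanh, "Hardswish": nn.Hardswish,
    "Hardsigmoid": nn.Hardsigmoid, "Hardtanh": nn.Hardtanh,
}

def make_net(in_dim, hidden_dim, out_dim, activation):
    """Two-layer MLP with a configurable hidden activation."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        activation(),
        nn.Linear(hidden_dim, out_dim),
    )

# Sweep every actor/critic activation pairing (CartPole: 4 observations, 2 actions).
for actor_act, critic_act in itertools.product(ACTIVATIONS, ACTIVATIONS):
    actor = make_net(4, 128, 2, ACTIVATIONS[actor_act])
    critic = make_net(4, 128, 1, ACTIVATIONS[critic_act])
    # train the A2C agent with this pair and record its reward curve
```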

Results

The ELU/ReLU and Tanh/ReLU actor/critic combinations worked best. Further details here.

Spiked Learning Rate

Justification

It is a well-observed phenomenon in the brain that when a misprediction occurs, neuronal activity is larger than if that same prediction had been accurate. As such, we sought to emulate similar behavior in the actor model by increasing its learning rate (mu) under certain criteria. The candidate criteria for testing include:

  • When the total model reward breaks a previously set threshold
  • When the model loss is steadily decreasing across a certain number of episodes
    • May not work given how actor-critic models train: they are rewarded rather than punished, so increasing the learning rate on poor performance may drive performance down further. Will investigate empirically.
  • When the model’s reward decreases
    • Same caveat as above: spiking on poor performance may reinforce it. Will investigate empirically.

Implementation

I decided to tackle the first spiking criterion with a simple implementation: when the reward for an episode breaks a threshold (set by the previous highest reward in the epoch), the learning rate is spiked, i.e. multiplied by a constant set in the config file.
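A minimal sketch of that rule in PyTorch (the config keys, base learning rate, and function name are hypothetical; the description above does not say whether the rate is reset after a spike, so this sketch restores the base rate on non-record episodes):

```python
# Assumed config values; the real ones live in the repo's config file.
config = {"lr": 7e-4, "spike_factor": 5}

best_reward = float("-inf")  # highest episode reward seen so far this epoch
base_lr = config["lr"]

def maybe_spike_lr(optimizer, episode_reward):
    """Spike the learning rate for the next update when the episode reward
    breaks the previous best; otherwise run at the base rate."""
    global best_reward
    if episode_reward > best_reward:
        best_reward = episode_reward
        lr = base_lr * config["spike_factor"]
    else:
        lr = base_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
```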

The spiking constant was 5, and in this test I could not observe any significant difference between the two runs (non-spiked on the left, spiked on the right). The graphs were noisy, which may be part of the issue. I may re-run the tests with a greater number of epochs, experiment with a larger spiking constant to elicit some kind of change, and also try the other spiking criteria.

Results

Inconclusive/Ineffective
[Figures: non-spiked run (left) vs. spiked run (right)]

A2C with Memory Replay

Results

Converged after roughly 1100 episodes.
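For reference, a minimal uniform replay buffer of the kind this experiment implies (a sketch; the capacity, batch size, and how it plugs into the A2C update are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done)."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # Uniformly sample a batch of stored transitions for the update step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```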

Proximal Policy Optimization (PPO)

Results

Converged after roughly 600 episodes.
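PPO is defined by its clipped surrogate objective; a brief PyTorch sketch of that loss (the epsilon value and tensor names are illustrative, not this experiment's recorded hyperparameters):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate loss: limits how far the updated policy can move
    from the policy that collected the data."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (minimum) objective, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```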


Other Environments

MuJoCo


