Simple Maze Experiment – Part 5

After running many variations of training with many variations of hyperparameters, plus a few changes to the agent and the training code itself, I’m still not entirely sold that Visual Observation is the best Machine Learning method for a Unity Agent, at least not for the way I wanted to use it.

To start with, I limited myself to changing hyperparameters only, and ended up settling on the following:

use_recurrent: true
sequence_length: 64
memory_size: 256
num_layers: 2
gamma: 0.99
batch_size: 128
buffer_size: 2048
num_epoch: 5
learning_rate: 3.0e-4
time_horizon: 64
max_steps: 5e4
beta: 5e-3
epsilon: 0.2
normalize: false
hidden_units: 512

However, I found that tweaking the parameters alone was not enough, so in addition to these parameters, I also updated the agent to have 4 cameras (pointing in the 4 cardinal directions), and to interpret the view from these cameras as 80×80 pixels. This was a required change, as the original version that tried to process a 640×480 resolution window would run out of memory and crash. I’m not sure if the additional cameras have helped or not, but dropping the resolution of the images definitely did stop the crash issues. I also increased the training speed to 50, changed the agent interaction from a rigidbody to a character controller, and reduced the decision-making interval down to 3 (from 5).
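To put some numbers on the crash: below is a rough sketch of the raw observation sizes involved, assuming RGB channels stored as float32 and the buffer_size of 2048 from the config above (the exact storage format inside ML-Agents may differ, so treat these as order-of-magnitude figures).

```python
# Back-of-envelope sizes for raw visual observations.
# Assumes RGB (3 channels) stored as float32 (4 bytes); the real
# ML-Agents internals may store frames differently.
BYTES_PER_FLOAT = 4
CHANNELS = 3

def obs_bytes(width, height, cameras):
    """Bytes needed to hold one step's worth of camera observations."""
    return width * height * CHANNELS * BYTES_PER_FLOAT * cameras

full_res = obs_bytes(640, 480, 1)  # the original single 640x480 view
low_res = obs_bytes(80, 80, 4)     # four 80x80 cardinal cameras

print(f"640x480, 1 camera:  {full_res / 1024:,.0f} KiB per step")
print(f"80x80,   4 cameras: {low_res / 1024:,.0f} KiB per step")

# Scaled up by the experience buffer (buffer_size = 2048):
print(f"Buffer at 640x480:  {full_res * 2048 / 2**30:.1f} GiB")
print(f"Buffer at 80x80 x4: {low_res * 2048 / 2**30:.2f} GiB")
```

At full resolution a single buffer’s worth of frames is on the order of 7 GiB, versus well under 1 GiB for all four low-resolution cameras combined, which goes a long way towards explaining the crashes.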

In regard to the hyperparameters, the key points are:

  • I have told the agent to use a recurrent network (use_recurrent), meaning it can remember the last few observations and actions it has taken,
  • I’ve given the neural network 2 hidden layers (num_layers), each with 512 units (hidden_units).
  • I’ve also set the full training iteration to run over 50,000 steps (max_steps). I did have longer runs set up, but they appeared to make minimal difference to the result. A set of hyperparameters with 3,000,000 steps actually gave me worse results than 50,000 steps, possibly due to overfitting the network.

I’ve added a TensorBoard breakdown of the two runs below:

This first one is the 5e4 (50,000) steps run. The maximum reward in a run is 1, though the agent loses a percentage of this depending on how long it takes to get to the goal. As you can see in the cumulative_reward graph, after the initial learning hurdle, the agent started to work out where it was going, and consistently scored between 0.7 and 0.8 of the reward.
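For clarity, the time-scaled reward can be sketched like this. The function below is my own illustrative version, not the actual agent code; the real penalty schedule may well differ.

```python
# Hypothetical sketch of a time-penalised reward: the agent earns at most
# 1.0 for reaching the goal, reduced in proportion to how much of its step
# budget it used. Names and formula are illustrative, not ML-Agents API.
def goal_reward(steps_taken, max_episode_steps, max_reward=1.0):
    penalty = steps_taken / max_episode_steps  # fraction of the budget used
    return max(0.0, max_reward - penalty * max_reward)

print(goal_reward(100, 1000))  # a fast solve keeps most of the reward
print(goal_reward(750, 1000))  # a slow solve keeps much less
```

Under this sketch, an agent consistently scoring 0.7–0.8 would be reaching the goal within the first 20–30% of its allowed steps.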

The 3e6 (3,000,000) step run, while taking a lot longer to run, was also a lot more inconsistent:

This leads me to believe that the model wasn’t really “learning” all that well after the initial spike, as it kept repeating its mistakes and consistently failing to earn a stable reward.

Interestingly, if we overlay the same time period of the two runs (the first 30 minutes), the patterns are actually a bit more similar:

After these runs, I was curious as to how much the number of cameras was affecting the agent’s ability to find its goal, so I did two more runs. I kept all the hyperparameters as above, and simply cut the number of cameras on the agent from 4 to 1 (a forward-facing camera). The results are below (orange line), overlaid with the original 50k run with 4 cameras (red line).

As you can see, while there is a minor difference in the training, it is not enough of a difference between one camera and four to justify the overhead of having all four present.

Another thing I noticed, both during training and in the finished model, is that the agent doesn’t really behave as I’d expect. Rather than looking for the exit and heading directly for it, it seems to follow the right- or left-hand rule: it sticks to the side of the arena until it gets near the exit point, then runs towards it. At first this confused me, as I couldn’t understand why it would do this and not just head straight for the exit, and then I realised the issue was twofold.

The thing is, I’d reduced the size of the camera image to 80px by 80px, which meant that any form of learning was being done on a tiny square that may not obviously show where the exit is.
This is especially true when I took into account that my exit point was literally 0.2 units high in Unity (basically a very slightly raised platform), which in an 80×80 pixel snapshot from the agent’s camera is barely visible unless you are standing on top of it. The result was an agent that ran around in concentric circles until it hit the platform, rather than actually looking for and identifying the goal location.
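Some rough numbers back this up. Assuming Unity’s default 60° vertical field of view and a level camera (both assumptions on my part, not measured from the project), a 0.2-unit-high platform covers only a few pixels of an 80-pixel-tall image:

```python
import math

# Rough estimate of how tall a 0.2-unit platform appears in an
# 80px-high camera image. Assumes Unity's default 60-degree vertical
# FOV and a level camera; these are illustrative assumptions.
def apparent_height_px(object_height, distance, image_px=80, vfov_deg=60.0):
    # Height of the world slice visible at this distance.
    view_height = 2.0 * distance * math.tan(math.radians(vfov_deg / 2.0))
    return object_height / view_height * image_px

for d in (2, 5, 10):
    print(f"at {d} units away: ~{apparent_height_px(0.2, d):.1f} px")
```

At 5 units away that works out to roughly 3 pixels, and at 10 units under 1.5, which is very little signal for the network to pick up on.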

In order to rectify this, I dropped a giant green cube on top of the exit to act as a beacon:

The original goal design.
The improved goal design.

I then built these changes and re-ran the 50k single-camera training run, with the following result. The new goal design is the blue line; the old goal design is the same orange line as above.

While there was a lot more accuracy in the initial few runs, the reward still appeared to fluctuate quite wildly over the course of the run, which makes me think that while this change may have had a small impact, it clearly wasn’t enough. On top of that, when I took the finished model and dropped it into Unity, the results were…less than stellar.

Once the goal starts moving around, the agent resorts to running around the edge trying to find it, despite having a clear line of sight across the playspace. It’s more akin to having memorised the areas the goal may appear in, and checking them, than actually looking for where the goal is and going to it. I was able to confirm this theory by making a new goal and placing it in a location the other goals had not been: the agent simply ran around the edge of the field for a while, checking the usual places, and eventually managed to hit it by complete accident when it ran across the field in a zig-zag pattern.
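For comparison, the edge-hugging behaviour looks a lot like the classic right-hand wall-following rule. It can be sketched on a toy grid maze like this (the maze layout below is made up purely for illustration; ‘#’ is wall, ‘G’ is the goal):

```python
# Minimal right-hand-rule wall follower on a toy grid maze, similar to
# the edge-hugging strategy the trained agent appears to have learned.
MAZE = [
    "#######",
    "#.....#",
    "#.###.#",
    "#...#G#",
    "#######",
]

# Directions ordered clockwise: up, right, down, left.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def right_hand_walk(start, facing, max_steps=100):
    """Follow the right-hand wall until the goal cell is reached."""
    r, c = start
    path = [start]
    for _ in range(max_steps):
        if MAZE[r][c] == "G":
            return path
        # Prefer turning right, then straight, then left, then back.
        for turn in (1, 0, -1, 2):
            d = (facing + turn) % 4
            nr, nc = r + DIRS[d][0], c + DIRS[d][1]
            if MAZE[nr][nc] != "#":
                r, c, facing = nr, nc, d
                path.append((r, c))
                break
    return path

path = right_hand_walk((1, 1), facing=1)
print(path[-1])  # ends on the goal cell (3, 5)
```

The notable property of this strategy is that it needs no idea of where the goal actually is, which matches what the agent does: it reliably stumbles onto a goal near the walls, but is nearly blind to one placed in open space.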

I think my next step will be to increase the camera resolution for the agent, and see if that yields any kind of improvement.