Today I decided to jump right to a simple convolutional model before getting a larger dataset, just to compare it with the MLP I used last time. It uses one 2D convolution that operates on all 6 channels of the two input frames, followed by a ReLU activation, a fully-connected layer, another ReLU, and a final fully-connected layer. This is a bigger model that takes longer to converge, and it performs no better on the tiny dataset. This should come as no surprise! All it’s really doing is putting a bunch of filters on the front of our simple network from before.
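The architecture described above can be sketched in PyTorch roughly like this. The filter count, kernel size, hidden width, and frame resolution here are illustrative guesses, not my actual hyperparameters:

```python
import torch
import torch.nn as nn

class ScrollConvNet(nn.Module):
    """Sketch: one conv over the 6 stacked channels, then two FC layers."""
    def __init__(self, h=32, w=32, filters=16, hidden=64):
        super().__init__()
        # all 6 channels of the two stacked RGB frames go in at once
        self.conv = nn.Conv2d(6, filters, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(filters * h * w, hidden)
        self.fc2 = nn.Linear(hidden, 2)  # predict (horizontal, vertical) scroll

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = torch.relu(self.fc1(x.flatten(1)))
        return self.fc2(x)

frames = torch.randn(4, 6, 32, 32)  # a batch of two-frame pairs
out = ScrollConvNet()(frames)
print(out.shape)  # torch.Size([4, 2])
```

The only thing separating this from the earlier MLP is the single conv layer in front, which is why the comparable performance is unsurprising.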
There was nowhere else to go at this point except getting more data. For this experiment, a short Mario 3 trace was enough to see whether the model could learn on a single level of a single game (about 130 scrolling examples). This meant writing a quick dataset loader and getting the data into the right shape.
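The shaping step looks roughly like the following sketch: pair each frame with the next one, stack the pair along the channel axis, and convert to the channels-first float layout the model expects. The function name and the assumption of RGB uint8 frames are mine, not from the actual loader:

```python
import numpy as np

def make_pairs(frames, scrolls):
    """frames: (T, H, W, C) uint8; scrolls: (T-1, 2) per-step pixel offsets."""
    # stack frame t and frame t+1 along the channel axis -> (T-1, H, W, 2C)
    pairs = np.concatenate([frames[:-1], frames[1:]], axis=-1)
    # channels-first and normalized for the network
    x = pairs.transpose(0, 3, 1, 2).astype(np.float32) / 255.0
    return x, scrolls.astype(np.float32)

frames = np.zeros((131, 32, 32, 3), dtype=np.uint8)   # 131 frames
scrolls = np.zeros((130, 2))                          # 130 scroll labels
x, y = make_pairs(frames, scrolls)
print(x.shape, y.shape)  # (130, 6, 32, 32) (130, 2)
```

131 consecutive frames yield the ~130 scrolling examples mentioned above, since each example needs two adjacent frames.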
At this point, the MLP’s loss plateaued and wouldn’t go down any further, so I added another fully-connected layer just to see what would happen. Playing with the number of hidden nodes, the batch size, and the learning rate didn’t help much. Maybe a patch encoding would do better (say, by first applying a separate fully-connected layer to each 16×16 patch of the input image, and then merging the results at the input to the network), but I wanted to see whether a very simple convolutional network could do better at generalizing. I also ended up converting my images to grayscale to save a little GPU memory.
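The patch-encoding idea above could start with something like this sketch: cut each grayscale image into non-overlapping 16×16 patches, so a shared fully-connected layer could then be applied to each patch before merging. The helper name and the 64×64 image size are hypothetical:

```python
import numpy as np

def to_patches(img, p=16):
    """Split a (H, W) grayscale image into non-overlapping p x p patches,
    returned as a (num_patches, p*p) array, row-major patch order."""
    h, w = img.shape
    # (h//p, p, w//p, p) -> (h//p, w//p, p, p) -> (num_patches, p*p)
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

gray = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patches = to_patches(gray)
print(patches.shape)  # (16, 256): a 4x4 grid of 16x16 patches
```

Each row could then be fed through the same small fully-connected layer, with the outputs concatenated as the network's input.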
Reader, after fiddling with some hyperparameters and a couple of thousand epochs, it kind of almost did—it reported the correct horizontal scrolling, but not vertical (odd, since vertical scrolling is always 0 in these examples). My next step is to get the convolutional model to properly overfit this small 130-example dataset.