I don’t have much of an intuition about convolutional neural nets. I think that in this application single-pixel features are really important, so I have avoided pooling so far. Still, I thought it would be worth giving a more traditional convolve->pool->convolve->pool architecture a try.
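For concreteness, the sketch below is roughly what I mean; the single-channel input, channel counts, and kernel sizes are placeholders for illustration, not my actual settings.

```python
import torch.nn as nn

# Rough sketch of the convolve->pool->convolve->pool stack I tried.
# The layer sizes here are illustrative assumptions, not the real model.
conv_pool_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # pooling throws away single-pixel detail
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(2),                # regression head for (scroll_dx, scroll_dy)
)
```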
I didn’t get any better convergence with that, so I started to suspect my dataset was faulty. In fact, I had an off-by-one error in my CSV loader: it treated the first row of data as a header row. Classic mistake! So the learned relations were basically garbage. After fixing that, I tried to train the models on just the last ten or so samples. They overfit well enough, so I ran all three models again on half of the dataset.
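For anyone who hits the same thing, the fix amounts to telling the loader that there is no header row. A minimal sketch with pandas (my actual loader and the column names here are assumptions):

```python
import pandas as pd

# Buggy behavior: read_csv assumes the first row is a header, so the first
# real sample gets consumed as column names and, if the frames are loaded
# separately, every label slides one row out of alignment with its frame.
# samples = pd.read_csv("mario3_trace.csv")

# Fix: declare that there is no header row.
samples = pd.read_csv(
    "mario3_trace.csv",
    header=None,
    names=["frame_path", "scroll_dx", "scroll_dy"],  # hypothetical column names
)
```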
At this point I had tried a lot of different things, so I stepped back and noticed something interesting about how the models were behaving. The simple MLP was hitting a plateau (even if I added more parameters), while the single-convolutional model began to do much better in terms of error, although it had a strong tendency to predict negative scrolling in my example case. When I removed the batch normalization layer from that model, it began to act more in line with the simple MLP. Since all three models (even the double-convolutional one!) were now converging to the same loss (and predicting a scroll of (1,0) for my (13,0) test example), I got suspicious again. Why were they all ending up at the same place, learning essentially the same function? Why did they get stuck, and why was the result so wrong with respect to my example case? I looked at the biases in the last layer and indeed they were just (1,0); I guess predicting (1,0) for everything gave a pretty decent overall error on this data set.

It turned out the culprit was my loss function, which I had set to L1 loss (mean absolute error). For a constant prediction, L1 loss is minimized by the per-coordinate median of the targets, and in a dataset dominated by small scrolls there isn’t much gradient pressure to move away from something like (1,0). For this problem, mean squared error gave much better results.
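Here’s a tiny, throwaway illustration of the difference; the scroll values below are invented, not my real data, and the criterion variable name is just a stand-in for however you wire up the training loop.

```python
import torch
import torch.nn as nn

# Mostly-small scrolls with a few big jumps, roughly like a level trace.
targets = torch.tensor([[1.0, 0.0]] * 150 + [[13.0, 0.0]] * 20)
constant_guess = torch.tensor([[1.0, 0.0]]).expand_as(targets)

print(nn.L1Loss()(constant_guess, targets))   # small: a constant guess looks fine
print(nn.MSELoss()(constant_guess, targets))  # much larger: big misses dominate

# The change in the training code is just swapping the criterion:
# criterion = nn.L1Loss()
criterion = nn.MSELoss()
```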
In both of these quirky problems, walking away from the machine and getting some air helped me in ways that continually trying more permutations of the input hyperparameters just wouldn’t have. I often tell students that they should take a break from a problem if they are spending too long on it, or feeling stuck and trying things randomly. It’s hard to remember that advice in my own work sometimes!
At any rate, batch normalization seemed like it was worth adding to the two-layer convolutional model, so I threw it in there. Now both the single- and double-convolutional models were fitting pretty well to a subset of the data and giving good results for the test case. When I expanded the data to include the full 170 examples from a single play of Mario 3 1-1, I saw much faster convergence for the double-convolutional model, although both the single- and double-convolutional models continued to improve with more training time (up to 4000 epochs, where I arbitrarily cut off the training).
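Roughly, the batch normalization goes right after each convolution; something like the sketch below, where the layer sizes (and the placement of the norm relative to the activation) are illustrative assumptions rather than a faithful copy of the model.

```python
import torch.nn as nn

# Where batch normalization ends up in the double-convolutional model (sketch).
double_conv_bn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalize each conv's output before the activation
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(2),     # (scroll_dx, scroll_dy) regression head
)
```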
My next step is to record two longer playthroughs of (the same) Super Mario 3 level so I can have larger, separate training and test sets. Once a model is trained effectively on that dataset, I’d like to exercise these models on traces from more Mario 3 levels, and then on traces from more games!
I also noticed that one CPU core was pegged during training, so I want to run a profiler on the code to see if there’s a bottleneck there. The GPU is saturated the whole time, so I don’t think the CPU work is slowing down training, but I’m curious about what’s happening there.
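The plan is probably just to wrap the training loop in cProfile and look at cumulative times; something like the sketch below, where train_one_epoch is a hypothetical stand-in for my real training function.

```python
import cProfile
import pstats

def train_one_epoch():
    ...  # stand-in for the real training loop

# Profile one epoch and print the top 20 entries by cumulative time.
with cProfile.Profile() as profiler:
    train_one_epoch()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```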