The Simplest Possible Model, 2022-09-29

After the last journal entry I realized it would be a better move to collect a little bit of data (just a few frames) from a single game, then straightaway build a model that could perfectly fit it and measure its inference performance. This would exercise the whole pipeline, from collection through to inference, and give a lower bound on its wall-clock time (which matters because I hope this can be used in an interactive setting). So, I’ll use “Super Mario Bros. 3” as my tiny example, since it scrolls both left and right.

First, I had to locate two frames with some scrolling offset between them. I chose frames 630 and 637 of my test run, the 90th and 91st samples in the CSV file. According to line 91 of the data file, the scroll from 630 to 637 was 13 pixels, and double-checking by eye it looked right. I thought it would be good to have four cases in my training data for this super-overfit, extremely simplistic model: one with positive scrolling, one with negative scrolling, and two that were stationary. I could use 630->637, 637->630, 630->630, and 637->637 respectively.
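
Just to pin that down, the four cases look something like this. The file names are placeholders for wherever my frame dumps actually live, and the ±13-pixel targets are the value from line 91 of the CSV:

```rust
// The four overfit cases, built from just two frames. Targets are scroll
// deltas in pixels; file names are placeholder paths for my frame dumps.
static CASES: [(&str, &str, f32); 4] = [
    ("frame_0630.png", "frame_0637.png", 13.0),  // positive scroll
    ("frame_0637.png", "frame_0630.png", -13.0), // negative scroll
    ("frame_0630.png", "frame_0630.png", 0.0),   // stationary
    ("frame_0637.png", "frame_0637.png", 0.0),   // stationary
];
```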

For implementing and training the neural network, I was torn between Python (which is convenient but error-prone and unpleasant) and Rust with Torch bindings (tch-rs). I’ve never used the Rust Torch bindings before, and they look pretty nice, so I’m going to give them a shot. After a bit of cargo new and cargo add (and making sure I had libtorch and CUDA installed properly, itself always a half-hour adventure), the project was ready.

My first goal was just to load up those two images and create the dataset in code. Once I had a model that could overfit it, I would load the data through a more realistic pipeline (but still limit it to just a few frames by taking a small slice of the dataset). Then I would load a complete trace, avoiding artifacts from things like scene transitions (where the NES hardware scrolls a whole screen over at once) by filtering out scroll changes larger than some threshold. That’s a plan!
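
That last filtering step will probably end up being something as simple as the sketch below; the record shape (frame index, scroll delta in pixels) and the threshold value are placeholders until I look at the real trace:

```rust
/// Sketch of the planned transition filter: drop samples whose scroll change
/// exceeds a threshold, on the theory that a huge jump means a scene cut
/// rather than real scrolling. Threshold and record shape are placeholders.
fn filter_transitions(samples: Vec<(u32, i32)>, max_scroll: i32) -> Vec<(u32, i32)> {
    samples
        .into_iter()
        .filter(|&(_frame, scroll_dx)| scroll_dx.abs() <= max_scroll)
        .collect()
}
```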

I copied over tch-rs’s hello world example and modified it to print whether CUDA was supported. It wasn’t! I had to download libtorch from the torch website and set a couple of environment variables, and then everything was fine. So, the next step was to load up a couple of images. In Rust I usually use the image crate for this, but tch offers a tch::vision::image module, so I gave that a shot. Hard-coding filesystem paths and creating tensors by hand is fine for this experiment; each API has its own way of doing things, but I was able to put something together using tensor operations. I made a commit and got ready to build a simple neural network, following the tch-rs README (but adding in batching via tch::data::Iter2).
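
For future reference, the gist of that smoke test looked roughly like this. The paths are hard-coded placeholders, and exact API details may differ a bit across tch versions:

```rust
use tch::{Device, Kind, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The check that sent me off to install libtorch and set environment variables.
    println!("CUDA available: {}", tch::Cuda::is_available());
    let device = Device::cuda_if_available();

    // tch::vision::image::load gives a channels-first uint8 tensor; convert to
    // float in [0, 1] and stack the two frames along the channel dimension.
    let a = tch::vision::image::load("frames/frame_0630.png")?;
    let b = tch::vision::image::load("frames/frame_0637.png")?;
    let pair = (Tensor::cat(&[a, b], 0).to_kind(Kind::Float) / 255.0).to_device(device);
    println!("input tensor size: {:?}", pair.size());
    Ok(())
}
```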

Throwing bits and pieces together, I had to start thinking about the optimizer, loss functions, and so on. Since I was using batching, I also started bumping into tensor shape mismatch errors. I really don’t like that tensor size issues have to be debugged at runtime, at least with tch-rs. I eventually found that the sample code (an MNIST classifier) assumed images were already one-dimensional tensors, while the images I was loading were channel-by-width-by-height tensors. A bit of hacking around with view calls got me to something that ran end to end, but I knew I would need to replace the simple linear model with something convolutional. Still, it was worthwhile to have something working so that I could swap out individual pieces, rather than trying to make several complex things work together from the start.
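
The fix was essentially just a reshape in front of the linear layer, along these lines. The dimensions in the comment are an assumption for illustration (a pair of full-color NES frames stacked on the channel axis):

```rust
use tch::Tensor;

/// Flatten everything after the batch dimension so the linear layer gets
/// (N, features). For example, a batch of frame pairs shaped (N, 6, 240, 256)
/// becomes (N, 368640). The exact sizes are placeholders for my setup.
fn flatten_batch(batch: &Tensor) -> Tensor {
    let n = batch.size()[0];
    batch.view([n, -1])
}
```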

One issue in particular took some digging to figure out: I had set up an iterator for image batching, ran through it once and used it up, then tried to use it again in other places (where it ran out of items immediately). I noticed it when I tried to debug my per-batch losses and found that those debug calls weren’t happening at all. Re-creating the iterator each epoch did the trick, and I got something that overfit and learned those four scroll changes after 100 epochs (interestingly, the error didn’t always settle at 0.0, which makes me think something is off in my optimizer setup, but adding more epochs got it there).
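
The shape of the working training loop ends up close to the tch-rs README’s MNIST example, with the important detail being that the Iter2 gets rebuilt inside the epoch loop. This is only a sketch: the layer sizes, batch size, optimizer choice, and learning rate are placeholders, and xs/ys are assumed to be the flattened image pairs and scroll targets from above:

```rust
use tch::{data::Iter2, nn::{self, Module, OptimizerConfig}, Device, Tensor};

fn train(xs: &Tensor, ys: &Tensor, device: Device) -> Result<(), tch::TchError> {
    let vs = nn::VarStore::new(device);
    // Placeholder model: one hidden layer, matching the overfit experiment.
    let net = nn::seq()
        .add(nn::linear(vs.root() / "hidden", 368640, 256, Default::default()))
        .add_fn(|t| t.relu())
        .add(nn::linear(vs.root() / "out", 256, 1, Default::default()));
    let mut opt = nn::Adam::default().build(&vs, 1e-3)?;

    for epoch in 0..100 {
        // The bug: building this iterator once, outside the loop, meant it was
        // exhausted after the first epoch. Re-creating it each epoch fixed that.
        let mut total_loss = 0.0;
        for (bx, by) in Iter2::new(xs, ys, 2).shuffle().to_device(device) {
            let loss = net.forward(&bx).mse_loss(&by, tch::Reduction::Mean);
            opt.backward_step(&loss);
            total_loss += loss.double_value(&[]);
        }
        println!("epoch {epoch}: total loss {total_loss}");
    }
    Ok(())
}
```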

Tomorrow, I’ll see if this very simple model (an MLP with just one hidden layer of 256 nodes) can handle the whole Mario 3 input sequence I produced (again, after throwing out the awkward scrolling data I mentioned earlier). I don’t have a lot of confidence in this estimate, but it looks like putting a pair of images on the GPU and making a scrolling guess on them takes around 1 millisecond with this model, which is basically practical. I could probably save a good chunk of that time if the images were already on the GPU (that looks like it gets us to about a tenth of a millisecond), if I only used grayscale images, or if I used lower-precision floats, but it doesn’t make sense to think much more deeply about efficiency yet since the architecture is nowhere near final. One millisecond is probably fine.
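
For the record, that timing estimate comes from a crude measurement along these lines; the input shape and model are the same placeholders as before, so treat the exact numbers loosely:

```rust
use std::time::Instant;
use tch::{nn::Module, Device, Kind, Tensor};

/// Rough inference timing sketch: copy a frame pair to the GPU, flatten it,
/// and run the tiny MLP. `pair_cpu` is assumed to be a (6, 240, 256) uint8
/// tensor on the host and `net` whatever model the VarStore built.
fn time_one_guess(net: &impl Module, pair_cpu: &Tensor, device: Device) -> f64 {
    let start = Instant::now();
    // The host-to-device copy is a big chunk of the cost; normalize and flatten as in training.
    let input = (pair_cpu.to_device(device).to_kind(Kind::Float) / 255.0).view([1, -1]);
    let guess = tch::no_grad(|| net.forward(&input));
    // Reading the scalar back forces the GPU work to finish before stopping the clock.
    let scroll = guess.double_value(&[0, 0]);
    let elapsed_ms = start.elapsed().as_secs_f64() * 1000.0;
    println!("scroll guess: {scroll:.1} px in {elapsed_ms:.3} ms");
    elapsed_ms
}
```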