Life Outside of Work

I haven’t gotten much work done since last time. Last week (and the week before) my family all fell like dominoes to sinus infections and colds. That meant other tasks were more important than research work: caring for family members, caring for myself, and even playing Splatoon all took priority. When life isn’t normal, you can’t pretend that it is.

A little distance helped me in other ways, too. For example, I was able to complete some refactoring on the weekend that I had been meaning to do for a while, which substantially reduced the complexity of the interactive Mappy player and its debug outputs. I realized that I should be using 3D convolutions in the scroll detection application and I remembered that there’s a Rust Jupyter kernel (evcxr) that I could use to prototype and visualize my work there. I also got a chance to think about the accessibility work beyond manually tagging individual sprites and tiles with affordances.

Taken together, slowing down for a week meant that I could use my time more efficiently when I was well enough to work again. Today I have recommendation letters and search committee stuff to do, but I am confident that I can catch up on that and push the accessibility work to a Foundations of Digital Games submission within a few weeks.

A few false starts, 2022-10-03

I don’t have much of an intuition about convolutional neural nets; I think that in this application single-pixel features are really important, so I have avoided pooling so far. Still, I thought it would be worth giving a more traditional convolve->pool->convolve->pool architecture a try.
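
Concretely, the stack I tried looked roughly like the sketch below in tch-rs. This is illustrative rather than the exact code: it assumes 6-channel inputs (two stacked RGB frames) at 240×256, and the channel counts, kernel sizes, and flattened dimension are placeholders.

```rust
use tch::nn;

// Sketch of a conv->pool->conv->pool stack over two stacked RGB frames.
fn conv_pool_net(vs: &nn::Path) -> impl nn::Module {
    nn::seq()
        .add(nn::conv2d(vs / "c1", 6, 16, 3, nn::ConvConfig { padding: 1, ..Default::default() }))
        .add_fn(|xs| xs.relu().max_pool2d_default(2)) // 240x256 -> 120x128
        .add(nn::conv2d(vs / "c2", 16, 32, 3, nn::ConvConfig { padding: 1, ..Default::default() }))
        .add_fn(|xs| xs.relu().max_pool2d_default(2)) // 120x128 -> 60x64
        .add_fn(|xs| xs.flatten(1, -1))
        .add(nn::linear(vs / "out", 32 * 60 * 64, 2, Default::default())) // (dx, dy)
}
```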

I didn’t get any better convergence with that, so I started to suspect my dataset was faulty. In fact, I had an off-by-one error in my CSV loader that treated the first row of data as a header row. Classic mistake! So the learned relations were basically garbage. After fixing that, I tried to train the models on just the last ten or so samples. This overfit well enough, so I ran all three models again on half of the dataset.
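
For reference, the loader fix amounts to telling the CSV reader that the dump has no header row. Here is a sketch assuming the csv and anyhow crates; my actual loader differs in the details, and the column layout shown is illustrative.

```rust
use csv::ReaderBuilder;

// The scroll dump has no header row, so the reader must be told not to
// swallow the first record as one (the off-by-one described above).
fn load_scroll_csv(path: &str) -> anyhow::Result<Vec<(i64, i64)>> {
    let mut rdr = ReaderBuilder::new().has_headers(false).from_path(path)?;
    let mut scrolls = Vec::new();
    for record in rdr.records() {
        let record = record?;
        // illustrative column layout: frame index, dx, dy
        let dx: i64 = record[1].parse()?;
        let dy: i64 = record[2].parse()?;
        scrolls.push((dx, dy));
    }
    Ok(scrolls)
}
```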

At this point I had tried a lot of different things, so I took some time to watch the models’ behavior more closely, and noticed something interesting. The simple MLP was hitting a plateau (even if I added more parameters), while the single-convolutional model began to do much better in terms of error, although it had a strong tendency to predict negative scrolling in my example case. When I removed the batch normalization layer from that model, it began to act more in line with the simple MLP. Since all three models (even the double-convolutional one!) were now converging to the same loss (and predicting a scroll of (1,0) for my (13,0) test example), I got suspicious again. Why were they all ending up at the same place, learning essentially the same function? Why did they get stuck, and why was the result so wrong with respect to my example case?

I looked at the biases in the last layer and indeed they were just (1,0); I guess predicting (1,0) for everything gave a pretty decent overall error on this data set. It turned out the culprit was my loss function, which I had set to L1 loss (mean absolute error). For this problem, mean squared error gave much better results.
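
The change itself is essentially a one-liner with tch’s tensor ops; my call sites differ, but this is the idea:

```rust
use tch::{Reduction, Tensor};

fn scroll_loss(pred: &Tensor, target: &Tensor) -> Tensor {
    // previously: pred.l1_loss(target, Reduction::Mean)
    pred.mse_loss(target, Reduction::Mean)
}
```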

In both of these quirky problems, walking away from the machine and getting some air helped me in ways that continually trying more permutations of the input hyperparameters just wouldn’t have. I often tell students that they should take a break from a problem if they are spending too long on it, or feeling stuck and trying things randomly. It’s hard to remember that advice in my own work sometimes!

At any rate, batch normalization seemed like it was worth adding to the two-layer convolutional model, so I threw it in there. Now, both the single and double convolutional models were fitting pretty well to a subset of the data and giving good results for the test case. When I expanded the data to include the full 170 examples from a single play of Mario 3 1-1, I saw much faster convergence for the double-convolutional model although both the single- and double-convolutional models continued to improve with more training time (up to 4000 epochs, where I arbitrarily cut off the training).
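
For reference, here is roughly where the batch normalization slots in, sketched with placeholder names and channel counts rather than the actual code. Batch norm behaves differently at train and eval time, so in tch the block becomes a ModuleT and gets run through forward_t with a train flag.

```rust
use tch::nn;

// One conv block of the double-convolutional model with batch normalization
// added between the convolution and the ReLU (channel counts are placeholders).
fn conv_block_with_bn(vs: &nn::Path, c_in: i64, c_out: i64) -> impl nn::ModuleT {
    nn::seq_t()
        .add(nn::conv2d(vs / "conv", c_in, c_out, 3, Default::default()))
        .add(nn::batch_norm2d(vs / "bn", c_out, Default::default()))
        .add_fn(|xs| xs.relu())
}
```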

My next step is to record two longer playthroughs of the same Super Mario Bros. 3 level so I can have larger, separate training and test sets. Once a model is trained effectively on that dataset, I’d like to exercise these models on traces from more Mario 3 levels, then on traces from more games!

I also noticed that one CPU was pegged during training, so I want to run a profiler on the code to see if there’s any bottleneck there. The GPU is saturated the whole time so I don’t think it’s slowing down training, but I’m curious as to what’s happening on the CPU.

More Adventures in Over-Fitting, 2022-09-30

Today I decided to jump right to using a simple convolutional model before getting a larger data-set, just to compare it with the MLP I used last time. It uses one 2D convolution that works on all 6 channels of the two input frames, followed by ReLU activation, a fully-connected layer, ReLU activation again, and another fully-connected layer. This is a bigger model that takes more time to converge, and it doesn’t perform any better on the tiny data set. This should come as no surprise! All it’s really doing is putting a bunch of filters on the front of our simple network from before.
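
Here is a rough sketch of that model in tch-rs, assuming 6-channel inputs (two stacked RGB frames) at the NES’s 256×240 resolution; the channel count, kernel size, and hidden width below are illustrative rather than the exact values I used.

```rust
use tch::nn;

// One 2D convolution over all 6 channels of the stacked frame pair,
// then ReLU -> fully connected -> ReLU -> fully connected.
fn single_conv_net(vs: &nn::Path) -> impl nn::Module {
    nn::seq()
        .add(nn::conv2d(vs / "conv", 6, 16, 3, nn::ConvConfig { padding: 1, ..Default::default() }))
        .add_fn(|xs| xs.relu().flatten(1, -1))
        .add(nn::linear(vs / "fc1", 16 * 240 * 256, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(vs / "fc2", 256, 2, Default::default())) // (dx, dy) scroll prediction
}
```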

There was nowhere else to go at this point except getting more data. For this experiment, just using a short Mario 3 trace was enough to see if the model could learn on a single level of a single game (about 130 scrolling examples). This meant writing a quick data set loader and getting the data into the right shape.

At this point, the MLP’s loss plateaued and wouldn’t go down any further, so I added another fully-connected layer just to see what would happen. Playing with the number of hidden nodes, the batch size, and the learning rate didn’t help much. Maybe using a patch encoding would do better (perhaps by first applying a separate fully connected layer on each 16×16 patch of the input image, and then merging those together at the input to the network), but I wanted to see if a very simple convolutional network could do better at generalizing. I also ended up converting my images to grayscale to save a little GPU memory.

Reader, after fiddling with some hyperparameters and a couple of thousand epochs, it kind of almost did—it reported correct horizontal scrolling, but not vertical (odd, since vertical scrolling is always 0 in these examples). My next step is to get the convolutional model to really effectively overfit this small (130-example) dataset.

The Simplest Possible Model, 2022-09-29

After the last journal entry I realized it would be a better move to collect a little bit of data (just a few frames) from just one game, then straightaway come up with a model that could perfectly fit it and measure its performance on inference. This would test the whole pipeline, from collection through to inference, and give a lower bound on its wall-clock time (which is important because I hope this can be used in an interactive setting). So, I’ll use “Super Mario Bros. 3” as my tiny example, since it scrolls both left and right.

First, I had to locate two frames that had some scrolling offset. I chose frames 630 and 637 of my test run, the 90th and 91st samples from the CSV file. According to line 91 of the data file, the scrolling from 630 to 637 was 13 pixels; I double-checked that by eye and it seemed correct. I thought it would be good to have four cases in my training data for this super-overfit, extremely simplistic model: one with positive scrolling, one with negative scrolling, and two that were stationary. I could use 630->637, 637->630, 630->630, and 637->637, respectively.

For implementing and training the neural network, I was torn between Python (which is convenient but error-prone and unpleasant) and Rust using something like Torch bindings. I’ve never used the Rust Torch bindings before, and they look pretty nice, so I’m going to give them a shot. After a bit of cargo new and cargo add (and making sure I had torch and CUDA installed properly, itself always a half-hour adventure) the project is ready.

My first goal was just to load up those two images and create the data set in code. After I had a model that could overfit it, I would load the data using a more realistic pipeline (but limit it to just a few frames using a small slice of the dataset). Then I would load a complete trace, avoiding artifacts due to things like scene transitions (where the NES hardware would scroll a whole screen over at once) by filtering out scroll changes larger than some threshold. That’s a plan!

I copied over tch-rs’s hello world example and modified it to print whether CUDA was supported. It wasn’t! I had to download libtorch from the torch website and set a couple of environment variables, and then everything was fine. So, the next step was to load up a couple of images—in Rust I usually use the image crate for this, but tch offers a tch::vision::image module so I gave that a shot. Hard coding filesystem paths and creating tensors by hand is fine; each API has its own way of doing things but I was able to put together something using tensor operations. I made a commit and got ready to build a simple neural network, following the tch-rs README (but adding in batching via tch::data::Iter2).
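
The result looked roughly like the sketch below; the file paths are hypothetical and I’m assuming the anyhow crate for error handling. It checks whether libtorch sees CUDA, then loads two frames with tch::vision::image and stacks them into a single 6-channel tensor.

```rust
use tch::{vision::image, Device, Kind, Tensor};

fn main() -> anyhow::Result<()> {
    println!("CUDA available: {}", tch::Cuda::is_available());
    let device = Device::cuda_if_available();

    // image::load returns a channels x height x width u8 tensor.
    let a = image::load("data/frame_0630.png")?.to_kind(Kind::Float) / 255.0;
    let b = image::load("data/frame_0637.png")?.to_kind(Kind::Float) / 255.0;
    let pair = Tensor::cat(&[a, b], 0).to_device(device); // 6 x H x W
    println!("stacked input shape: {:?}", pair.size());
    Ok(())
}
```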

Throwing bits and pieces together, I had to start thinking about the optimizer, loss functions, and so on. Since I was using batching, I also started bumping into tensor shape mismatch errors. I really don’t like that tensor size issues have to be debugged at runtime—at least with tch-rs. I eventually found that the sample code (an mnist classifier) was assuming that images were already one-dimensional tensors, while the images I was loading were channel-by-width-by-height tensors. A bit of hacking around with view calls got me to something that ran end-to-end, but I knew I would need to replace the simple linear model with something convolutional. Still, it was worthwhile to have something working so that I could swap out individual pieces, rather than making several complex things work together from the start.
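
The fix amounted to flattening each loaded image batch before it reaches the linear layers; roughly (names are illustrative):

```rust
use tch::Tensor;

// The sample MLP expects one flat vector per example, while loaded frames
// arrive as channel x height x width, so each batch gets flattened first.
fn flatten_batch(images: &Tensor) -> Tensor {
    // images: [batch, channels, height, width] -> [batch, channels * height * width]
    let (b, c, h, w) = images.size4().expect("expected a 4D batch of images");
    images.view([b, c * h * w])
}
```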

One issue in particular took some digging to figure out: I had set up an iterator for image batching that I ran through once and used up, but then tried to use again in other places (and it ran out of items immediately). I noticed it when I tried to debug my per-batch losses and found that those debug calls weren’t happening at all. Re-creating the iterator each epoch did the trick, and I got something that overfit and learned those four scroll changes after 100 epochs (interestingly, the error didn’t always stabilize at 0.0, which makes me think something odd is going on in my optimizer; adding more epochs seemed to sort it out).
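
The fix, sketched with placeholder names: rebuild the Iter2 at the top of each epoch instead of creating it once up front.

```rust
use tch::{data::Iter2, Tensor};

// Iter2 is consumed as it is iterated, so it gets rebuilt (and reshuffled)
// every epoch rather than reused after it has run dry.
fn train(xs: &Tensor, ys: &Tensor, epochs: usize, batch_size: i64) {
    for _epoch in 0..epochs {
        let mut batches = Iter2::new(xs, ys, batch_size);
        for (bx, by) in batches.shuffle() {
            // ...forward pass, loss, optimizer step, per-batch loss logging...
            let _ = (&bx, &by);
        }
    }
}
```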

Tomorrow, I’ll see if this very simple model (an MLP with just one hidden layer and 256 hidden nodes) can handle the whole Mario 3 input sequence I produced (again, minus throwing out awkward scrolling data as I mentioned earlier). I don’t have a lot of confidence in this estimate, but it looks like putting a pair of images on the GPU and making a scrolling guess on them takes around 1 millisecond with this model, which is basically practical. I could probably save a good chunk of that time if the images were already on the GPU (looks like we get to one tenth of a millisecond that way), if I only used grayscale images, or if I used lower-precision floats, but it doesn’t make sense to think too much more deeply about efficiency yet since the architecture is nowhere near final. One millisecond is probably fine.

Getting Ready to Build a Dataset, 2022-09-28

I decided a good starting point would be to revisit the detection of camera movement in games, since it should help with object tracking to know what’s static terrain and what’s dynamic objects. I opened up the source code of the hacked Mappy interactive player Chloe was using to dump scrolling/camera movement data, and I thought about whether dumping scrolling data, game images, et cetera should be added to the standard Mappy player or kept separate as it is now. I decided to roll the data collection code into the interactive and batch players, since I didn’t want to have separate interactive and batch scroll data dumping players.

This meant taking the ad hoc code that dumped game images and scrolling data (previously something like a 20-line diff patched in at various places in the interactive player) and creating a ScrollDumper struct which offers initialization (making data directories), update (recording the data), and finalize (writing the data to disk) functions. Now I can use this same struct in the batch player when I’m ready to move it over, or potentially lift it into Mappy proper; it also means I could generalize to e.g. performance stat dumping, sprite data dumping, et cetera. For the time being, it just outputs screenshots, a CSV with scrolling data (relative movement frame to frame), and the sequence of inputs used to produce them.
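
For a rough idea of its shape (field and method names here are illustrative, not the exact code):

```rust
use std::path::PathBuf;

// Sketch of the ScrollDumper described above: init makes directories,
// update records per-frame data, finish writes everything to disk.
struct ScrollDumper {
    out_dir: PathBuf,
    scrolls: Vec<(i32, i32)>, // per-frame relative scroll (dx, dy)
    inputs: Vec<u8>,          // controller inputs used to produce the trace
    frame: u64,
}

impl ScrollDumper {
    /// Create the data directories for screenshots and CSV output.
    fn new(out_dir: PathBuf) -> std::io::Result<Self> {
        std::fs::create_dir_all(out_dir.join("screenshots"))?;
        Ok(Self { out_dir, scrolls: Vec::new(), inputs: Vec::new(), frame: 0 })
    }

    /// Record this frame's scroll delta, input byte, and screenshot.
    fn update(&mut self, scroll: (i32, i32), input: u8) {
        self.scrolls.push(scroll);
        self.inputs.push(input);
        // ...PNG-encode the current framebuffer into screenshots/{frame}.png...
        self.frame += 1;
    }

    /// Write the accumulated scroll CSV and input sequence to disk.
    fn finish(self) -> std::io::Result<()> {
        let mut csv = String::new();
        for (i, (dx, dy)) in self.scrolls.iter().enumerate() {
            csv.push_str(&format!("{i},{dx},{dy}\n"));
        }
        std::fs::write(self.out_dir.join("scroll.csv"), csv)?;
        std::fs::write(self.out_dir.join("inputs.bin"), &self.inputs)?;
        Ok(())
    }
}
```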

In the code, it’s stored as an Option so that collection of this data can easily be turned on and off. In the future it’s possible it could implement some data-collection trait and be put into a Vec<Box<dyn ...>> or something. But I don’t need that yet.

In the process of doing this, I changed the screenshot dumping code to use the existing framebuffer structure (and encode it to PNG) rather than making a copy of the framebuffer. With that bit of programming done, I tried out the interactive player in data-dumping mode on “Super Mario Bros.”, and verified by hand that it was outputting reasonable scrolling data.

The data look okay, so the next step is to state the problem specifically: Given a series of screenshots (let’s say two screenshots, and grayscale is fine), we want to describe the camera movement from one to the other as two numbers (and ideally get a bounding box for the part of the image that actually scrolls, but we can leave that for later). One complicating factor is that games usually run at 30 or 60 frames per second, but it’s quite costly to collect and use data at that framerate; moreover, actually doing inference isn’t free either and needing to do it 30 times a second—along with all the other stuff we want to do every frame in Mappy—seems untenable. So we’ll aim for the neighborhood of 10 frames per second.
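
Stated as a (hypothetical) Rust signature, the target interface is something like:

```rust
use tch::Tensor;

// Two grayscale frames in, an estimated camera movement in pixels out.
// ScrollEstimator is a placeholder name for whichever model ends up working.
trait ScrollEstimator {
    /// prev, cur: [1, H, W] grayscale frames; returns (dx, dy) in pixels.
    fn estimate_scroll(&self, prev: &Tensor, cur: &Tensor) -> (f64, f64);
}
```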

If we could really work at that high framerate, we could assume that individual camera movements are very small (plus or minus four pixels per frame feels reasonable; usually on the NES it will be something like a repeating pattern of 2 pixels, 2 pixels, 3 pixels for an average of 2.33 pixels per frame). But we can’t handle samples that frequently, so the nice discrete classification problem (which of these 8 values is the horizontal camera movement?) turns into a fuzzier regression problem. Using more frames seems like it could help recover more accurate data, but for now let’s see how far we get by treating it like a regression problem (say, between negative 32 and 32 pixels of movement).

Since individual pixel shifts are really important here, I don’t think it makes sense to do the stacks of convolutions and pooling that are typical for computer vision problems. Or at least, if that is the move, it needs to be supplemented with skip connections so that the original image can feed through to the later stages. I read about an architecture called MLP-Mixer in Tolstikhin et al.’s “MLP-Mixer: An all-MLP Architecture for Vision”, and I guess this is a special case of something called a Vision Transformer. These approaches break up an image into non-overlapping patches and use an MLP to do feature extraction on each patch individually (or on some linear projection of those patches) before assembling these patches using some mechanism (they also use skip-connections to push original image data through to the later stages).

I will look into these kinds of techniques more deeply if something much simpler based on 3D convolutions (previous-frame and current-frame) followed by a multi-layer perceptron (perhaps with the time interval between the frames fed in) doesn’t do the trick. This problem feels simpler than classification and I don’t see why it should need so many parameters and so much training data. My next step is to collect data on some games—let’s say “Kid Icarus”, “Super Mario Bros. 3”, “Batman”, “The Legend of Zelda”, “The Guardian Legend”, and “Dragon Quest IV”. It would be great if I could slam images from a number of games into one model and have it learn to see scrolling in other, not-yet-seen games. The biggest win would be if I could get scrolling data from one or two specific Super Nintendo games (e.g., “Super Mario World” or “The Legend of Zelda: A Link to the Past” or “Super Metroid”) and see if the model could generalize.

All that sounds cool, but to publish a paper on it I’d probably also have to show that this helps with some other problem like background subtraction or object detection. I think I can see how this would fit together—if we knew the camera movement we could see when objects are moving in ways that differ from the camera, and those would be our foreground. There are other applications too including localization and mapping, and I think that should be enough to get something publishable.

Overview, 2022-09-26

The key open questions in moving Mappy away from depending on instrumented NES emulators are:

  1. What representation should be used for space? Tiles and pixels have some issues, objects are good but very under-specified as a notion.
  2. Can we avoid the heavy instrumentation for camera movement detection? It’s very NES-specific as well as one of the few things (the only thing?) we need a custom emulator core for.
  3. Can we cleanly separate layers by foreground/background, or segment by objects, so that we don’t need sprite tracking or grid identification/tile alignment?

I have wanted to use Mappy to generate datasets for machine learning for a while now. Chloe Sun ’23 did some work on scroll-detection last year or the year before (using scroll data from Mappy), but we didn’t have much awareness of computer vision ML architectures, so it was fragile. Chanha Kim ’22 and Jaden Kim ’22 did some good work applying YOLO to game datasets using synthetic data (not Mappy-generated tags, because Mappy at the time didn’t do blobbing). So I am feeling pretty strongly that doing any of (1), (2), or (3) depends on having a good pipeline for collecting data about game play.

I read two cool papers recently that got me thinking about this area again. Harley et al. had an interesting one called “Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories”, which takes a series of image frames and a set of pixels in the initial frame to track, and identifies the movements of those particles over time. It’s neat that it can track specific target pixels, and it’s neat that it tracks through occlusions as well.

This seems promising for something like the sprite tracking that Mappy needs to do, but it doesn’t solve everything. First, it can’t distinguish camera from object movement because it’s a low-level technique, at the level of tracking optical flow: for featureless Super Mario backgrounds, I think it would see platforms/terrain as moving left through space as Mario moves rightwards through the level, and if Mario and the camera move at the same speed it would think Mario is stationary. So there would be some work to back out “scene movement” from “object movement”, which depends on knowing what’s static terrain and what’s foreground (a scrolling model like the one Chloe and I tried to figure out would be helpful here).

The second thing to think about is that particles are not objects—a grid of particles could be used to approximate a set of objects (e.g., groups of particles that move together are probably the same object), but that’s an extra set of steps. And grids aren’t always what we want; they’re great for Mario but less good for Super Metroid. Still, it seems like a great foundational technique. I only saw one or two performance numbers in the paper (I guess it depends on your GPU, resolution, and everything else), but if it were fast enough for interactive use that would be really cool; I doubt it though, since it wants 200ms for 8 frames at 480×1024 resolution, and I would need it at least one and hopefully two orders of magnitude faster. Another thought: I could borrow the 8-frame tracking idea that leverages both appearance similarity and a movement prior to help make Mappy’s existing sprite tracking more robust (it doesn’t really use a movement prior or anything if I recall correctly).

The other paper that I’ve been wanting to read for a while is “MarioNette: Self-Supervised Sprite Learning” by Smirnov et al. Again, it has some cool tricks that I would like to borrow although I don’t think I can use it wholesale. The coolest tricks to me are the idea of learning a sprite dictionary in a self-supervised fashion, and the idea of training a model to learn a scene representation from which the original scene is reconstructed. This scene representation idea breaks a mental block I had since my previous attempts in this area were focused on pixel representations. I’m not sure their representation is exactly what I would want to use, but there is a lot to recommend it. It’s kind of complicated, but a brief summary is that a screen layer is a coarse grid of points, and each point may or may not have a sprite connected to it at some (x,y) offset. Which sprite it is depends on how well the point’s local features match those of any witnessed sprite in the dictionary.

I don’t love that this has a fixed sprite size hyperparameter, and that it depends on either learning one big static background image (and each frame is registered on some offset of that image) or on being provided a fixed background color. But could it be combined with some oracle for background detection? Such a thing could be trained on an emulator that renders the background separately from non-background colors, but the NES PPU’s abstraction of “background” is leaky: it’s purely visual, not semantic (the problem is even worse on the SNES with its many layers).

So I read these papers and thought about how they fit into my goals and plans, what I could borrow from them, and how I could extend them. For now, I think I want to focus on using Mappy (or at least my instrumented emulators) to create a high-quality dataset of game objects, terrains, camera movement, and so on within single rooms (the multi-room mapping registration is currently a bit buggy and I need to investigate it more). I also want to try applying these two papers’ models to the game play data I have, though I don’t think a pre-trained version of the MarioNette model or even its dataset is available (or at least I haven’t found either).