Gearing up for paper-writing

The last couple of days have been about prototyping user-interface and rendering techniques for improving the accessibility of old video games. I wanted to see if the basics—drawing things in different colors, tweaking contrast and saturation, labeling affordances, et cetera—were achievable in reasonable time, and indeed they are. I want to get some screenshots together for a demo post tomorrow, but for today my goal was to decide whether it made sense to outline a research paper, and I’m pretty sure it does. The challenge for an FDG paper will be showing that it’s a significant enough improvement or new application over e.g. Batu’s work, and I think I can get there by showing generality and the fact that some labels (and potentially more in the future) are automatically generated.

To guide the paper, I want to showcase a problem and my contributions. The problem is that many videogames are inaccessible to people with vision impairments; while new games can be made to support larger groups of players, this depends on the cooperation of the game developer. For other games, we want to find ways to increase their accessibility after the fact. For the large class of NES games, I have developed a technological approach and a tool for efficiently labeling their graphics with key affordances during play; these labels are then used to modulate the display to make it more accessible to players with different visual needs. Some of these tags are inferred automatically from play while others are given explicitly by the user. As a side effect, this also produces a rich dataset of game images tagged with fine-grained affordance data, which will be useful for bootstrapping this work to unseen games or onto new, non-NES platforms in a machine learning regime. I have a natural tendency to want to make this bigger—add in motor assists and point-to-move, add in visual cues for sound effects, learn all tags automatically—but I will try to keep it short this time for a change.

I’ll need to double-check whether recent FDG reviewers have accepted such “tools papers”—they did publish Eric Kaltman’s and my GISST work a few years back, so I think it’s plausible—or whether I need to put more effort into showing that it indeed improves accessibility, to create a set of accessible games, to generate affordance tags automatically, etc.

Life Outside of Work

I haven’t gotten much work done since last time. Last week (and the week before) my family all fell like dominoes to sinus infections and colds. That meant other tasks were more important than research: caring for family members, caring for myself, even playing Splatoon all took priority. When life isn’t normal, you can’t pretend that it is.

A little distance helped me in other ways, too. For example, I was able to complete some refactoring on the weekend that I had been meaning to do for a while, which substantially reduced the complexity of the interactive Mappy player and its debug outputs. I realized that I should be using 3D convolutions in the scroll detection application and I remembered that there’s a Rust Jupyter kernel (evcxr) that I could use to prototype and visualize my work there. I also got a chance to think about the accessibility work beyond manually tagging individual sprites and tiles with affordances.

Taken together, slowing down for a week meant that I could use my time more efficiently when I was well enough to work again. Today I have recommendation letters and search committee stuff to do, but I am confident that I can catch up on that and push the accessibility work to a Foundations of Digital Games submission within a few weeks.

A few false starts, 2022-10-03

I don’t have much of an intuition about convolutional neural nets—I think that in this application, single pixel features are really important so I have avoided pooling so far. Still, I thought it would be worth giving a more traditional convolve->pool->convolve->pool architecture a try.

I didn’t get any better convergence with that, so I started to suspect my dataset was faulty. In fact, I had an off-by-one error in my CSV loader that treated the first row of data as a header row. Classic mistake! So the learned relations were basically garbage. After fixing that, I tried to train the models on just the last ten or so samples. This overfit well enough, so I ran all three models again on half of the dataset.
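
For reference, the fix amounted to one flag on the csv crate’s reader builder. Here is a minimal sketch of the corrected loader, assuming a (frame, dx, dy) column layout, which may not match the real dump format:

```rust
use std::error::Error;

// Load every row of the scroll CSV; the (frame, dx, dy) column layout is an assumption.
fn load_scroll_csv(path: &str) -> Result<Vec<(i64, f32, f32)>, Box<dyn Error>> {
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false) // the bug: the default of `true` silently ate the first sample
        .from_path(path)?;
    let mut rows = Vec::new();
    for record in reader.records() {
        let record = record?;
        rows.push((record[0].parse()?, record[1].parse()?, record[2].parse()?));
    }
    Ok(rows)
}
```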

At this point I had tried a lot of different things, so I took some time to notice something interesting with the behavior of the models. The simple MLP was hitting a plateau (even if I added more parameters), while the single-convolutional model began to do much better in terms of error, although it had a strong tendency to predict negative scrolling in my example case. When I removed the batch normalization layer from that model it began to act more in line with the simple MLP. Since all three models (even the double convolutional one!) were now converging to the same loss (and predicting a scroll of (1,0) for my (13,0) test example), I got suspicious again. Why were they all ending up at the same place, learning essentially the same function? Why did they get stuck and why was the result so wrong with respect to my example case? I looked at the biases in the last layer and indeed they were just (1,0); I guess predicting (1,0) for everything gave a pretty decent overall error in this data set. It turned out the culprit here was my loss function, which I had set to L1 loss (mean absolute error). For this problem, mean squared error gave much better results.
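
For concreteness, here is a minimal sketch of the loss swap in tch-rs; the helper name and the [batch, 2] prediction/target shapes are my assumptions:

```rust
use tch::{Kind, Reduction, Tensor};

// `preds` and `targets` are both [batch, 2] tensors of (dx, dy).
fn scroll_loss(preds: &Tensor, targets: &Tensor) -> Tensor {
    // What I had: mean absolute error, under which "always predict (1, 0)"
    // scores deceptively well on this dataset.
    let _l1 = (preds - targets).abs().mean(Kind::Float);
    // What worked better: mean squared error, which punishes the occasional
    // big miss (like the 13-pixel scroll) much harder.
    preds.mse_loss(targets, Reduction::Mean)
}
```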

In both of these quirky problems, walking away from the machine and getting some air helped me in ways that continually trying more permutations of the input hyperparameters just wouldn’t have. I often tell students that they should take a break from a problem if they are spending too long on it, or feeling stuck and trying things randomly. It’s hard to remember that advice in my own work sometimes!

At any rate, batch normalization seemed like it was worth adding to the two-layer convolutional model, so I threw it in there. Now, both the single and double convolutional models were fitting pretty well to a subset of the data and giving good results for the test case. When I expanded the data to include the full 170 examples from a single play of Mario 3 1-1, I saw much faster convergence for the double-convolutional model although both the single- and double-convolutional models continued to improve with more training time (up to 4000 epochs, where I arbitrarily cut off the training).
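
Roughly, the model now looks like the sketch below; channel counts, kernel sizes, and the grayscale two-frame input stacking are my guesses rather than the exact configuration. Because batch normalization behaves differently at train and eval time, it goes through tch-rs’s seq_t path rather than plain seq:

```rust
use tch::nn::{self, ModuleT};

// Two conv layers with batch norm, then a linear readout to (dx, dy).
fn double_conv(p: &nn::Path) -> impl ModuleT {
    nn::seq_t()
        .add(nn::conv2d(p / "c1", 2, 16, 3, Default::default()))
        .add(nn::batch_norm2d(p / "bn1", 16, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::conv2d(p / "c2", 16, 32, 3, Default::default()))
        .add(nn::batch_norm2d(p / "bn2", 32, Default::default()))
        .add_fn(|xs| xs.relu())
        .add_fn(|xs| xs.flatten(1, -1))
        // 240x256 inputs shrink to 236x252 after two unpadded 3x3 convs.
        .add(nn::linear(p / "out", 32 * 236 * 252, 2, Default::default()))
}
```

Training calls then go through forward_t(&xs, true) so the batch-norm statistics only update while training.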

My next step is to record two longer playthroughs of (the same) Super Mario 3 level so I can have larger, separate test and training sets. Once a model is trained effectively on that dataset, I’d like to exercise these models on traces from more Mario 3 levels, then on traces from more games!

I also noticed that one CPU core was pegged during training, so I want to run a profiler on the code to see if there’s a bottleneck there. The GPU was saturated the whole time, so I don’t think the CPU is slowing down training, but I’m curious about what’s happening there.

More Adventures in Over-Fitting, 2022-09-30

Today I decided to jump right to using a simple convolutional model before getting a larger dataset, just to compare it with the MLP I used last time. It uses one 2D convolution that works on all 6 channels of the two input frames, followed by ReLU activation, a fully-connected layer, ReLU activation again, and another fully-connected layer. This is a bigger model that takes more time to converge, and it doesn’t perform any better on the tiny dataset. This should come as no surprise! All it’s really doing is putting a bunch of filters on the front of our simple network from before.
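
In tch-rs terms, the model is roughly the following; the filter count and kernel size are placeholders, and I’m assuming the two frames come in stacked as 6 channels of a 240×256 image:

```rust
use tch::nn::{self, Module};

// One conv over the stacked frame pair, then two fully-connected layers down to (dx, dy).
fn single_conv(p: &nn::Path) -> impl Module {
    // 8 filters with an unpadded 3x3 kernel turn [6, 240, 256] into [8, 238, 254].
    let conv_features: i64 = 8 * 238 * 254;
    nn::seq()
        .add(nn::conv2d(p / "conv", 6, 8, 3, Default::default()))
        .add_fn(|xs| xs.relu())
        .add_fn(|xs| xs.flatten(1, -1))
        .add(nn::linear(p / "fc1", conv_features, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(p / "fc2", 256, 2, Default::default()))
}
```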

There was nowhere else to go at this point except getting more data. For this experiment, just using a short Mario 3 trace was enough to see if the model could learn on a single level of a single game (about 130 scrolling examples). This meant writing a quick data set loader and getting the data into the right shape.

At this point, the MLP’s loss plateaued and wouldn’t go down any further, so I added another fully-connected layer just to see what would happen. Playing with the number of hidden nodes, the batch size, and the learning rate didn’t help much. Maybe using a patch encoding would do better (perhaps by first applying a separate fully connected layer on each 16×16 patch of the input image, and then merging those together at the input to the network), but I wanted to see if a very simple convolutional network could do better at generalizing. I also ended up converting my images to grayscale to save a little GPU memory.

Reader, after fiddling with some hyperparameters and a couple of thousand epochs, it kind of almost did—it reported correct horizontal scrolling, but not vertical (odd, since vertical scrolling is always 0 in these examples). My next step is to get the convolutional model to really effectively overfit this small (130-example) dataset.

The Simplest Possible Model, 2022-09-29

After the last journal entry I realized it would be a better move to collect a little bit of data (just a few frames) from just one game, then straightaway come up with a model that could perfectly fit it, and measure its inference performance. This would test the whole pipeline, from collection through to inference, and give a lower bound on its wall-clock time (which is important because I hope this can be used in an interactive setting). So, I’ll use “Super Mario Bros. 3” as my tiny example, since it scrolls both left and right.

First, I had to locate two frames that had some scrolling offset. I chose frames 630 and 637 of my test run, the 90th and 91st samples from the CSV file. According to line 91 of the data file, the scrolling from 630 to 637 was 13 pixels, so I double-checked that by eye and it seemed correct. I thought it would be good to have four cases in my training data for this super-overfit, extremely simplistic model: One with positive scrolling, one with negative scrolling, and two that were stationary. I could use 630->637, 637->630, 630->630 and 637->637 respectively.
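
Assembled as tensors, the tiny dataset could look something like this; the file names and channel-stacking convention are assumptions, and Tensor::from_slice is of_slice on older tch versions:

```rust
use tch::{Kind, Tensor};

fn tiny_dataset() -> Result<(Tensor, Tensor), Box<dyn std::error::Error>> {
    let f630 = tch::vision::image::load("frame_0630.png")?.to_kind(Kind::Float) / 255.0;
    let f637 = tch::vision::image::load("frame_0637.png")?.to_kind(Kind::Float) / 255.0;
    // Stack (previous, current) along the channel axis: each sample is [6, 240, 256].
    let xs = Tensor::stack(
        &[
            Tensor::cat(&[&f630, &f637], 0), // 630 -> 637: scrolled +13 pixels
            Tensor::cat(&[&f637, &f630], 0), // 637 -> 630: scrolled -13 pixels
            Tensor::cat(&[&f630, &f630], 0), // stationary
            Tensor::cat(&[&f637, &f637], 0), // stationary
        ],
        0,
    );
    // Targets are (dx, dy) per sample.
    let ys = Tensor::from_slice(&[13f32, 0., -13., 0., 0., 0., 0., 0.]).view([4, 2]);
    Ok((xs, ys))
}
```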

For implementing and training the neural network, I was torn between Python (which is convenient but error-prone and unpleasant) and Rust with its Torch bindings. I’ve never used the Rust Torch bindings (tch-rs) before, and they look pretty nice, so I’m going to give them a shot. After a bit of cargo new and cargo add (and making sure I had torch and CUDA installed properly, itself always a half-hour adventure) the project is ready.

My first goal was just to load up those two images and create the data set in code. After I had a model that could overfit it, I would load the data using a more realistic pipeline (but limit it to just a few frames using a small slice of the dataset). Then I would load a complete trace, avoiding artifacts due to things like scene transitions (where the NES hardware would scroll a whole screen over at once) by filtering out scroll changes larger than some threshold. That’s a plan!

I copied over tch-rs’s hello world example and modified it to print whether CUDA was supported. It wasn’t! I had to download libtorch from the torch website and set a couple of environment variables, and then everything was fine. So, the next step was to load up a couple of images—in Rust I usually use the image crate for this, but tch offers a tch::vision::image module so I gave that a shot. Hard-coding filesystem paths and creating tensors by hand was fine; each API has its own way of doing things, but I was able to put together something using basic tensor operations. I made a commit and got ready to build a simple neural network, following the tch-rs README (but adding in batching via tch::data::Iter2).
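
The CUDA check itself is a two-liner; if it prints false even with a CUDA build of libtorch, the LIBTORCH and LD_LIBRARY_PATH environment variables are the usual suspects:

```rust
fn main() {
    println!("CUDA available: {}", tch::Cuda::is_available());
    // Falls back to the CPU when no GPU is visible.
    println!("using device: {:?}", tch::Device::cuda_if_available());
}
```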

Throwing bits and pieces together, I had to start thinking about the optimizer, loss functions, and so on. Since I was using batching, I also started bumping into tensor shape mismatch errors. I really don’t like that tensor size issues have to be debugged at runtime—at least with tch-rs. I eventually found that the sample code (an mnist classifier) was assuming that images were already one-dimensional tensors, while the images I was loading were channel-by-width-by-height tensors. A bit of hacking around with view calls got me to something that ran end-to-end, but I knew I would need to replace the simple linear model with something convolutional. Still, it was worthwhile to have something working so that I could swap out individual pieces, rather than making several complex things work together from the start.
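
The fix was essentially one reshape before the first linear layer, assuming the two frames arrive stacked as a [batch, 6, 240, 256] tensor:

```rust
// Collapse everything but the batch dimension into one flat feature vector,
// which is the shape the linear layer from the mnist example expects.
let flat = xs.flatten(1, -1); // [batch, 6, 240, 256] -> [batch, 6 * 240 * 256]
```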

One issue in particular took some digging to figure out: I had set up an iterator for image batching that I ran through once and used up, but then tried to use again in other places (where it ran out of items immediately). I noticed it when I tried to debug my per-batch losses and found that those debug calls weren’t happening at all. Re-creating the iterator each epoch did the trick, and I got something that overfit and learned those four scroll changes after 100 epochs (interestingly, the error didn’t always stabilize at 0.0, which makes me think something odd is going on in my optimizer, but adding more epochs seemed to do the trick).
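
A minimal sketch of the corrected training loop; the learning rate, batch size, and mean-squared-error loss are assumptions here:

```rust
use tch::{data::Iter2, nn, nn::{Module, OptimizerConfig}, Reduction, Tensor};

fn train(vs: &nn::VarStore, net: &impl Module, xs: &Tensor, ys: &Tensor) -> Result<(), tch::TchError> {
    let mut opt = nn::Adam::default().build(vs, 1e-3)?;
    for epoch in 1..=100 {
        // Iter2 is consumed as it is iterated, so build a fresh one every epoch;
        // hoisting it out of the loop is exactly the silent "no batches" bug above.
        let mut batches = Iter2::new(xs, ys, 4);
        let mut total = 0.0;
        for (bx, by) in batches.shuffle().to_device(vs.device()) {
            let loss = net.forward(&bx).mse_loss(&by, Reduction::Mean);
            opt.backward_step(&loss);
            total += loss.double_value(&[]);
        }
        println!("epoch {epoch}: summed batch loss {total:.4}");
    }
    Ok(())
}
```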

Tomorrow, I’ll see if this very simple model (an MLP with just one hidden layer and 256 hidden nodes) can handle the whole Mario 3 input sequence I produced (again, minus throwing out awkward scrolling data as I mentioned earlier). I don’t have a lot of confidence in this estimate, but it looks like putting a pair of images on the GPU and making a scrolling guess on them takes around 1 millisecond with this model, which is basically practical. I could probably save a good chunk of that time if the images were already on the GPU (looks like we get to one tenth of a millisecond that way), if I only used grayscale images, or if I used lower-resolution floats, but it doesn’t make sense to think too much more deeply about efficiency yet since the architecture is nowhere near final. One millisecond is probably fine.
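
The timing itself is nothing fancy, roughly the sketch below; copying the prediction back to the CPU forces the GPU work to finish before the clock stops:

```rust
use std::time::Instant;
use tch::{nn::Module, Device, Tensor};

// `pair` is one [1, 6, 240, 256] frame-pair tensor starting out in host memory.
fn time_inference(net: &impl Module, pair: &Tensor, device: Device) {
    let start = Instant::now();
    let pred = tch::no_grad(|| net.forward(&pair.to_device(device))).to_device(Device::Cpu);
    println!("inference took {:?}", start.elapsed());
    pred.print(); // the predicted (dx, dy)
}
```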

Getting Ready to Build a Dataset, 2022-09-28

I decided a good starting point would be to revisit the detection of camera movement in games, since knowing what’s static terrain and what’s a dynamic object should help with object tracking. I opened up the source code of the hacked Mappy interactive player Chloe was using to dump scrolling/camera movement data, and I thought about whether dumping scrolling data, game images, et cetera should be added to the standard Mappy player or kept separate as it is now. I decided to roll the data-collection code into the interactive and batch players, since I didn’t want to maintain separate scroll-dumping versions of each.

This meant taking the ad hoc code that dumped game images and scrolling data (previously something like a 20-line diff patched in at various places in the interactive player) and creating a ScrollDumper struct which offers initialization (making data directories), update (recording the data), and finalize (writing the data to disk) functions. Now I can use this same struct in the batch player when I’m ready to move it over, or potentially lift it into Mappy proper; it also means I could generalize to e.g. performance stat dumping, sprite data dumping, et cetera. For the time being, it just outputs screenshots, a CSV with scrolling data (relative movement frame to frame), and the sequence of inputs used to produce them.
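
The struct is roughly the following shape; the field and method names here are illustrative, not the real Mappy code:

```rust
use std::path::PathBuf;

struct ScrollDumper {
    out_dir: PathBuf,
    scrolls: Vec<(i32, i32)>, // relative (dx, dy) per recorded sample
    inputs: Vec<u8>,          // controller state per recorded frame
    frame_count: usize,
}

impl ScrollDumper {
    // Initialization: set up the output directories.
    fn new(out_dir: PathBuf) -> std::io::Result<Self> {
        std::fs::create_dir_all(out_dir.join("screenshots"))?;
        Ok(Self { out_dir, scrolls: Vec::new(), inputs: Vec::new(), frame_count: 0 })
    }
    // Update: record one frame's worth of data.
    fn update(&mut self, scroll: (i32, i32), input: u8, _framebuffer: &[u8]) {
        self.scrolls.push(scroll);
        self.inputs.push(input);
        // ... encode the framebuffer to screenshots/{frame_count}.png here ...
        self.frame_count += 1;
    }
    // Finalize: flush the scroll CSV and input sequence to disk.
    fn finish(self) -> std::io::Result<()> {
        // ... write self.scrolls and self.inputs under self.out_dir ...
        Ok(())
    }
}
```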

In the code, the ScrollDumper is stored as an Option so that collection of this data can easily be turned on and off. In the future it could implement some data-collection trait and be put into a Vec of boxed trait objects or something, but I don’t need that yet.

In the process of doing this, I changed the screenshot dumping code to use the existing framebuffer structure (and encode it to PNG) rather than making a copy of the framebuffer. With that bit of programming done, I tried out the interactive player in data-dumping mode on “Super Mario Bros.”, and verified by hand that it was outputting reasonable scrolling data.

The data look okay, so the next step is to state the problem specifically: Given a series of screenshots (let’s say two screenshots, and grayscale is fine), we want to describe the camera movement from one to the other as two numbers (and ideally get a bounding box for the part of the image that actually scrolls, but we can leave that for later). One complicating factor is that games usually run at 30 or 60 frames per second, but it’s quite costly to collect and use data at that framerate; moreover, actually doing inference isn’t free either and needing to do it 30 times a second—along with all the other stuff we want to do every frame in Mappy—seems untenable. So we’ll aim for the neighborhood of 10 frames per second.

If we could really work at that high framerate, we could assume that individual camera movements are very small (plus or minus four pixels per frame feels reasonable; usually on the NES it will be something like a repeating pattern of 2 pixels, 2 pixels, 3 pixels for an average of 2.33 pixels per frame). But we can’t handle samples that frequently, so the nice discrete classification problem (which of these 8 values is the horizontal camera movement?) turns into a fuzzier regression problem. Using more frames seems like it could help recover more accurate data, but for now let’s see how far we get by treating it like a regression problem (say, between negative 32 and 32 pixels of movement).

Since individual pixel shifts are really important here, I don’t think it makes sense to do the stacks of convolutions and pooling that are typical for computer vision problems. Or at least, if that is the move, it needs to be supplemented with skip connections so that the original image can feed through to the later stages. I read about an architecture called MLP-Mixer in Tolstikhin et al.’s “MLP-Mixer: An all-MLP Architecture for Vision”, which I gather is closely related to the Vision Transformer family. These approaches break up an image into non-overlapping patches and use an MLP to do feature extraction on each patch individually (or on some linear projection of those patches) before assembling the patches with some mixing mechanism (they also use skip connections to push original image data through to the later stages).

I will look into these kinds of techniques more deeply if something much simpler based on 3D convolutions (previous-frame and current-frame) followed by a multi-layer perceptron (perhaps with the time interval between the frames fed in) doesn’t do the trick. This problem feels simpler than classification and I don’t see why it should need so many parameters and so much training data. My next step is to collect data on some games—let’s say “Kid Icarus”, “Super Mario Bros. 3”, “Batman”, “The Legend of Zelda”, “The Guardian Legend”, and “Dragon Quest IV”. It would be great if I could slam images from a number of games into one model and have it learn to see scrolling in other, not-yet-seen games. The biggest win would be if I could get scrolling data from one or two specific Super Nintendo games (e.g., “Super Mario World” or “The Legend of Zelda: A Link to the Past” or “Super Metroid”) and see if the model could generalize.
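
As a sketch of what I mean, something like the model below, ignoring the frame-interval input for now; the filter count, kernel size, and the [batch, 3, 2, 240, 256] input layout (RGB channels by two frames by height by width) are all assumptions:

```rust
use tch::nn::{self, Module};

// A 2x2x2 kernel mixes the two frames in its first step and collapses the
// frame axis, leaving an ordinary feature map for the MLP to read out.
fn conv3d_scroll(p: &nn::Path) -> impl Module {
    nn::seq()
        .add(nn::conv3d(p / "c", 3, 8, 2, Default::default()))
        .add_fn(|xs| xs.relu())
        .add_fn(|xs| xs.flatten(1, -1))
        .add(nn::linear(p / "fc1", 8 * 239 * 255, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(p / "fc2", 256, 2, Default::default()))
}
```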

All that sounds cool, but to publish a paper on it I’d probably also have to show that this helps with some other problem like background subtraction or object detection. I think I can see how this would fit together—if we knew the camera movement we could see when objects are moving in ways that differ from the camera, and those would be our foreground. There are other applications too including localization and mapping, and I think that should be enough to get something publishable.

Overview, 2022-09-27

Xinyi Therese Xu ’24 and I were talking for a while about adding accessibility features to retro games, especially aids for players with visual and motor disabilities. This picks up a thread of research that I first saw carried by Batu Aytemiz et al. in the paper “Your Buddy, the Grandmaster: Repurposing the Game-Playing AI Surplus for Inclusivity”. I got pretty into this idea around the time I applied for CAREER in the summer of 2021, but 2021-2022 was a really intense academic year and I wasn’t able to explore it much further. While more and more games made today are implemented with accessibility features, so many games of major cultural import will never be patched to include such features. So, how can we hack existing games to add them?

There are a few areas here I want to investigate:

  1. Visual supports, probably with some game-specific information—as in modern PC games, can we highlight game entities with positive or negative valence, de-emphasize and reduce contrast of unimportant details, augment text displays with text-to-speech support, or crop and zoom the screen where that would be helpful?
  2. Motor supports, again with game-specific information about character control. Can we add click-to-move features to Super Mario Bros., use slow-motion or other time travel features to make games accessible to people with differing reaction times, and so on?
  3. Can the information needed by the supports above—the valence of game entities, the control scheme and character physics, et cetera—be defined in a way which is convenient for non-specialist players, potentially by the same people who want to play with these supports?
  4. Can the information needed by the supports above be learned automatically to support arbitrary new games?

Back in 2021, I added some Python scripting support to Mappy. It exposes the sprites and tiles on screen (in world coordinates) and gives write access to the screen pixels, so it’s a good starting point for adding visual supports. I think this is a project which is both interesting and useful, and it’s something that feels close to being publishable, so I’m hoping to pursue it over the next couple of months.

I started today by re-reading Batu’s paper, and picking up an email thread with a researcher at UC Irvine. I haven’t published in accessibility before, and I feel like my work is more likely to land in the general area of Batu’s—a technological intervention with some examples motivated by existing work, which could be evaluated with a real population in the future.

This is one of those projects where I feel pretty comfortable with the basic idea (change the visual appearance of things based on their semantics), and it doesn’t involve coming up to speed on machine learning techniques or other technologies. So I think my first goal will be to figure out, for Super Mario specifically, how I can identify background tiles, foreground terrain tiles, enemies, and powerups; then, how I can best communicate e.g. the state of Mario (big, small, jumping, etc) and the salient aspects of the game state. Since I’m OK doing this manually, I’ll add a way to click on a sprite on screen to assign it some valence or affordances (maybe calling back to my work with Gerard Bentley ’19) and collect game data that way. Then, I’ll probably add some convenience functions for saturating/desaturating/tweaking contrast/etc of rectangular regions of the screen (maybe later supporting an alpha mask to keep the shapes crisp) and map those onto object valences. For the paper, I think it’s enough to just do a few levels of Super Mario Bros., the first dungeon of The Legend of Zelda, etc, and keep track of how long the tagging task takes.
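
For the display-modulation side, the kind of convenience function I have in mind is small. Here is a hypothetical sketch that pulls a rectangular region toward grayscale, assuming a packed RGBA framebuffer and a rectangle that stays inside the frame:

```rust
// Desaturate a rectangle of the frame in place; `amount` runs from 0.0 (leave alone)
// to 1.0 (full grayscale).
fn desaturate_rect(frame: &mut [u8], width: usize, rect: (usize, usize, usize, usize), amount: f32) {
    let (x0, y0, w, h) = rect;
    for y in y0..y0 + h {
        for x in x0..x0 + w {
            let i = (y * width + x) * 4;
            let (r, g, b) = (frame[i] as f32, frame[i + 1] as f32, frame[i + 2] as f32);
            let gray = 0.299 * r + 0.587 * g + 0.114 * b; // Rec. 601 luma approximation
            frame[i] = (r + (gray - r) * amount) as u8;
            frame[i + 1] = (g + (gray - g) * amount) as u8;
            frame[i + 2] = (b + (gray - b) * amount) as u8;
        }
    }
}
```

Saturation boosts and contrast tweaks would be the same loop with a different per-pixel formula, and the alpha-mask idea just replaces the rectangle bounds with a per-pixel weight.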

The paragraphs above are essentially an outline of a research paper already: name an area, scope it out, relate it to prior work, make an intervention, and show its utility. I watched Simon Peyton Jones’s “How to Write a Great Research Paper” recently—as eager as I am to start hacking, I think I might actually try to write the paper first this time around.

Overview, 2022-09-26

The key open questions in moving Mappy away from depending on instrumented NES emulators are:

  1. What representation should be used for space? Tiles and pixels have some issues; objects are good but very under-specified as a notion.
  2. Can we avoid the heavy instrumentation for camera movement detection? It’s very NES-specific as well as one of the few things (the only thing?) we need a custom emulator core for.
  3. Can we cleanly separate layers by foreground/background, or segment by objects, so that we don’t need sprite tracking or grid identification/tile alignment?

I have wanted to use Mappy to generate datasets for machine learning for a while now. Chloe Sun ’23 did some work on scroll detection last year or the year before (using scroll data from Mappy), but we didn’t have much awareness of computer vision ML architectures, so it was fragile. Chanha Kim ’22 and Jaden Kim ’22 did some good work applying YOLO to game datasets using synthetic data (not Mappy-generated tags, because Mappy at the time didn’t do blobbing). So I am feeling pretty strongly that doing any of (1), (2), or (3) depends on having a good pipeline for collecting data about game play.

I read two cool papers recently that got me thinking about this area again. Harley et al. had an interesting one called “Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories”, which takes a series of image frames and a set of pixels in the initial image to track, and identifies the movements of those particles over time. It’s neat that it can track specific target pixels, and that it tracks through occlusions as well.

This seems promising for something like the sprite tracking that Mappy needs to do, but it doesn’t solve everything. First, it can’t distinguish camera from object movement because it’s a low level technique, at the level of tracking optical flow—for featureless Super Mario backgrounds, I think it would see platforms/terrain as moving left through space as Mario moves rightwards through the level, and if Mario and the camera move at the same speed it would think Mario is stationary. So there would be some work to back out “scene movement” from “object movement” which depends on knowing what’s static terrain and what’s foreground (a scrolling model like the one Chloe and I tried to figure out would be helpful here).

The second thing to think about is that particles are not objects—a grid of particles could be used to approximate a set of objects (e.g., groups of particles that move together are probably the same object), but that’s an extra set of steps. And grids aren’t always what we want; they’re great for Mario but less good for Super Metroid. Still, it seems like a great foundational technique. I only saw one or two performance numbers in the paper (I guess it depends on your GPU, resolution, and everything else), but if it were fast enough for interactive use that would be really cool; I doubt it though, since it wants 200ms for 8 frames at 480×1024 resolution, and I would need it at least one and hopefully two orders of magnitude faster. Another thought: I could borrow the 8-frame tracking idea that leverages both appearance similarity and movement prior to help make Mappy’s existing sprite tracking more robust (it doesn’t really use a movement prior or anything if I recall correctly).

The other paper that I’ve been wanting to read for a while is “MarioNette: Self-Supervised Sprite Learning” by Smirnov et al. Again, it has some cool tricks that I would like to borrow although I don’t think I can use it wholesale. The coolest tricks to me are the idea of learning a sprite dictionary in a self-supervised fashion, and the idea of training a model to learn a scene representation from which the original scene is reconstructed. This scene representation idea breaks a mental block I had since my previous attempts in this area were focused on pixel representations. I’m not sure their representation is exactly what I would want to use, but there is a lot to recommend it. It’s kind of complicated, but a brief summary is that a screen layer is a coarse grid of points, and each point may or may not have a sprite connected to it at some (x,y) offset. Which sprite it is depends on how well the point’s local features match those of any witnessed sprite in the dictionary.
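
As plain data, my reading of that representation is something like the sketch below (the real model works with learned, soft versions of all of these rather than discrete structs):

```rust
// A dictionary of learned sprite patches, all of one fixed size.
struct SpriteDictionary {
    sprite_size: usize,
    sprites: Vec<Vec<u8>>, // one flattened patch per dictionary entry
}

// One anchor per coarse grid point; it may or may not have a sprite attached.
struct Anchor {
    sprite_index: Option<usize>, // which dictionary entry best matches the local features
    offset: (f32, f32),          // sub-cell (x, y) offset at which that sprite is drawn
}

// A screen layer is just the coarse grid of anchors.
struct SceneLayer {
    grid_width: usize,
    grid_height: usize,
    anchors: Vec<Anchor>, // row-major, grid_width * grid_height entries
}
```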

I don’t love that this has a fixed sprite-size hyperparameter, and that it depends on either learning one big static background image (with each frame registered at some offset within that image) or on being provided a fixed background color. But could it be combined with some oracle for background detection? Such a thing could be trained on an emulator that renders the background separately from non-background colors, but the NES PPU’s abstraction of “background” is leaky: it’s purely visual, not semantic (and the problem is even worse on the SNES with its many layers).

So I read these papers and thought about how they fit into my goals and plans, what I could borrow from them, and how I could extend them. For now, I think I want to focus on using Mappy (or at least my instrumented emulators) to create a high-quality dataset of game objects, terrains, camera movement, and so on within single rooms (the multi-room mapping registration is currently a bit buggy and I need to investigate it more). I also want to try applying these two papers’ models to the gameplay data I have, though I don’t think a pre-trained version of the MarioNette model or even its dataset is available (or at least I haven’t found either).