I decided a good starting point would be to revisit the detection of camera movement in games, since knowing what’s static terrain and what’s dynamic objects should help with object tracking. I opened up the source code of the hacked Mappy interactive player Chloe was using to dump scrolling/camera movement data, and thought about whether dumping scrolling data, game images, et cetera should be added to the standard Mappy player or kept separate as it is now. I decided to roll the data collection code into the interactive and batch players, since I didn’t want to maintain separate scroll-dumping versions of each.
This meant taking the ad hoc code that dumped game images and scrolling data (previously something like a 20-line diff patched in at various places in the interactive player) and creating a =ScrollDumper= struct which offers initialization (making data directories), update (recording the data), and finalize (writing the data to disk) functions. Now I can use this same struct in the batch player when I’m ready to move it over, or potentially lift it into Mappy proper; it also means I could generalize to e.g. performance stat dumping, sprite data dumping, et cetera. For the time being, it just outputs screenshots, a CSV with scrolling data (relative movement frame to frame), and the sequence of inputs used to produce them.
In the code, it’s stored as an =Option<ScrollDumper>= so that collection of this data can easily be turned on and off. In the future it could implement some data-collection trait and be put into a =Vec<Box<dyn …>>= or something, but I don’t need that yet.
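Concretely, the shape is roughly this (the names and the exact data recorded here are illustrative, not Mappy’s actual API):

#+begin_src rust
use std::path::PathBuf;

/// Owns the whole data-collection lifecycle for one play session.
struct ScrollDumper {
    data_dir: PathBuf,         // root directory for this run's dump
    scrolls: Vec<(i32, i32)>,  // per-frame relative camera movement
    inputs: Vec<u8>,           // packed controller input, one byte per frame
}

impl ScrollDumper {
    /// Initialization: create the output directories up front.
    fn new(data_dir: PathBuf) -> std::io::Result<Self> {
        std::fs::create_dir_all(data_dir.join("screenshots"))?;
        Ok(Self { data_dir, scrolls: Vec::new(), inputs: Vec::new() })
    }

    /// Update: called once per emulated frame to record that frame's data
    /// (the screenshot for the frame would be dumped here too).
    fn update(&mut self, scroll: (i32, i32), input: u8) {
        self.scrolls.push(scroll);
        self.inputs.push(input);
    }

    /// Finalize: write the scroll CSV and the input log under data_dir.
    fn finalize(self) -> std::io::Result<()> {
        let mut csv = String::from("dx,dy\n");
        for (dx, dy) in &self.scrolls {
            csv.push_str(&format!("{},{}\n", dx, dy));
        }
        std::fs::write(self.data_dir.join("scrolls.csv"), csv)?;
        std::fs::write(self.data_dir.join("inputs.bin"), &self.inputs)
    }
}
#+end_src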
In the process of doing this, I changed the screenshot dumping code to use the existing framebuffer structure (and encode it to PNG) rather than making a copy of the framebuffer. With that bit of programming done, I tried out the interactive player in data-dumping mode on “Super Mario Bros.”, and verified by hand that it was outputting reasonable scrolling data.
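That screenshot change boils down to a single call; a sketch assuming the =image= crate (the function name here is mine):

#+begin_src rust
// Encode the emulator's RGBA framebuffer slice straight to PNG. No extra
// copy is made on our side: save_buffer encodes from the borrowed slice,
// picking the PNG encoder from the file extension.
fn dump_screenshot(
    path: &std::path::Path,
    pixels: &[u8],
    width: u32,
    height: u32,
) -> image::ImageResult<()> {
    image::save_buffer(path, pixels, width, height, image::ColorType::Rgba8)
}
#+end_src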
The data look okay, so the next step is to state the problem specifically: given a series of screenshots (let’s say two, and grayscale is fine), we want to describe the camera movement from one to the other as two numbers (and ideally get a bounding box for the part of the image that actually scrolls, but we can leave that for later). One complicating factor is that games usually run at 30 or 60 frames per second, but it’s quite costly to collect and use data at that framerate; moreover, inference isn’t free either, and running it 30 times a second alongside all the other work Mappy does every frame seems untenable. So we’ll aim for the neighborhood of 10 frames per second.
If we could really work at that high framerate, we could assume that individual camera movements are very small (plus or minus four pixels per frame feels reasonable; usually on the NES it will be something like a repeating pattern of 2 pixels, 2 pixels, 3 pixels for an average of 2.33 pixels per frame). But we can’t handle samples that frequently, so the nice discrete classification problem (which of these 8 values is the horizontal camera movement?) turns into a fuzzier regression problem. Using more frames seems like it could help recover more accurate data, but for now let’s see how far we get by treating it like a regression problem (say, between negative 32 and 32 pixels of movement).
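For calibration: at most ±4 pixels per frame over a six-frame window (sampling at roughly 10 Hz from a 60 Hz game) is at most ±24 pixels per sample, which sits comfortably inside that ±32 range. The targets themselves can come straight out of the dumped CSV by summing frame-to-frame deltas over each window; a minimal sketch (the function and its inputs are mine, for illustration):

#+begin_src rust
// Turn the dumped per-frame (dx, dy) deltas into one regression target per
// sampling window by summing within each window. `window` would be 6 for a
// 60 Hz game sampled at roughly 10 Hz, or 3 for a 30 Hz game.
fn window_targets(deltas: &[(i32, i32)], window: usize) -> Vec<(i32, i32)> {
    deltas
        .chunks(window)
        .map(|w| w.iter().fold((0, 0), |(sx, sy), &(dx, dy)| (sx + dx, sy + dy)))
        .collect()
}
#+end_src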
Since individual pixel shifts really matter here, I don’t think it makes sense to do the stacks of convolutions and pooling that are typical for computer vision problems. Or at least, if that is the move, it needs to be supplemented with skip connections so that the original image can feed through to the later stages. I read about an architecture called MLP-Mixer in Tolstikhin et al.’s “MLP-Mixer: An all-MLP Architecture for Vision”; it’s closely related to the Vision Transformer, keeping the same patch-based setup but replacing attention with MLPs. These approaches break an image into non-overlapping patches, use an MLP to do feature extraction on each patch individually (or on a linear projection of each patch), and then mix information across patches, with skip connections carrying earlier representations through to the later stages.
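To make “non-overlapping patches” concrete, here’s roughly what that tokenization step looks like on a raw grayscale frame; this is plain illustrative Rust, not anything from the paper or from Mappy:

#+begin_src rust
// Purely illustrative: chop a grayscale frame (row-major, w * h bytes) into
// non-overlapping p-by-p patches, each flattened into its own Vec, which is
// the tokenization step these patch-based models start from.
fn patches(frame: &[u8], w: usize, h: usize, p: usize) -> Vec<Vec<u8>> {
    assert_eq!(frame.len(), w * h);
    assert!(p > 0 && w % p == 0 && h % p == 0);
    let mut out = Vec::with_capacity((w / p) * (h / p));
    for py in (0..h).step_by(p) {
        for px in (0..w).step_by(p) {
            let mut patch = Vec::with_capacity(p * p);
            for y in py..py + p {
                // Copy one row of the patch out of the full frame.
                patch.extend_from_slice(&frame[y * w + px..y * w + px + p]);
            }
            out.push(patch);
        }
    }
    out
}
#+end_src

On a 256×240 NES frame, 8×8 patches would give a 32×30 grid of tokens.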
I will look into these kinds of techniques more deeply if something much simpler doesn’t do the trick: 3D convolutions over the previous and current frames, followed by a multi-layer perceptron (perhaps with the time interval between the frames fed in). This problem feels simpler than image classification, and I don’t see why it should need so many parameters or so much training data. My next step is to collect data on some games: let’s say “Kid Icarus”, “Super Mario Bros. 3”, “Batman”, “The Legend of Zelda”, “The Guardian Legend”, and “Dragon Quest IV”. It would be great if I could slam images from a number of games into one model and have it learn to see scrolling in other, not-yet-seen games. The biggest win would be collecting scrolling data from one or two Super Nintendo games (e.g., “Super Mario World”, “The Legend of Zelda: A Link to the Past”, or “Super Metroid”) and seeing whether a model trained on NES games generalizes to them.
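To pin down that simpler model, here’s a shape-level sketch using the tch crate as a stand-in for whatever framework I actually end up with; =CameraNet=, the layer sizes, and the hard-coded 240×256 resolution are all placeholders, not a tuned design:

#+begin_src rust
// Shape-level sketch only, assuming the tch crate (libtorch bindings);
// every name and size below is a placeholder.
use tch::{nn, nn::Module, Tensor};

struct CameraNet {
    conv: nn::Conv3D,
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl CameraNet {
    fn new(vs: &nn::Path) -> CameraNet {
        // Input: (batch, 1 channel, 2 frames, 240, 256), the previous and
        // current grayscale frames stacked along the depth axis. Padding
        // keeps that depth-2 axis alive under a 3x3x3 kernel.
        let cfg = nn::ConvConfig { padding: 1, ..Default::default() };
        let conv = nn::conv3d(vs / "conv", 1, 4, 3, cfg);
        // 4 channels * 2 frames * 240 * 256 after the padded convolution,
        // plus 1 for the time interval between the two frames.
        let flat: i64 = 4 * 2 * 240 * 256;
        let fc1 = nn::linear(vs / "fc1", flat + 1, 64, Default::default());
        // Two outputs: horizontal and vertical camera movement in pixels.
        let fc2 = nn::linear(vs / "fc2", 64, 2, Default::default());
        CameraNet { conv, fc1, fc2 }
    }

    /// `frames` is (N, 1, 2, 240, 256); `dt` is (N, 1) seconds between frames.
    fn forward(&self, frames: &Tensor, dt: &Tensor) -> Tensor {
        let x = self.conv.forward(frames).relu().flatten(1, -1);
        let x = Tensor::cat(&[x, dt.shallow_clone()], 1);
        self.fc1.forward(&x).relu().apply(&self.fc2)
    }
}
#+end_src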
All that sounds cool, but to publish a paper on it I’d probably also have to show that this helps with some other problem like background subtraction or object detection. I think I can see how this fits together: if we know the camera movement, we can spot objects moving in ways that differ from the camera, and those become our foreground. There are other applications too, including localization and mapping, and I think that should be enough to get something publishable.