Overview, 09/26/23

The key open questions in moving Mappy away from depending on instrumented NES emulators are:

  1. What representation should be used for space? Tiles and pixels have some issues; objects are good but very under-specified as a notion.
  2. Can we avoid the heavy instrumentation for camera movement detection? It’s very NES-specific, and it’s one of the few things (maybe the only thing?) we need a custom emulator core for.
  3. Can we cleanly separate layers into foreground and background, or segment the scene by objects, so that we don’t need sprite tracking or grid identification/tile alignment?

I have wanted to use Mappy to generate datasets for machine learning for a while now. Chloe Sun ’23 did some work on scroll detection last year or the year before (using scroll data from Mappy), but we didn’t have much awareness of computer vision ML architectures, so it was fragile. Chanha Kim ’22 and Jaden Kim ’22 did some good work applying YOLO to game datasets using synthetic data (not Mappy-generated tags, because Mappy at the time didn’t do blobbing). So I feel pretty strongly that doing any of (1), (2), or (3) depends on having a good pipeline for collecting data about game play.

I read two cool papers recently that got me thinking about this area again. Harley et al. had an interesting one called “Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories”, which takes a series of image frames plus a set of pixels in the initial frame to track, and identifies how those particles move over time. It’s neat that it can track specific target pixels, and that it tracks through occlusions as well.

This seems promising for something like the sprite tracking that Mappy needs to do, but it doesn’t solve everything. First, it can’t distinguish camera movement from object movement, because it’s a low-level technique working at the level of optical flow. For featureless Super Mario backgrounds, I think it would see platforms and terrain as moving left through space as Mario moves rightward through the level, and if Mario and the camera move at the same speed it would think Mario is stationary. So there would be some work to back “scene movement” out of “object movement”, which depends on knowing what’s static terrain and what’s foreground (a scrolling model like the one Chloe and I tried to figure out would be helpful here).
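To make that concrete, here is a rough sketch of how I might back camera scroll out of point trajectories, assuming I already have tracks from something like the Harley et al. tracker. The median-motion heuristic and the function names are my own placeholders, not anything from the paper or from Mappy, and it only works if most tracked points sit on static terrain:

```python
import numpy as np

def estimate_scroll(tracks: np.ndarray) -> np.ndarray:
    """Per-frame camera displacement, estimated as the median point motion.

    tracks: (T, N, 2) positions of N tracked points over T frames (e.g., the
    output of a point tracker). If most points lie on static terrain, the
    median displacement between consecutive frames approximates the scroll.
    """
    deltas = np.diff(tracks, axis=0)     # (T-1, N, 2) per-point motion
    return np.median(deltas, axis=1)     # (T-1, 2) robust camera motion

def scene_relative_tracks(tracks: np.ndarray) -> np.ndarray:
    """Subtract cumulative scroll so trajectories live in level coordinates."""
    scroll = estimate_scroll(tracks)
    offset = np.concatenate([np.zeros((1, 2)), np.cumsum(scroll, axis=0)])  # (T, 2)
    return tracks - offset[:, None, :]

# Points whose scene-relative motion stays near zero are candidate terrain;
# the rest are candidate foreground objects (Mario, enemies, and so on).
```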

The second thing to think about is that particles are not objects. A grid of particles could be used to approximate a set of objects (e.g., groups of particles that move together are probably the same object), but that’s an extra set of steps. And grids aren’t always what we want; they’re great for Mario but less good for Super Metroid. Still, it seems like a great foundational technique. I only saw one or two performance numbers in the paper (I guess it depends on your GPU, resolution, and everything else), but if it were fast enough for interactive use that would be really cool; I doubt it, though, since it wants 200ms for 8 frames at 480×1024 resolution, and I would need it at least one and hopefully two orders of magnitude faster. Another thought: I could borrow the 8-frame tracking idea, which leverages both appearance similarity and a movement prior, to help make Mappy’s existing sprite tracking more robust (if I recall correctly, it doesn’t really use a movement prior or anything like one).
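As a sketch of that grouping step (turning co-moving particles into candidate objects), I could cluster points whose positions and per-frame velocities are similar. The choice of DBSCAN and the feature layout here are my assumptions, not something from either paper, and the eps threshold mixes pixel and pixels-per-frame units, so this is crude:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_particles(tracks: np.ndarray, eps: float = 8.0) -> np.ndarray:
    """Cluster point tracks that move together into candidate objects.

    tracks: (T, N, 2) scene-relative positions (see the scroll sketch above).
    Each point's feature is its mean position plus its per-frame velocities,
    so points that stay near each other *and* share a motion pattern land in
    the same cluster. Returns one label per point (-1 = noise/unassigned).
    """
    T, N, _ = tracks.shape
    mean_pos = tracks.mean(axis=0)                            # (N, 2)
    velocities = np.diff(tracks, axis=0)                      # (T-1, N, 2)
    vel_feat = velocities.transpose(1, 0, 2).reshape(N, -1)   # (N, (T-1)*2)
    features = np.concatenate([mean_pos, vel_feat], axis=1)
    return DBSCAN(eps=eps, min_samples=3).fit_predict(features)
```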

The other paper that I’ve been wanting to read for a while is “MarioNette: Self-Supervised Sprite Learning” by Smirnov et al. Again, it has some cool tricks that I would like to borrow, although I don’t think I can use it wholesale. The coolest tricks to me are the idea of learning a sprite dictionary in a self-supervised fashion, and the idea of training a model to learn a scene representation from which the original scene is reconstructed. This scene representation idea breaks a mental block I had, since my previous attempts in this area were focused on pixel representations. I’m not sure their representation is exactly what I would want to use, but there is a lot to recommend it. It’s kind of complicated, but a brief summary is that a screen layer is a coarse grid of points, and each point may or may not have a sprite attached to it at some (x, y) offset. Which sprite it is depends on how well the point’s local features match those of any witnessed sprite in the dictionary.
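To pin down what that representation might look like in code, here is a rough reconstruction sketch with my own names and simplifications (integer-truncated offsets, RGBA sprites with alpha in [0,1], at most one sprite per grid cell); it is not the paper’s actual parameterization, which is soft, differentiable, and learned:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Anchor:
    """One cell of the coarse grid: possibly one sprite, placed at an offset."""
    sprite_id: Optional[int]   # index into the sprite dictionary, or None
    dx: float                  # offset from the cell origin, in pixels
    dy: float

def reconstruct(anchors: list[list[Anchor]],
                dictionary: np.ndarray,   # (K, S, S, 4) RGBA sprites in [0,1]
                cell: int,
                height: int, width: int) -> np.ndarray:
    """Composite sprites onto a blank canvas; a crude stand-in for the
    reconstruction target that the model is trained against."""
    canvas = np.zeros((height, width, 3), dtype=np.float32)
    S = dictionary.shape[1]
    for gy, row in enumerate(anchors):
        for gx, a in enumerate(row):
            if a.sprite_id is None:
                continue
            x0 = int(gx * cell + a.dx)
            y0 = int(gy * cell + a.dy)
            # Skip sprites that start off-canvas; clip those that run off it.
            if x0 < 0 or y0 < 0 or x0 >= width or y0 >= height:
                continue
            y1, x1 = min(y0 + S, height), min(x0 + S, width)
            patch = dictionary[a.sprite_id][: y1 - y0, : x1 - x0]
            alpha = patch[..., 3:4]
            canvas[y0:y1, x0:x1] = (alpha * patch[..., :3]
                                    + (1 - alpha) * canvas[y0:y1, x0:x1])
    return canvas
```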

I don’t love that this has a fixed sprite-size hyperparameter, or that it depends either on learning one big static background image (with each frame registered at some offset into that image) or on being provided a fixed background color. But could it be combined with some oracle for background detection? Such an oracle could be trained on an emulator that renders background pixels separately from non-background pixels, but the NES PPU’s abstraction of “background” is leaky: it’s purely visual and not semantic (and the problem is even worse on SNES with its many layers).
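If I did build such an oracle, the training data might come from something like the loop below. The emulator interface here (step/framebuffer/background_only) is entirely hypothetical, not an API any of my emulators actually expose, and the labels would inherit exactly the leaky, purely visual notion of background described above:

```python
import numpy as np

def collect_pairs(emu, inputs, every: int = 4):
    """Yield (frame, background_mask) training pairs from a hypothetical
    instrumented emulator.

    `emu` is assumed to expose:
      - step(buttons): advance one frame with the given controller input
      - framebuffer(): the full rendered frame as an (H, W, 3) uint8 array
      - background_only(): the same frame rendered with sprites suppressed

    The mask marks pixels that are identical with and without sprites,
    i.e. the PPU's (leaky, purely visual) notion of background.
    """
    for t, buttons in enumerate(inputs):
        emu.step(buttons)
        if t % every:
            continue
        full = emu.framebuffer()
        bg = emu.background_only()
        mask = np.all(full == bg, axis=-1)   # (H, W) bool: True = background
        yield full, mask
```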

So I read these papers and thought about how they fit into my goals and plans, what I could borrow from them, and how I could extend them. For now, I think I want to focus on using Mappy (or at least my instrumented emulators) to create a high-quality dataset of game objects, terrain, camera movement, and so on within single rooms (the multi-room map registration is currently a bit buggy and I need to investigate it more). I also want to try applying these two papers’ models to the game play data I have, though I don’t think a pre-trained version of the MarioNette model or even its dataset is available (or at least I haven’t found either).
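For the dataset itself, I’m imagining something like one record per emulated frame. The field names below are illustrative guesses at what a useful schema might contain, not Mappy’s actual output format:

```python
import json

# One JSON line per frame; field names are illustrative, not Mappy's schema.
record = {
    "game": "smb1",
    "room": "1-1",
    "frame": 812,
    "camera": {"x": 1024, "y": 0},              # scroll offset in level pixels
    "sprites": [                                 # tracked foreground objects
        {"track_id": 3, "x": 1090, "y": 160, "w": 16, "h": 16, "label": None},
    ],
    "terrain_tiles": "tiles/smb1_1-1_0812.png",  # or a grid of tile ids
}
print(json.dumps(record))
```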