Saturday, May 12, 2007

IWRAM code vs. recompiler

(NOTE: this post pertains to any ports of gpSP using dynarec, that means the PSP and GP2X versions)

There is an issue with certain GBA games that has plagued me since I first started work on gpSP. I've tried some hacks to get around it, and they've helped, but they were very ugly. Some months ago I came up with a new idea to try to improve the situation. This idea was a bit complex and had some serious complications and I basically abandoned it when I thought of something simpler.

This is the issue: GBA can execute code from one of two places, ROM (the cartridge and the BIOS) and RAM (that includes IWRAM, EWRAM, and rarely, VRAM. gpSP doesn't support executing code from VRAM right now so that's out already). ROM can't be overwritten, so when code is encountered from ROM it's known that it'll always be the same. RAM is another story.

Most games will load a bit of code into RAM to run it quickly. Usually this code is just left alone after being loaded, so it's like ROM but faster. Sometimes it's reloaded when the game changes significantly (like entering a new area, or going from the title screen to ingame), but usually it stays the same for a long time. These modifications to code do have to be checked, because it means that the code that has been dynamically recompiled has to be flushed (since it is no longer valid). Normally all of the code that was in RAM is flushed, since the buffers aren't setup in a way that allow you to flush part of it, and doing so would make any other blocks that branch into the modified block invalid. The operation to scan and update all of RAM would probably be as slow or slower than just recompiling all of it.

This works fine for a majority of games. There's some overhead checking if writes modify code, but it's not a lot.

The problem comes when games modify code in RAM a LOT. Some games only change a few bytes, but constantly. This is known as self-modifying code. It's a pretty old technique that's shunned these days because it doesn't get along well with caching. The GBA doesn't have cache, so some games still do this. It may be the case that some games also load in huge chunks of code to swap it in and out of RAM constantly, which isn't quite the same as self-modifying code but appears the same. This usually happens over DMA. From what I've seen not a lot of code in IWRAM is modified per-frame, even when DMA is used. Some games are very sloppy though, like those written by Camelot. They will DMA the same thing into IWRAM over and over again.

Anyway, even if only a few bytes are changed, with the way things work now it'll cause everything that has been compiled to be flushed. So I came up with a solution to help change this. Now I have two buffers for RAM code, a "static" one and a "dynamic" one. There are counters for code regions associated with self modifying code writes. When something has been written to enough times the region changes from static to dynamic. Now when code in that region is recompiled it will go to the dynamic buffer instead of the static one. The consequence of this is that when this code is modified in the future only the dynamic buffer needs to be flushed. If the amount of code being rewritten is relatively small then this dynamic region will be minimized.

I haven't done any real world speed tests on this yet, but I have determined that this method is causing far fewer bytes to be recompiled per frame. There's still quite a bit of overhead in flushing, but this can be alleviated tremendously and cheaply by using region tables. I haven't actually looked at numbers from the original implementation, actually, I was comparing it with dynamic + static buffers but with both being flushed on SMC writes. This would probably still perform better than the original because the "translation gate" custom optimization was effectively being performed dynamically.

Translation gates are a hack I added to try to prevent things from "overcompiling." The issue is that the recompiler is very deep, and will compile as much as it can. If a region of code is changed then the recompilation buffers will be flushed. What happens next is that to resume CPU execution a pointer to the currently translated code has to be found. If the code was running in ROM then this is fine, it'll still be there. But if it was running in RAM (and Camelot games like to modify code that is dangerously close to the code that's currently running) then it'll have to recompile it again. If the deep recompilation ends up catching the code that was modified then it'll recompile it over and over again, every time it's modified. If the code is modifying a huge block in a loop then this means every iteration of the loop will cause a recompilation of this code. What we really want is to make it so this code isn't recompiled until it's actually used (and thus the writing is long done). The translation gates were like breakpoints that I determined by looking at the code in a debugger. They told the recompiler to stop compiling when they're hit, and generate an indirect branch that's resolved at runtime, to get to the next portion.

In short, the static/dynamic analysis finds these breakpoints automatically, and it does a better job than I was doing with the translation gates (although it does add overhead of more indirect branches at runtime). But the selective flushing adds a huge gain on top of this.

I don't remember the exact numbers, but I do know that Golden Sun recompiles far less per frame (I think it was about 500-3000 bytes now compared to over 10000 bytes before). However, this improvement is nothing compared to the one in Duke Nukem Advance and Doom 2. These were extremely pathological in how they hit the recompiler, apparently they were compiling several dozens of thousands of bytes per frame, even well over 100,000. Now they typically recompile under 1000. I expect this to make a huge difference in performance, although the GP2X version will need more optimized video writing code to see good performance in these games (the PSP version doesn't have this problem).

There are perhaps still some things I can do to make things better. I can segment the dynamic cache into several regions, so this way if a number of places are being modified they can each be flushed independently (in practice these locations probably don't jump between each other, but who knows?). I also had some really crazy ideas about aliasing different RAM frames that get flipped when DMA occurs, but I don't think this will ever work out because the modifications to code are just too irregular.. I expected they'd follow a cyclic pattern, and they just don't, at all.

I need to fix and refine the code a bit more (some games are broken now?), then I'm going to do some tests on the GP2X to see how the current version fares against this one for problem games.