gpSP development: 2007

Saturday, May 12, 2007

IWRAM code vs. recompiler

(NOTE: this post pertains to any ports of gpSP using dynarec, that means the PSP and GP2X versions)

There is an issue with certain GBA games that has plagued me since I first started work on gpSP. I've tried some hacks to get around it, and they've helped, but they were very ugly. Some months ago I came up with a new idea to try to improve the situation. This idea was a bit complex and had some serious complications and I basically abandoned it when I thought of something simpler.

This is the issue: GBA can execute code from one of two places, ROM (the cartridge and the BIOS) and RAM (that includes IWRAM, EWRAM, and rarely, VRAM. gpSP doesn't support executing code from VRAM right now so that's out already). ROM can't be overwritten, so when code is encountered from ROM it's known that it'll always be the same. RAM is another story.

Most games will load a bit of code into RAM to run it quickly. Usually this code is just left alone after being loaded, so it's like ROM but faster. Sometimes it's reloaded when the game changes significantly (like entering a new area, or going from the title screen to ingame), but usually it stays the same for a long time. These modifications to code do have to be checked, because it means that the code that has been dynamically recompiled has to be flushed (since it is no longer valid). Normally all of the code that was in RAM is flushed, since the buffers aren't setup in a way that allow you to flush part of it, and doing so would make any other blocks that branch into the modified block invalid. The operation to scan and update all of RAM would probably be as slow or slower than just recompiling all of it.

This works fine for a majority of games. There's some overhead checking if writes modify code, but it's not a lot.

The problem comes when games modify code in RAM a LOT. Some games only change a few bytes, but constantly. This is known as self-modifying code. It's a pretty old technique that's shunned these days because it doesn't get along well with caching. The GBA doesn't have cache, so some games still do this. It may be the case that some games also load in huge chunks of code to swap it in and out of RAM constantly, which isn't quite the same as self-modifying code but appears the same. This usually happens over DMA. From what I've seen not a lot of code in IWRAM is modified per-frame, even when DMA is used. Some games are very sloppy though, like those written by Camelot. They will DMA the same thing into IWRAM over and over again.

Anyway, even if only a few bytes are changed, with the way things work now it'll cause everything that has been compiled to be flushed. So I came up with a solution to help change this. Now I have two buffers for RAM code, a "static" one and a "dynamic" one. There are counters for code regions associated with self modifying code writes. When something has been written to enough times the region changes from static to dynamic. Now when code in that region is recompiled it will go to the dynamic buffer instead of the static one. The consequence of this is that when this code is modified in the future only the dynamic buffer needs to be flushed. If the amount of code being rewritten is relatively small then this dynamic region will be minimized.

I haven't done any real world speed tests on this yet, but I have determined that this method is causing far fewer bytes to be recompiled per frame. There's still quite a bit of overhead in flushing, but this can be alleviated tremendously and cheaply by using region tables. I haven't actually looked at numbers from the original implementation, actually, I was comparing it with dynamic + static buffers but with both being flushed on SMC writes. This would probably still perform better than the original because the "translation gate" custom optimization was effectively being performed dynamically.

Translation gates are a hack I added to try to prevent things from "overcompiling." The issue is that the recompiler is very deep, and will compile as much as it can. If a region of code is changed then the recompilation buffers will be flushed. What happens next is that to resume CPU execution a pointer to the currently translated code has to be found. If the code was running in ROM then this is fine, it'll still be there. But if it was running in RAM (and Camelot games like to modify code that is dangerously close to the code that's currently running) then it'll have to recompile it again. If the deep recompilation ends up catching the code that was modified then it'll recompile it over and over again, every time it's modified. If the code is modifying a huge block in a loop then this means every iteration of the loop will cause a recompilation of this code. What we really want is to make it so this code isn't recompiled until it's actually used (and thus the writing is long done). The translation gates were like breakpoints that I determined by looking at the code in a debugger. They told the recompiler to stop compiling when they're hit, and generate an indirect branch that's resolved at runtime, to get to the next portion.

In short, the static/dynamic analysis finds these breakpoints automatically, and it does a better job than I was doing with the translation gates (although it does add overhead of more indirect branches at runtime). But the selective flushing adds a huge gain on top of this.

I don't remember the exact numbers, but I do know that Golden Sun recompiles far less per frame (I think it was about 500-3000 bytes now compared to over 10000 bytes before). However, this improvement is nothing compared to the one in Duke Nukem Advance and Doom 2. These were extremely pathological in how they hit the recompiler, apparently they were compiling several dozens of thousands of bytes per frame, even well over 100,000. Now they typically recompile under 1000. I expect this to make a huge difference in performance, although the GP2X version will need more optimized video writing code to see good performance in these games (the PSP version doesn't have this problem).

There are perhaps still some things I can do to make things better. I can segment the dynamic cache into several regions, so this way if a number of places are being modified they can each be flushed independently (in practice these locations probably don't jump between each other, but who knows?). I also had some really crazy ideas about aliasing different RAM frames that get flipped when DMA occurs, but I don't think this will ever work out because the modifications to code are just too irregular.. I expected they'd follow a cyclic pattern, and they just don't, at all.

I need to fix and refine the code a bit more (some games are broken now?), then I'm going to do some tests on the GP2X to see how the current version fares against this one for problem games.

Tuesday, April 3, 2007

ARM alpha blending pass

I rewrote the expand pass for alpha blending in ARM ASM, which should benefit the GP2X version and any future ARM versions (remember, there are a ton of prospective ARM devices out there). Even though my version is very straightforward and something that a compiler should have been able to generate it's still much better than what GCC generated with the optimization flags on now (I don't think that it could do any better with any others, but I don't know for sure).

The more actual blending going on onscreen the more of a speedup this version might give. Here are the numbers:

Castlevania: Aria of Sorrow
Full test : 6456 ms (21.522274 ms per frame)
No blending : 5379 ms (17.930580 ms per frame)
No video : 2241 ms (7.472070 ms per frame)
No CPU : 4392 ms (14.642917 ms per frame)
No CPU/video: 361 ms (1.204903 ms per frame)

CPU speed : 1880 ms (6.267167 ms per frame)
Video speed : 4215 ms (14.050203 ms per frame)
Alpha cost : 1077 ms (3.591693 ms per frame)

This one is the biggest winner. It has a ton of blending going on onscreen. The alpha cost has lowered by about 4.13ms over the C version - it's over twice as fast now.

Mario Kart:
Full test : 8843 ms (29.476669 ms per frame)
No blending : 8129 ms (27.098631 ms per frame)
No video : 4238 ms (14.129470 ms per frame)
No CPU : 3221 ms (10.737390 ms per frame)
No CPU/video: 505 ms (1.684263 ms per frame)

CPU speed : 3733 ms (12.445207 ms per frame)
Video speed : 4604 ms (15.347200 ms per frame)
Alpha cost : 713 ms (2.378040 ms per frame)

Here we see a smaller improvement, because the alpha cost wasn't that large to begin with. That's because alpha is only actually turned on for a small part of the screen. Last post I believed that brighten was used instead of alpha. This is what should have been used, but I think it's brightening by something a bit off-white. This effect is very subtle in-game, but if you turn it off (by making it choose the BOTTOM pixel, not the top) it becomes obvious it isn't there.

Still a win, and what's more, it shows that the difference between a small amount of blending and a large amount isn't that high.

FF6:
Full test : 11854 ms (39.515438 ms per frame)
No blending : 10175 ms (33.918694 ms per frame)
No video : 6133 ms (20.443438 ms per frame)
No CPU : 2643 ms (8.810610 ms per frame)
No CPU/video: 1142 ms (3.809923 ms per frame)

CPU speed : 4990 ms (16.633512 ms per frame)
Video speed : 5721 ms (19.072001 ms per frame)
Alpha cost : 1679 ms (5.596743 ms per frame)

Alpha cost is down again, but this time the smallest percentage-wise. I think that there might be other things making it slow, like heavy usage of windows. This game in particular deserves extra attention. Again, alpha isn't used on very much of the screen - it mainly just provides the gradient effect.

Next time I'll talk about some changes I'd like to try for the C code (which will hopefully impact the PSP version as well).

Monday, April 2, 2007

Profiling performance

NOTE: Even though this post is specifically geared to GP2X development the techniques it employs can be used on other platforms. I could run the profiler on the PSP version as well, it's just not as nice to print from until I finally start using PSPlink. It's possible that some optimizations I end up doing could improve performance for all versions too.

With my first release of gpSP (not the first release in general, just mine) for GP2X out of the way, I've decided to focus on all aspects of improving performance of the emulator. To make it more clear what's using up how much time resources on the GP2X I did a simple profiler that would load a savestate, run for N frames (in these cases 300, which is 5 seconds of virtual playtime), then reload the savestate and run another N frames, this time with various things turned off. If you take run A with X turned on then subtract from it run B with X turned off you can find ROUGHLY how long X took within A.

I say roughly because sometimes the time of two things both running will be greater than the sum of their parts (it's possible for this to be significantly so). The main reason for this would be cache competition since the processing can be tightly interleaved (in the case of CPU and video).

Furthermore, all the tests are run with synchronization off, so there are no artificial delays.

So, these are the tests:

Full: Everything is ran.

No alpha blending: I noticed that alpha blending can be very expensive, on all platforms. I also noticed GCC was producing shoddy code for it, which I could maybe do a better version of in ARM ASM (for the GP2X version specifically). So I decided to put this here. This is accomplished by setting a flag which forces the color mode in the BLDCNT register to 0 every scanline.

No video: I think video is the real CPU eater most of the time, so running without it focuses on the CPU (but you'll see there's more left than this). Accomplished by turning on manual frameskip and setting it really, really high (1000000).

No CPU: Kinda the inverse of the above. Makes the game frozen w/ horrible noise playing. This is accomplished by putting the CPU in HALT mode and turning off interrupts.

No CPU/Video: Here the residual things are tested. This includes audio timer performance - if the game had audio timers running in the background they'll still be running (hence the horrible noise), but the GBC channels will probably be off so they aren't tested. This also includes the main state transitions, timer updating/triggering, input polling, background stuff running on the platform (in GP2X's case it should hopefully just be the kernel, SSH server, and a few other low priority things).

And now for some actual test runs along with some commentary. All tests are ran on GP2X at 200MHz - with overclocking you'll achieve better results, but the actual numbers don't matter as much as their relativity to each other and how much they can be improved. However, you'd still ideally want one frame to take 16.67ms or less. If testing just to see how close it is to being full speed I'd do so with at least some overclocking (240-260MHz) since that's what most people will use. All tests also use the mmuhack, of course.

I tried to pick areas where the video load was pretty constant and it didn't look like there'd be a lot of fluctuations in CPU usage. Of course I avoided pressing anything while the tests ran.

Initially I put in numbers approximating the CPU speed/video speed by taking differences between full test and the test w/o the component. This ended up being heavily flawed because of my latter two tests, which change the video mode mid-frame, so disabling the CPU completely changes video performance. So what I settled on was the following:

Video speed ~= full test - no video
CPU speed ~= full test - video speed - residual speed
Alpha cost ~= full test - no blending

This means that for now the no CPU test isn't contributing to the results at the bottom, but can still be useful, especially for seeing how lighter video loads cope.

Castlevania: Aria of Sorrow: This test was done in the name entry screen. I chose this area because there is a lot of alpha blending going on onscreen.

Benchmark results (300 frames):
Full test : 7716 ms (25.722994 ms per frame)
No blending : 5398 ms (17.993380 ms per frame)
No video : 2245 ms (7.484780 ms per frame)
No CPU : 5146 ms (17.154354 ms per frame)
No CPU/video: 360 ms (1.201827 ms per frame)

CPU speed : 1884 ms (6.282953 ms per frame)
Video speed : 5471 ms (18.238213 ms per frame)
Alpha cost : 2318 ms (7.729613 ms per frame)

Here we can see the video is quite expensive, with a lot of that going to the alpha blending. The CPU isn't too bad, but of course significant and improvements in that area will hopefully show up on this test. The residual cost is relatively minor, but still takes up a chunk that can't be ignored. What should be researched is why the video cost is so high even without blending. It could be window usage, which can perhaps be optimized.

Mario Kart: Here I ran it ingame some time into the first race. This is a good example of a fairly heavy but still realistic video load (since it uses the affine transformed BG mode which is expensive, but it only uses it for about 2/3rds of the screen.. also uses a bit of alpha blending)

Benchmark results (300 frames):
Full test : 9036 ms (30.121626 ms per frame)
No blending : 8117 ms (27.058737 ms per frame)
No video : 4241 ms (14.137693 ms per frame)
No CPU : 3215 ms (10.719350 ms per frame)
No CPU/video: 505 ms (1.685850 ms per frame)

CPU speed : 3735 ms (12.451843 ms per frame)
Video speed : 4795 ms (15.983933 ms per frame)
Alpha cost : 918 ms (3.062890 ms per frame)

Now we see the CPU speed has gone up a lot. I hope for improvements to the dynarec and memory subsystem code to show up the most here. The video speed is high as well - if you take out the alpha cost then it's quite a bit higher than Aria of Sorrow's because of all the work that has to be done with scaling and rotating the backgrounds. It's possible that both this and the alpha blending can be improved with ARM ASM versions. The residual cost has gone up a little, but is within the same range. The alpha cost is a lot lower because it's not using actual alpha blending, but just color fades to brighten a few scanlines. This is cheaper to both do the base rendering on and perform the actual blending on. It's still there, of course.

Final Fantasy 6 Advance: This game is extremely demanding, especially in battle. So that's where I tested it.

Benchmark results (300 frames):
Full test : 12088 ms (40.295025 ms per frame)
No blending : 9948 ms (33.162895 ms per frame)
No video : 6123 ms (20.412947 ms per frame)
No CPU : 2663 ms (8.878957 ms per frame)
No CPU/video: 1143 ms (3.810030 ms per frame)

CPU speed : 4980 ms (16.602917 ms per frame)
Video speed : 5964 ms (19.882076 ms per frame)
Alpha cost : 2139 ms (7.132130 ms per frame)

Here we can see that the CPU speed is quite high. FF games on GBA are prone to using a lot of high frequency IRQs which incurs a good amount of expensive opcodes frequently - it's possible that switching to HLE BIOS solutions could help here. Again the video is quite expensive as well, with a lot of alpha being used to minor effect but expensively (the gradients on the menus, for instance). Of interest is the high residual cost, which has gone up quite a lot compared to the other games. This may be due to higher bitrate audio (haven't looked at the actual speeds, just a possibility). If this is the case then optimizing the audio code can improve this.

Final remarks: In one of the cases CPU didn't take much time, but for the others it was about even with video rendering. With this in mind it may be possible to gain some performance by parallelizing the two (both on GP2X and PSP) by splitting between CPU and video on the CPUs, and deciding which one gets the rest of things (but which to put where for GP2X? The choice is more obvious for PSP). This isn't at all straightforward to do however, and for framebuffer (3D) games this should be turned off. CPU and video are still heavily interleaved so you wouldn't get anything close to the performance of the slower of the two, but there could still be some improvement.

Optimizing the video code is something I'm more interested in than I have been for a while. Of course, the optimizing the CPU still takes utmost priority because frameskip can diminish the video time dramatically while still keeping the game playable (going from fs0 to fs1 makes a huge difference). And, just a little difference in CPU speed could make the difference between being able to go up a notch in fs, or more importantly, actually getting below the magical 16.7ms (preferably at reasonable clock speeds).

Tuesday, March 27, 2007

Register usage statistics

First, let me mention that I've reached the register analysis/caching step in my todo list for the dynarec. So I currently have almost all operations inlined and flags calculated natively.

As I mentioned previously, I will be performing a partial global static register allocation now. This is the simplest possible method of register allocation, where I have N registers available to me in the GP2X's register file and I want to map them to N registers on the GBA. Every time the GBA uses a register that is not one of those N it will have to be loaded from memory first.

Potentially I can free up about 10 registers for allocation. Right now I have 5 free: r3 through r7. Once I have converted some more functions from C to ASM I will be able to move the cycle count register from r8 to r12, and the saved register r9 won't really be needed. This will bring the total up to 7 registers. With a little more work a2 can be removed, and/or sp can be removed. This brings the total up to 8-9.

The reason why I chose global static register alloaction is because I believed that the register usage patterns would be less than uniformly distributed; a uniform distribution that every register is used 6.25% of the time (counting r15, which I would not cache because it is a constant). This means that even though this is the simplest scheme, it may be sufficient to cover most registers.

For the following test I ran these games for a bit each on the PC version. The interpreter w/register analysis was used.

FF: Dawn of Souls (FF2)
Golden Sun
Castlevania: Harmony of Dissonance
FF6 Advance
FFT Advance
Kirby: Nightmare in Dreamland
Mario & Luigi

I tried to pick games w/o idle loops because that would skew the results (since they're not eliminated in the interpreter). As far as I'm aware the only one with them is Kirby, and I ran that for the least amount of time. Even though most of these games are RPGs, their register usage pattern shouldn't vary from any other genre, as far as I expect (although I could be entirely wrong, I doubt it).

There are a few flaws in the analysis: register usage is not counted for the add operand in mla instructions, the PC is not counted sometimes in Thumb (but this shouldn't matter because I wouldn't pick PC anyway). I also count usage in the most general terms, and don't split it up into types of usage, although I tried to tag the usage types with more information in case I want to later. The results are therefore possibly a little crude, but should still give a mostly accurate picture of register usage.

ARM register usage (38.775138% ARM instructions):
r00: 18.263814% (-- 18.263814%)
r12: 11.531477% (-- 29.795291%)
r09: 11.500162% (-- 41.295453%)
r14: 9.063440% (-- 50.358893%)
r06: 7.837682% (-- 58.196574%)
r01: 7.401049% (-- 65.597623%)
r07: 6.778340% (-- 72.375963%)
r05: 5.445009% (-- 77.820973%)
r02: 5.427288% (-- 83.248260%)
r03: 5.293743% (-- 88.542003%)
r04: 3.601103% (-- 92.143106%)
r11: 3.207311% (-- 95.350417%)
r10: 2.334864% (-- 97.685281%)
r08: 1.708207% (-- 99.393488%)
r15: 0.311270% (-- 99.704757%)
r13: 0.295243% (-- 100.000000%)

Thumb register usage (61.224862% Thumb instructions):
r00: 34.788858% (-- 34.788858%)
r01: 26.564083% (-- 61.352941%)
r03: 10.983500% (-- 72.336441%)
r02: 8.303127% (-- 80.639567%)
r04: 4.900381% (-- 85.539948%)
r05: 3.941292% (-- 89.481240%)
r06: 3.257582% (-- 92.738822%)
r07: 2.644851% (-- 95.383673%)
r13: 1.408824% (-- 96.792497%)
r08: 0.906433% (-- 97.698930%)
r09: 0.679693% (-- 98.378623%)
r10: 0.656446% (-- 99.035069%)
r12: 0.453668% (-- 99.488737%)
r14: 0.248909% (-- 99.737646%)
r11: 0.171066% (-- 99.908713%)
r15: 0.091287% (-- 100.000000%)

First, the % of instructions counts all instructions, even ones that do not use registers. However, this is still a good measure for time spent in the different modes, so I'll be using a weighted multiplication of the two values to derive the weighted average, which I think is the most pertinent value to get out of this.

With 6 registers (roughly 40% working set minus r15)
ARM: 65.6%
Thumb: 89.5%
Unweighted average: 77.6%
Weighted average: 80%

With 7 registers: (roughly 46.7% working set minus r15)
ARM: 72.4%
Thumb 92.7%
Unweighted average: 82.6%
Weighted average: 84.8%

With 8 registers: (roughly53.3% working set minus r15)
ARM: 77.8%
Thumb: 95.4%
Unweighted average: 86.7%
Weighted average: 88.6%

So, when using only 53.3% of the registers a good 88.6% of register accesses may be mapped. This is a pretty good chunk, and is much better than the current amount which are mapped, 0%. In fact, rather than going from 7 to 8 registers it might potentially be better to use that 8th register for something else, like caching a PC base (like I do in the PSP version). Even though the PC is a constant, computing it takes a few instructions since it's a very large one.

This register usage information would of course prove useful to any other dynarec that needs it. Even the PC version might use it to free up some registers for other possibly more useful things like pointers.

Oh, btw, I thought I'd mention that doing the GP2X version helped me fix FF6 (and possibly others, it also fixes PocketNES) on the other platforms. I expected this to be a ton of work but I got it for free. Looks like working on this version was good for PSP users too.

Thursday, March 22, 2007

Status update

Finally seeing something on my GP2X, so I'll report it here.

I was originally going to tweak zodttd's arm_emit.h (the file that contains a bulk of the backend for the dynarec), but some parts were a bit obfuscated beyond what I wanted to work with, making it hard to reach some of the more critical areas I wanted to. Don't get me wrong, zodttd did a good job getting a working dynarec off the ground, but it was a bit different than what I was ready to work with.

So, since it's somewhat straightforward to do so, I decided to start over on this file from scratch. Really it works out as being a port of the x86 version, x86_emit.h. I wrote this version (the recompiler for x86) originally, and converted it to the PSP version as a base to start from.

This version is basically partially a "threaded interpreter", partially a true dynarec. The way threaded interpretation works is that instructions are compiled to calls to C functions; ideally you would then only have to emit actual machine code for loads to memory (for loading register operands), loading immediates into registers, calling functions, and storing to memory (for storing register results). However ARM has a lot of sophisticated functionality that means you'd have to either have a ton of functions to call or several functions called for some instructions to make things work. For that reason I have some things in functions, and some things inlined in assembly code, as follows:

Inlined portions: Barrel shifting, multiply/multiply accumulate instructions, direct branches, memory address calculation
Function calls: ALU operations, memory loads/stores, indirect branches

The inlined portions were originally expressed in a simple assembly language that's mostly a subset of x86, MIPS, and ARM (so it has two address operations, only a few registers, the ability to load/store registers from memory, etc). This meant that when porting the recompiler to other platforms away from x86, all I had to was rewrite the macros to generate the instructions of this subset language as instructions in the native language targeted.

There is however one thing that the subset language can do which the others can't, and that is to load large immediates. Actually, most of the time these are for compiling ARM ALU immediates, and since we're compiling to ARM this time I can retain the information. But I had to modify these primitives to take things in the ARM format (immediate + rotation) instead of just an immediate. That fortunately worked out to be very simple.

The eventual goal is to have the ALU operations inlined. Memory load/stores and indirect branches tend to be complex operations and for that reason don't get inlined, although there is some potential for inlining some memory operations down the road, depending on how things work out there.

So, right now I basically have the ARM translation kinda working; some instructions don't work yet. Thumb doesn't work at all yet (hopefully it'll be easy to resolve). I've just now started seeing results because I've been fighting with the basic GP2X development process this entire time.

This is basically a timeline of what happened:

Sunday night: Got zodttd's latest source building with DevkitGP2X
Monday: Poked around and decided I'd rewrite arm_emit.h. Didn't work on it much this day because I was very tired.
Tuesday: Mostly rewrote arm_emit.h (~1800 lines covered).
Wednesday: Finished rewriting arm_emit.h and modified arm_asm_stub.S so that things would compile. Unfortunately, here is where everything fell apart. After several instances of my binary mysteriously not updating, suddenly all my builds would segfault. I was trying to communicate with the GP2X more effectively but to no avail. And my internet connection was being lousy, so I went to sleep in a total haze, exhausted and going nowhere. Didn't get a lot done again because I was very tired >_>
Thursday: Woke up and managed to get an SSH connection to the GP2X (thanks to help from zodttd the night before). Found out that anything the compiler generated would segfault now; spent a lot of time trying to find the reason for this and got nowhere. Reinstalled the toolchain, got nowhere. Tried reinstalling another distribution of the toolchain, got nowhere. At this point I'm wondering if I've completely lost my mind. Finally decided to try installing Open2x for Cygwin like zodttd originally suggested and it magically works. With SSH working I was able to debug it for real and finally got to the point I'm at now after a few ASM dumps and actually clearing the cache in the right places (btw, some documentation on the cache syscalls would be great).

So that puts me in pretty good shape for the rest of the week. This is what I'd like to do now, roughly in this order:

- Fix the remaining ARM instructions.
- Fix Thumb (it might be just going into it that's breaking for whatever reason) and fix whatever Thumb instructions end up not working.
- Inline non flag-altering ALU instructions. This will put me to roughly where zodttd's is, although currently I do a few things better than his (immediates loaded in ARM form, branches linked directly w/o loading the PC)
- Inline flag-altering ALU instructions, caching flag results in single register.
- Write analysis backend for interpreter and write register analysis to determine most frequently used registers in ARM/Thumb mode. Will be worthwhile to put memory analysis here as well and other things perhaps..
- Run some tests on games for a few minutes each to gather analysis statistics.
- Perform register caching using whatever available registers I have and the statistics for most frequently accessed registers per mode. Being done statically this shouldn't be too bad to implement.
- Refactor mips_stub.S using C preprocessor macros. It's currently ~3500 lines and a mess to alter/maintain and especially port, so I hope to improve upon these things (here's hoping I can cut 1000+ lines from it).
- Port the memory handlers in mips_stub.S to ARM to replace the current ones in arm_asm_stub.S. This is both for performance and accuracy reasons (you'd be surprised what handling strange illegal memory accesses can fix in games)
- Gather motley crew of beta testers to hammer the thing.

That's the rough plan for now. I'm hoping that this won't take too long (well, I only have about 9 days as allotted by zodttd, maybe he'll give me an extension if I need it and have shown good progress though). It's my desire that this will give an improvement in performance that's reasonably noticeable, but unfortunately I can't promise that.. funny thing about results being nothing like you expected.

After I have these things out of the way I hope to go back to working on the core issues in GP2X, and other base optimizations to help all of the versions.

Course, I'd like to thank zodttd for working on things and giving me a hand, and also Sergey Chaban for the emitters from Vincent, which we're using to emit ARM code.

Let's see if tomorrow is productive at all..

Sunday, March 18, 2007

Finally...

I finally got around to getting zodttd's gpSP source compiling for GP2X, and have ran it on mine as well (via NAND - will setup USB server sometime soon). Tried Castlevania: Aria of Sorrow; it runs fullspeed with frameskip 2 at 275MHz. Not bad, although there's certainly room for improvement. I asked zodttd to let me have a couple weeks of exclusive access to the code so I can optimize it some. We'll see how that goes.

I still need an AC adapter though, these batteries will only last so long. I don't suppose anyone would like to donate one to me? :x

General roadmap

Now that I've got some GP2X related posts out of the way I should go over some of the more overarching concerns and plans I have regarding gpSP in general (in other words, the PSP version). The next version would be a 1.0, which by its name should mean that it's rather "complete." Of course, gpSP will never be complete, and I'm sure even version 1.0 will have its share of obvious flaws, but I'd like to push it closer than it is now.

There are a few main areas where improvement can be done. These are pretty obvious, but I'll go over them anyway.

- Compatability. A few games are not working correctly, ranging from minor to severe problems. Right now the biggest culprit is Final Fantasy VI, but there are others, such as Sims: Pets. Games such as Golden Sun 2 have buggy sound. Some other games I need to test more thoroughly to see if they have problems on the PSP version in particular.

Broken games fall into a few categories. Some work only on the PC version and not the PSP version. Some work only on the interpreter and the dynarec. There are a few that only work on PSP and only work on the dynarec; these I'm not terribly concerned about. As of now all games that I'm aware of that are critically broken don't work anywhere (interpreter or dynarec). This doesn't mean there aren't dynarec bugs left for the PC/PSP versions (which have been synchronized), but I don't know of any at the moment. Unfortunately, the games that don't work for any configuration of gpSP are far the hardest to deal with.

Debugging these games at this point is extremely time consuming and trying. I have put hours into FF6 in particular but have gotten nowhere. Often it's difficult to even know where to begin when trying to find what's wrong with something. The best aid I've had is using the debuggers in other emulators where the game works (such as VBA) to try to track where the emulation diverges from gpSP's. The big problem with this method is that the timing in the two emulators are never going to be the same, even with coercion to try to make it so, so the games will always harmlessly diverge at some point. For this reason a more high level approach has to be taken to examining where the two are different. This is usually feasible when the game has a very obvious problem (such as crashes at a certain point) but can be very difficult when the problems are more subtle than that.

- Performance. I'll cover this more in a later post; the bottom line is that there is still room to make gpSP faster, especially for certain games that don't get along with it well right now.

- Features. The cheat support needs to be improved, I'd like for there to be a low pass filter option for audio (I tried this, and failed..). Of course, everyone wants wifi multiplayer but I don't know what the prospects are for such a thing. That, and I don't have a second PSP or a wireless router.

I'll divulge more details about some of these things later, and as I make any improvements I'll post that here as well.

Why not use virtualization instead of recompilation for GP2X?

This question comes up a lot on the gp32x boards. The idea is that since GP2X has a processor which is binary compatible with GBA and has an MMU, why not just run GBA code "directly" on the GP2X? Unfortunately, things are not that simple. There are a few problems with this:

- The kernel has to be hacked to get a purely custom address space setup. This is doable but tricky.

- The VMM (virtual machine manager, ie, the rest of the emulator) needs to run somewhere. It can run in a different address space, but this would make the transitions (which are often) costly. If it runs in the GBA's address space it could conflict with the way the GBA works, although the real GBA only uses a portion of the physical address space so this probably wouldn't be an issue.

- The emulated code must always be ran in user mode, otherwise it can trample over the emulator itself. Unfortunately, it's not possible to "trap" instructions which can read the state of the CPU, so when the GBA code reads the flags it will get incorrect results and probably fail.

- The same holds for changing the machine state such as disabling interrupts.

- GBA has a lot of components which are time sensitive, such as 4 timers and the screen. Both are very high resolution and cannot be adequetly emulated by the GP2X's timers which run at a different clock rate. For this reason instructions need to be ran within certain intervals to properly serialize them within events, but this cannot be done in hardware with virtualization.

For these reasons I don't forsee virtualization occuring. Commercial grade virtualization emulators like VMWare have until recently only virtualized "user code", while performing recompilation on kernel code. This has only changed due to added hardware support for virtualization in x86 CPUs, support which the ARM9 CPUs in GP2X does not have (as far as I'm aware). It might be possible to pull the same trick on GBA, but even in user mode code the CPU is free to change to other modes at will (and almost always does, for instance, at startup, but can in other places) and there are in fact few restrictions present. The timing issues would still remain as well; for x86 software timing requirements are usually much looser, but on GBA the code must be serialized somewhat properly to within the scanline level.

gpSP for GP2X

These days gpSP is becoming less and less tied to PSP specifically, in particular thanks to the efforts of zodttd. He originally ported gpSP to the GP2X some months ago, but it was a pretty straight forward port (mostly based on the framework for the PC version, which is fairly portable via SDL) only utilizing the portable interpreter for CPU emulation. Since then I had intended to work on a dynamic recompiler to bring performance up to speed, but real life got in the way (plus I haven't had a very working GP2X setup). Fortunately, zodttd managed to pull one off and things are looking pretty good these days. The first public version was made available a few days ago, and I hear that it runs a number of games very well. Of course, there's still room for improvement.

I haven't actually been able to test it yet, but I hope to have an Open2X setup for cygwin as soon as possible. I'll also need a means of powering the thing, either with an AC adapter (preferred, but far too overpriced), or battery recharging. Then, once I'm able to contact zodttd (something I should have done yesterday but managed to collapse and not get back up for a while..) we can go over means of cleaning it up and improving it. There are a few things I've been thinking about for dynarec improvements in general that should apply to both the GP2X and the PSP versions (and others if they surface).

Right now the dynarec will benefit the most from two things: native flag calculation and register allocation.

A quick crash course on the ARM CPU, the CPU that both the GBA and the GP2X have: there are 15 "general purpose" registers where temporary values may be freely loaded and stored. Arithmetic/logical operations may optionally set "flags" which reflect the results; this can be used to conditionally execute instructions. Indirectly speaking, the flags are usually used to test if a variable hit zero, or if one is greater/less than another, and so on.

Because the GBA and GP2X have the same CPU the flag generation behaves the same way, and you should be able to use the GP2X's CPU to generate the flag results in hardware, and also keep them stored in its flags register. There are two caveats to this:

- Sometimes you'll need to perform calculations that will modify flags that are not part of the GBA code but are necessary for emulating it, for instance emulating memory or calling external functions (IE, to update the screen) that will probably modify flags. When these cases occur the flags should be saved then restored to a temporary register or on the stack.

- The flags register actually holds other information for determining the current state of the CPU (if it's in ARM or Thumb mode, if interrupts are enabled, what mode the CPU is in, etc). Consequently, the state of the CPU that the GP2X is in will often differ from the state the GBA code should see. What this amounts to is reconstructing the flags register from the native GP2X one and an emualted one whenever it is read. Likewise, writes to this register must be handled with care (but that's a given). This is a minor issue, but ones like this make complete "virtualization" impossible; I'll talk more about this later.

Right now the dynarec is calling C functions to perform operations involving flags, which is fairly costly. This is especially so for "Thumb" code which performs flags modifications often. On GBA Thumb code is being executed a large percentage of the time (varies depending on the game, is typically 50% or more). In the future flags modifying or utilizing operations will instead be inlined to single ARM instructions.

The next big win will be in register allocation. When emulating a platform where you have significantly less CPU registers than the machine you're recompiling to you can use a complete allocation where every register on the emulated machine maps to a register on the native machine. For the PSP version of gpSP, this is exactly what happens. Unfortunately, in the case of GP2X you have exactly the same number of registers. We cannot perform a complete mapping because we need a few additional registers to perform the emulation, and retain the internal state of the GP2X.

There are two approaches that can be taken at this point: dynamic or static register allocation. Dynamic allocation will load registers into memory as necessary, then keep them in registers until another load forces them out, using a replacement strategy such as "least recently used." This is identical to how dynamic caches such as CPU cache and virtual memory work. Compilers will also use a register allocation scheme such as this, but with more effective (and considerably more computationally expensive) algorithms for allocating registers. This is the approach typically taken by recompilers in emulators, especially when there are far fewer registers on the machine you're emulating on, as is the case with MIPS->x86 (as seen in PS1/PS2 and N64 emulators for PC).

However, dynamic allocation isn't easy to use with a recompiler such as gpSP's because translated code blocks are "self re-entrant", meaning that they can branch to within themselves. This runs into a problem which applies to various areas of recompilation; anywhere you have multiple entry points in a code block, you have multiple conflicting states and can no longer carry on prior state. This can be rectified by flushing the registers back to memory at this point, but this adds a lot of overhead, particularly in code size.

Instead, I think it's worthwhile to opt for a global static allocation scheme. Here, a fixed allocation is chosen for the entire execution of the game; that means that certain GBA registers will always be in GP2X registers, and the others will always be in memory. This provides less efficient allocation so registers will probably be in memory more often, however, it means that registers do not need to be "spilled" back to memory inbetween blocks. It also sidesteps the above problems. Since we should be able to store about 67% or more of the GBA's registers in GP2X registers this should result in instruction operands being in GP2X registers at least 67% time. Fortunately registers do not tend to be used uniformly; those top 67% registers may in fact be used more often than the bottom 33%.

To figure out which registers to use some profiling on GBA games will have to be undertaken; this should be done by modifying the PC version's interpreter and running some games to generate percentage usage of registers.

Blog for gpSP development up and running

So, I decided I'd start up a blog for gpSP development much like StrmnNrmn's blog for Daedalus development. I haven't liked talking publically about development much in the past, but since gpSP has matured a lot and people are more content with it (and hence nagging me a lot less for releases) I figured I'd post things here so I could keep everyone informed and possibly get some feedback/ideas from coders.

I haven't actually had the opportunity to work on anything very much since gpSP 0.9 has been relaesed, due to a lot of work and real life stuff going on, but I hope to be able to pick up again now. The next few posts will start revealing some of the underlying ideas I have for the upcoming version(s).

Feel free to post comments, but try to keep them on topic.