Tuesday, March 27, 2007

Register usage statistics

First, let me mention that I've reached the register analysis/caching step in my todo list for the dynarec. So I currently have almost all operations inlined and flags calculated natively.

As I mentioned previously, I will be performing a partial global static register allocation now. This is the simplest possible method of register allocation, where I have N registers available to me in the GP2X's register file and I want to map them to N registers on the GBA. Every time the GBA uses a register that is not one of those N it will have to be loaded from memory first.

Potentially I can free up about 10 registers for allocation. Right now I have 5 free: r3 through r7. Once I have converted some more functions from C to ASM I will be able to move the cycle count register from r8 to r12, and the saved register r9 won't really be needed. This will bring the total up to 7 registers. With a little more work a2 can be removed, and/or sp can be removed. This brings the total up to 8-9.

The reason why I chose global static register alloaction is because I believed that the register usage patterns would be less than uniformly distributed; a uniform distribution that every register is used 6.25% of the time (counting r15, which I would not cache because it is a constant). This means that even though this is the simplest scheme, it may be sufficient to cover most registers.

For the following test I ran these games for a bit each on the PC version. The interpreter w/register analysis was used.

FF: Dawn of Souls (FF2)
Golden Sun
Castlevania: Harmony of Dissonance
FF6 Advance
FFT Advance
Kirby: Nightmare in Dreamland
Mario & Luigi

I tried to pick games w/o idle loops because that would skew the results (since they're not eliminated in the interpreter). As far as I'm aware the only one with them is Kirby, and I ran that for the least amount of time. Even though most of these games are RPGs, their register usage pattern shouldn't vary from any other genre, as far as I expect (although I could be entirely wrong, I doubt it).

There are a few flaws in the analysis: register usage is not counted for the add operand in mla instructions, the PC is not counted sometimes in Thumb (but this shouldn't matter because I wouldn't pick PC anyway). I also count usage in the most general terms, and don't split it up into types of usage, although I tried to tag the usage types with more information in case I want to later. The results are therefore possibly a little crude, but should still give a mostly accurate picture of register usage.

ARM register usage (38.775138% ARM instructions):
r00: 18.263814% (-- 18.263814%)
r12: 11.531477% (-- 29.795291%)
r09: 11.500162% (-- 41.295453%)
r14: 9.063440% (-- 50.358893%)
r06: 7.837682% (-- 58.196574%)
r01: 7.401049% (-- 65.597623%)
r07: 6.778340% (-- 72.375963%)
r05: 5.445009% (-- 77.820973%)
r02: 5.427288% (-- 83.248260%)
r03: 5.293743% (-- 88.542003%)
r04: 3.601103% (-- 92.143106%)
r11: 3.207311% (-- 95.350417%)
r10: 2.334864% (-- 97.685281%)
r08: 1.708207% (-- 99.393488%)
r15: 0.311270% (-- 99.704757%)
r13: 0.295243% (-- 100.000000%)

Thumb register usage (61.224862% Thumb instructions):
r00: 34.788858% (-- 34.788858%)
r01: 26.564083% (-- 61.352941%)
r03: 10.983500% (-- 72.336441%)
r02: 8.303127% (-- 80.639567%)
r04: 4.900381% (-- 85.539948%)
r05: 3.941292% (-- 89.481240%)
r06: 3.257582% (-- 92.738822%)
r07: 2.644851% (-- 95.383673%)
r13: 1.408824% (-- 96.792497%)
r08: 0.906433% (-- 97.698930%)
r09: 0.679693% (-- 98.378623%)
r10: 0.656446% (-- 99.035069%)
r12: 0.453668% (-- 99.488737%)
r14: 0.248909% (-- 99.737646%)
r11: 0.171066% (-- 99.908713%)
r15: 0.091287% (-- 100.000000%)

First, the % of instructions counts all instructions, even ones that do not use registers. However, this is still a good measure for time spent in the different modes, so I'll be using a weighted multiplication of the two values to derive the weighted average, which I think is the most pertinent value to get out of this.

With 6 registers (roughly 40% working set minus r15)
ARM: 65.6%
Thumb: 89.5%
Unweighted average: 77.6%
Weighted average: 80%

With 7 registers: (roughly 46.7% working set minus r15)
ARM: 72.4%
Thumb 92.7%
Unweighted average: 82.6%
Weighted average: 84.8%

With 8 registers: (roughly53.3% working set minus r15)
ARM: 77.8%
Thumb: 95.4%
Unweighted average: 86.7%
Weighted average: 88.6%

So, when using only 53.3% of the registers a good 88.6% of register accesses may be mapped. This is a pretty good chunk, and is much better than the current amount which are mapped, 0%. In fact, rather than going from 7 to 8 registers it might potentially be better to use that 8th register for something else, like caching a PC base (like I do in the PSP version). Even though the PC is a constant, computing it takes a few instructions since it's a very large one.

This register usage information would of course prove useful to any other dynarec that needs it. Even the PC version might use it to free up some registers for other possibly more useful things like pointers.

Oh, btw, I thought I'd mention that doing the GP2X version helped me fix FF6 (and possibly others, it also fixes PocketNES) on the other platforms. I expected this to be a ton of work but I got it for free. Looks like working on this version was good for PSP users too.

Thursday, March 22, 2007

Status update

Finally seeing something on my GP2X, so I'll report it here.

I was originally going to tweak zodttd's arm_emit.h (the file that contains a bulk of the backend for the dynarec), but some parts were a bit obfuscated beyond what I wanted to work with, making it hard to reach some of the more critical areas I wanted to. Don't get me wrong, zodttd did a good job getting a working dynarec off the ground, but it was a bit different than what I was ready to work with.

So, since it's somewhat straightforward to do so, I decided to start over on this file from scratch. Really it works out as being a port of the x86 version, x86_emit.h. I wrote this version (the recompiler for x86) originally, and converted it to the PSP version as a base to start from.

This version is basically partially a "threaded interpreter", partially a true dynarec. The way threaded interpretation works is that instructions are compiled to calls to C functions; ideally you would then only have to emit actual machine code for loads to memory (for loading register operands), loading immediates into registers, calling functions, and storing to memory (for storing register results). However ARM has a lot of sophisticated functionality that means you'd have to either have a ton of functions to call or several functions called for some instructions to make things work. For that reason I have some things in functions, and some things inlined in assembly code, as follows:

Inlined portions: Barrel shifting, multiply/multiply accumulate instructions, direct branches, memory address calculation
Function calls: ALU operations, memory loads/stores, indirect branches

The inlined portions were originally expressed in a simple assembly language that's mostly a subset of x86, MIPS, and ARM (so it has two address operations, only a few registers, the ability to load/store registers from memory, etc). This meant that when porting the recompiler to other platforms away from x86, all I had to was rewrite the macros to generate the instructions of this subset language as instructions in the native language targeted.

There is however one thing that the subset language can do which the others can't, and that is to load large immediates. Actually, most of the time these are for compiling ARM ALU immediates, and since we're compiling to ARM this time I can retain the information. But I had to modify these primitives to take things in the ARM format (immediate + rotation) instead of just an immediate. That fortunately worked out to be very simple.

The eventual goal is to have the ALU operations inlined. Memory load/stores and indirect branches tend to be complex operations and for that reason don't get inlined, although there is some potential for inlining some memory operations down the road, depending on how things work out there.

So, right now I basically have the ARM translation kinda working; some instructions don't work yet. Thumb doesn't work at all yet (hopefully it'll be easy to resolve). I've just now started seeing results because I've been fighting with the basic GP2X development process this entire time.

This is basically a timeline of what happened:

Sunday night: Got zodttd's latest source building with DevkitGP2X
Monday: Poked around and decided I'd rewrite arm_emit.h. Didn't work on it much this day because I was very tired.
Tuesday: Mostly rewrote arm_emit.h (~1800 lines covered).
Wednesday: Finished rewriting arm_emit.h and modified arm_asm_stub.S so that things would compile. Unfortunately, here is where everything fell apart. After several instances of my binary mysteriously not updating, suddenly all my builds would segfault. I was trying to communicate with the GP2X more effectively but to no avail. And my internet connection was being lousy, so I went to sleep in a total haze, exhausted and going nowhere. Didn't get a lot done again because I was very tired >_>
Thursday: Woke up and managed to get an SSH connection to the GP2X (thanks to help from zodttd the night before). Found out that anything the compiler generated would segfault now; spent a lot of time trying to find the reason for this and got nowhere. Reinstalled the toolchain, got nowhere. Tried reinstalling another distribution of the toolchain, got nowhere. At this point I'm wondering if I've completely lost my mind. Finally decided to try installing Open2x for Cygwin like zodttd originally suggested and it magically works. With SSH working I was able to debug it for real and finally got to the point I'm at now after a few ASM dumps and actually clearing the cache in the right places (btw, some documentation on the cache syscalls would be great).

So that puts me in pretty good shape for the rest of the week. This is what I'd like to do now, roughly in this order:

- Fix the remaining ARM instructions.
- Fix Thumb (it might be just going into it that's breaking for whatever reason) and fix whatever Thumb instructions end up not working.
- Inline non flag-altering ALU instructions. This will put me to roughly where zodttd's is, although currently I do a few things better than his (immediates loaded in ARM form, branches linked directly w/o loading the PC)
- Inline flag-altering ALU instructions, caching flag results in single register.
- Write analysis backend for interpreter and write register analysis to determine most frequently used registers in ARM/Thumb mode. Will be worthwhile to put memory analysis here as well and other things perhaps..
- Run some tests on games for a few minutes each to gather analysis statistics.
- Perform register caching using whatever available registers I have and the statistics for most frequently accessed registers per mode. Being done statically this shouldn't be too bad to implement.
- Refactor mips_stub.S using C preprocessor macros. It's currently ~3500 lines and a mess to alter/maintain and especially port, so I hope to improve upon these things (here's hoping I can cut 1000+ lines from it).
- Port the memory handlers in mips_stub.S to ARM to replace the current ones in arm_asm_stub.S. This is both for performance and accuracy reasons (you'd be surprised what handling strange illegal memory accesses can fix in games)
- Gather motley crew of beta testers to hammer the thing.

That's the rough plan for now. I'm hoping that this won't take too long (well, I only have about 9 days as allotted by zodttd, maybe he'll give me an extension if I need it and have shown good progress though). It's my desire that this will give an improvement in performance that's reasonably noticeable, but unfortunately I can't promise that.. funny thing about results being nothing like you expected.

After I have these things out of the way I hope to go back to working on the core issues in GP2X, and other base optimizations to help all of the versions.

Course, I'd like to thank zodttd for working on things and giving me a hand, and also Sergey Chaban for the emitters from Vincent, which we're using to emit ARM code.

Let's see if tomorrow is productive at all..

Sunday, March 18, 2007

Finally...

I finally got around to getting zodttd's gpSP source compiling for GP2X, and have ran it on mine as well (via NAND - will setup USB server sometime soon). Tried Castlevania: Aria of Sorrow; it runs fullspeed with frameskip 2 at 275MHz. Not bad, although there's certainly room for improvement. I asked zodttd to let me have a couple weeks of exclusive access to the code so I can optimize it some. We'll see how that goes.

I still need an AC adapter though, these batteries will only last so long. I don't suppose anyone would like to donate one to me? :x

General roadmap

Now that I've got some GP2X related posts out of the way I should go over some of the more overarching concerns and plans I have regarding gpSP in general (in other words, the PSP version). The next version would be a 1.0, which by its name should mean that it's rather "complete." Of course, gpSP will never be complete, and I'm sure even version 1.0 will have its share of obvious flaws, but I'd like to push it closer than it is now.

There are a few main areas where improvement can be done. These are pretty obvious, but I'll go over them anyway.

- Compatability. A few games are not working correctly, ranging from minor to severe problems. Right now the biggest culprit is Final Fantasy VI, but there are others, such as Sims: Pets. Games such as Golden Sun 2 have buggy sound. Some other games I need to test more thoroughly to see if they have problems on the PSP version in particular.

Broken games fall into a few categories. Some work only on the PC version and not the PSP version. Some work only on the interpreter and the dynarec. There are a few that only work on PSP and only work on the dynarec; these I'm not terribly concerned about. As of now all games that I'm aware of that are critically broken don't work anywhere (interpreter or dynarec). This doesn't mean there aren't dynarec bugs left for the PC/PSP versions (which have been synchronized), but I don't know of any at the moment. Unfortunately, the games that don't work for any configuration of gpSP are far the hardest to deal with.

Debugging these games at this point is extremely time consuming and trying. I have put hours into FF6 in particular but have gotten nowhere. Often it's difficult to even know where to begin when trying to find what's wrong with something. The best aid I've had is using the debuggers in other emulators where the game works (such as VBA) to try to track where the emulation diverges from gpSP's. The big problem with this method is that the timing in the two emulators are never going to be the same, even with coercion to try to make it so, so the games will always harmlessly diverge at some point. For this reason a more high level approach has to be taken to examining where the two are different. This is usually feasible when the game has a very obvious problem (such as crashes at a certain point) but can be very difficult when the problems are more subtle than that.

- Performance. I'll cover this more in a later post; the bottom line is that there is still room to make gpSP faster, especially for certain games that don't get along with it well right now.

- Features. The cheat support needs to be improved, I'd like for there to be a low pass filter option for audio (I tried this, and failed..). Of course, everyone wants wifi multiplayer but I don't know what the prospects are for such a thing. That, and I don't have a second PSP or a wireless router.

I'll divulge more details about some of these things later, and as I make any improvements I'll post that here as well.

Why not use virtualization instead of recompilation for GP2X?

This question comes up a lot on the gp32x boards. The idea is that since GP2X has a processor which is binary compatible with GBA and has an MMU, why not just run GBA code "directly" on the GP2X? Unfortunately, things are not that simple. There are a few problems with this:

- The kernel has to be hacked to get a purely custom address space setup. This is doable but tricky.

- The VMM (virtual machine manager, ie, the rest of the emulator) needs to run somewhere. It can run in a different address space, but this would make the transitions (which are often) costly. If it runs in the GBA's address space it could conflict with the way the GBA works, although the real GBA only uses a portion of the physical address space so this probably wouldn't be an issue.

- The emulated code must always be ran in user mode, otherwise it can trample over the emulator itself. Unfortunately, it's not possible to "trap" instructions which can read the state of the CPU, so when the GBA code reads the flags it will get incorrect results and probably fail.

- The same holds for changing the machine state such as disabling interrupts.

- GBA has a lot of components which are time sensitive, such as 4 timers and the screen. Both are very high resolution and cannot be adequetly emulated by the GP2X's timers which run at a different clock rate. For this reason instructions need to be ran within certain intervals to properly serialize them within events, but this cannot be done in hardware with virtualization.

For these reasons I don't forsee virtualization occuring. Commercial grade virtualization emulators like VMWare have until recently only virtualized "user code", while performing recompilation on kernel code. This has only changed due to added hardware support for virtualization in x86 CPUs, support which the ARM9 CPUs in GP2X does not have (as far as I'm aware). It might be possible to pull the same trick on GBA, but even in user mode code the CPU is free to change to other modes at will (and almost always does, for instance, at startup, but can in other places) and there are in fact few restrictions present. The timing issues would still remain as well; for x86 software timing requirements are usually much looser, but on GBA the code must be serialized somewhat properly to within the scanline level.

gpSP for GP2X

These days gpSP is becoming less and less tied to PSP specifically, in particular thanks to the efforts of zodttd. He originally ported gpSP to the GP2X some months ago, but it was a pretty straight forward port (mostly based on the framework for the PC version, which is fairly portable via SDL) only utilizing the portable interpreter for CPU emulation. Since then I had intended to work on a dynamic recompiler to bring performance up to speed, but real life got in the way (plus I haven't had a very working GP2X setup). Fortunately, zodttd managed to pull one off and things are looking pretty good these days. The first public version was made available a few days ago, and I hear that it runs a number of games very well. Of course, there's still room for improvement.

I haven't actually been able to test it yet, but I hope to have an Open2X setup for cygwin as soon as possible. I'll also need a means of powering the thing, either with an AC adapter (preferred, but far too overpriced), or battery recharging. Then, once I'm able to contact zodttd (something I should have done yesterday but managed to collapse and not get back up for a while..) we can go over means of cleaning it up and improving it. There are a few things I've been thinking about for dynarec improvements in general that should apply to both the GP2X and the PSP versions (and others if they surface).

Right now the dynarec will benefit the most from two things: native flag calculation and register allocation.

A quick crash course on the ARM CPU, the CPU that both the GBA and the GP2X have: there are 15 "general purpose" registers where temporary values may be freely loaded and stored. Arithmetic/logical operations may optionally set "flags" which reflect the results; this can be used to conditionally execute instructions. Indirectly speaking, the flags are usually used to test if a variable hit zero, or if one is greater/less than another, and so on.

Because the GBA and GP2X have the same CPU the flag generation behaves the same way, and you should be able to use the GP2X's CPU to generate the flag results in hardware, and also keep them stored in its flags register. There are two caveats to this:

- Sometimes you'll need to perform calculations that will modify flags that are not part of the GBA code but are necessary for emulating it, for instance emulating memory or calling external functions (IE, to update the screen) that will probably modify flags. When these cases occur the flags should be saved then restored to a temporary register or on the stack.

- The flags register actually holds other information for determining the current state of the CPU (if it's in ARM or Thumb mode, if interrupts are enabled, what mode the CPU is in, etc). Consequently, the state of the CPU that the GP2X is in will often differ from the state the GBA code should see. What this amounts to is reconstructing the flags register from the native GP2X one and an emualted one whenever it is read. Likewise, writes to this register must be handled with care (but that's a given). This is a minor issue, but ones like this make complete "virtualization" impossible; I'll talk more about this later.

Right now the dynarec is calling C functions to perform operations involving flags, which is fairly costly. This is especially so for "Thumb" code which performs flags modifications often. On GBA Thumb code is being executed a large percentage of the time (varies depending on the game, is typically 50% or more). In the future flags modifying or utilizing operations will instead be inlined to single ARM instructions.

The next big win will be in register allocation. When emulating a platform where you have significantly less CPU registers than the machine you're recompiling to you can use a complete allocation where every register on the emulated machine maps to a register on the native machine. For the PSP version of gpSP, this is exactly what happens. Unfortunately, in the case of GP2X you have exactly the same number of registers. We cannot perform a complete mapping because we need a few additional registers to perform the emulation, and retain the internal state of the GP2X.

There are two approaches that can be taken at this point: dynamic or static register allocation. Dynamic allocation will load registers into memory as necessary, then keep them in registers until another load forces them out, using a replacement strategy such as "least recently used." This is identical to how dynamic caches such as CPU cache and virtual memory work. Compilers will also use a register allocation scheme such as this, but with more effective (and considerably more computationally expensive) algorithms for allocating registers. This is the approach typically taken by recompilers in emulators, especially when there are far fewer registers on the machine you're emulating on, as is the case with MIPS->x86 (as seen in PS1/PS2 and N64 emulators for PC).

However, dynamic allocation isn't easy to use with a recompiler such as gpSP's because translated code blocks are "self re-entrant", meaning that they can branch to within themselves. This runs into a problem which applies to various areas of recompilation; anywhere you have multiple entry points in a code block, you have multiple conflicting states and can no longer carry on prior state. This can be rectified by flushing the registers back to memory at this point, but this adds a lot of overhead, particularly in code size.

Instead, I think it's worthwhile to opt for a global static allocation scheme. Here, a fixed allocation is chosen for the entire execution of the game; that means that certain GBA registers will always be in GP2X registers, and the others will always be in memory. This provides less efficient allocation so registers will probably be in memory more often, however, it means that registers do not need to be "spilled" back to memory inbetween blocks. It also sidesteps the above problems. Since we should be able to store about 67% or more of the GBA's registers in GP2X registers this should result in instruction operands being in GP2X registers at least 67% time. Fortunately registers do not tend to be used uniformly; those top 67% registers may in fact be used more often than the bottom 33%.

To figure out which registers to use some profiling on GBA games will have to be undertaken; this should be done by modifying the PC version's interpreter and running some games to generate percentage usage of registers.

Blog for gpSP development up and running

So, I decided I'd start up a blog for gpSP development much like StrmnNrmn's blog for Daedalus development. I haven't liked talking publically about development much in the past, but since gpSP has matured a lot and people are more content with it (and hence nagging me a lot less for releases) I figured I'd post things here so I could keep everyone informed and possibly get some feedback/ideas from coders.

I haven't actually had the opportunity to work on anything very much since gpSP 0.9 has been relaesed, due to a lot of work and real life stuff going on, but I hope to be able to pick up again now. The next few posts will start revealing some of the underlying ideas I have for the upcoming version(s).

Feel free to post comments, but try to keep them on topic.