First, let me mention that I've reached the register analysis/caching step in my todo list for the dynarec. So I currently have almost all operations inlined and flags calculated natively.
As I mentioned previously, I will be performing a partial global static register allocation now. This is the simplest possible method of register allocation, where I have N registers available to me in the GP2X's register file and I want to map them to N registers on the GBA. Every time the GBA uses a register that is not one of those N it will have to be loaded from memory first.
Potentially I can free up about 10 registers for allocation. Right now I have 5 free: r3 through r7. Once I have converted some more functions from C to ASM I will be able to move the cycle count register from r8 to r12, and the saved register r9 won't really be needed. This will bring the total up to 7 registers. With a little more work a2 can be removed, and/or sp can be removed. This brings the total up to 8-9.
The reason why I chose global static register alloaction is because I believed that the register usage patterns would be less than uniformly distributed; a uniform distribution that every register is used 6.25% of the time (counting r15, which I would not cache because it is a constant). This means that even though this is the simplest scheme, it may be sufficient to cover most registers.
For the following test I ran these games for a bit each on the PC version. The interpreter w/register analysis was used.
FF: Dawn of Souls (FF2)
Golden Sun
Castlevania: Harmony of Dissonance
FF6 Advance
FFT Advance
Kirby: Nightmare in Dreamland
Mario & Luigi
I tried to pick games w/o idle loops because that would skew the results (since they're not eliminated in the interpreter). As far as I'm aware the only one with them is Kirby, and I ran that for the least amount of time. Even though most of these games are RPGs, their register usage pattern shouldn't vary from any other genre, as far as I expect (although I could be entirely wrong, I doubt it).
There are a few flaws in the analysis: register usage is not counted for the add operand in mla instructions, the PC is not counted sometimes in Thumb (but this shouldn't matter because I wouldn't pick PC anyway). I also count usage in the most general terms, and don't split it up into types of usage, although I tried to tag the usage types with more information in case I want to later. The results are therefore possibly a little crude, but should still give a mostly accurate picture of register usage.
ARM register usage (38.775138% ARM instructions):
r00: 18.263814% (-- 18.263814%)
r12: 11.531477% (-- 29.795291%)
r09: 11.500162% (-- 41.295453%)
r14: 9.063440% (-- 50.358893%)
r06: 7.837682% (-- 58.196574%)
r01: 7.401049% (-- 65.597623%)
r07: 6.778340% (-- 72.375963%)
r05: 5.445009% (-- 77.820973%)
r02: 5.427288% (-- 83.248260%)
r03: 5.293743% (-- 88.542003%)
r04: 3.601103% (-- 92.143106%)
r11: 3.207311% (-- 95.350417%)
r10: 2.334864% (-- 97.685281%)
r08: 1.708207% (-- 99.393488%)
r15: 0.311270% (-- 99.704757%)
r13: 0.295243% (-- 100.000000%)
Thumb register usage (61.224862% Thumb instructions):
r00: 34.788858% (-- 34.788858%)
r01: 26.564083% (-- 61.352941%)
r03: 10.983500% (-- 72.336441%)
r02: 8.303127% (-- 80.639567%)
r04: 4.900381% (-- 85.539948%)
r05: 3.941292% (-- 89.481240%)
r06: 3.257582% (-- 92.738822%)
r07: 2.644851% (-- 95.383673%)
r13: 1.408824% (-- 96.792497%)
r08: 0.906433% (-- 97.698930%)
r09: 0.679693% (-- 98.378623%)
r10: 0.656446% (-- 99.035069%)
r12: 0.453668% (-- 99.488737%)
r14: 0.248909% (-- 99.737646%)
r11: 0.171066% (-- 99.908713%)
r15: 0.091287% (-- 100.000000%)
First, the % of instructions counts all instructions, even ones that do not use registers. However, this is still a good measure for time spent in the different modes, so I'll be using a weighted multiplication of the two values to derive the weighted average, which I think is the most pertinent value to get out of this.
With 6 registers (roughly 40% working set minus r15)
ARM: 65.6%
Thumb: 89.5%
Unweighted average: 77.6%
Weighted average: 80%
With 7 registers: (roughly 46.7% working set minus r15)
ARM: 72.4%
Thumb 92.7%
Unweighted average: 82.6%
Weighted average: 84.8%
With 8 registers: (roughly53.3% working set minus r15)
ARM: 77.8%
Thumb: 95.4%
Unweighted average: 86.7%
Weighted average: 88.6%
So, when using only 53.3% of the registers a good 88.6% of register accesses may be mapped. This is a pretty good chunk, and is much better than the current amount which are mapped, 0%. In fact, rather than going from 7 to 8 registers it might potentially be better to use that 8th register for something else, like caching a PC base (like I do in the PSP version). Even though the PC is a constant, computing it takes a few instructions since it's a very large one.
This register usage information would of course prove useful to any other dynarec that needs it. Even the PC version might use it to free up some registers for other possibly more useful things like pointers.
Oh, btw, I thought I'd mention that doing the GP2X version helped me fix FF6 (and possibly others, it also fixes PocketNES) on the other platforms. I expected this to be a ton of work but I got it for free. Looks like working on this version was good for PSP users too.
Subscribe to:
Post Comments (Atom)
1 comment:
I won't pretend to understand entirely what you said. However, I seem to get the gist, and that most register usage can be mapped without mapping all the registers?
It's great that the GP2x dev helped you fix the FF6 bug :)
Post a Comment