Tuesday, April 3, 2007

ARM alpha blending pass

I rewrote the expand pass for alpha blending in ARM ASM, which should benefit the GP2X version and any future ARM versions (remember, there are a ton of prospective ARM devices out there). Even though my version is very straightforward and something that a compiler should have been able to generate it's still much better than what GCC generated with the optimization flags on now (I don't think that it could do any better with any others, but I don't know for sure).

The more actual blending going on onscreen the more of a speedup this version might give. Here are the numbers:

Castlevania: Aria of Sorrow
Full test : 6456 ms (21.522274 ms per frame)
No blending : 5379 ms (17.930580 ms per frame)
No video : 2241 ms (7.472070 ms per frame)
No CPU : 4392 ms (14.642917 ms per frame)
No CPU/video: 361 ms (1.204903 ms per frame)

CPU speed : 1880 ms (6.267167 ms per frame)
Video speed : 4215 ms (14.050203 ms per frame)
Alpha cost : 1077 ms (3.591693 ms per frame)

This one is the biggest winner. It has a ton of blending going on onscreen. The alpha cost has lowered by about 4.13ms over the C version - it's over twice as fast now.

Mario Kart:
Full test : 8843 ms (29.476669 ms per frame)
No blending : 8129 ms (27.098631 ms per frame)
No video : 4238 ms (14.129470 ms per frame)
No CPU : 3221 ms (10.737390 ms per frame)
No CPU/video: 505 ms (1.684263 ms per frame)

CPU speed : 3733 ms (12.445207 ms per frame)
Video speed : 4604 ms (15.347200 ms per frame)
Alpha cost : 713 ms (2.378040 ms per frame)

Here we see a smaller improvement, because the alpha cost wasn't that large to begin with. That's because alpha is only actually turned on for a small part of the screen. Last post I believed that brighten was used instead of alpha. This is what should have been used, but I think it's brightening by something a bit off-white. This effect is very subtle in-game, but if you turn it off (by making it choose the BOTTOM pixel, not the top) it becomes obvious it isn't there.

Still a win, and what's more, it shows that the difference between a small amount of blending and a large amount isn't that high.

FF6:
Full test : 11854 ms (39.515438 ms per frame)
No blending : 10175 ms (33.918694 ms per frame)
No video : 6133 ms (20.443438 ms per frame)
No CPU : 2643 ms (8.810610 ms per frame)
No CPU/video: 1142 ms (3.809923 ms per frame)

CPU speed : 4990 ms (16.633512 ms per frame)
Video speed : 5721 ms (19.072001 ms per frame)
Alpha cost : 1679 ms (5.596743 ms per frame)

Alpha cost is down again, but this time the smallest percentage-wise. I think that there might be other things making it slow, like heavy usage of windows. This game in particular deserves extra attention. Again, alpha isn't used on very much of the screen - it mainly just provides the gradient effect.

Next time I'll talk about some changes I'd like to try for the C code (which will hopefully impact the PSP version as well).

Monday, April 2, 2007

Profiling performance

NOTE: Even though this post is specifically geared to GP2X development the techniques it employs can be used on other platforms. I could run the profiler on the PSP version as well, it's just not as nice to print from until I finally start using PSPlink. It's possible that some optimizations I end up doing could improve performance for all versions too.


With my first release of gpSP (not the first release in general, just mine) for GP2X out of the way, I've decided to focus on all aspects of improving performance of the emulator. To make it more clear what's using up how much time resources on the GP2X I did a simple profiler that would load a savestate, run for N frames (in these cases 300, which is 5 seconds of virtual playtime), then reload the savestate and run another N frames, this time with various things turned off. If you take run A with X turned on then subtract from it run B with X turned off you can find ROUGHLY how long X took within A.

I say roughly because sometimes the time of two things both running will be greater than the sum of their parts (it's possible for this to be significantly so). The main reason for this would be cache competition since the processing can be tightly interleaved (in the case of CPU and video).

Furthermore, all the tests are run with synchronization off, so there are no artificial delays.

So, these are the tests:

Full: Everything is ran.

No alpha blending: I noticed that alpha blending can be very expensive, on all platforms. I also noticed GCC was producing shoddy code for it, which I could maybe do a better version of in ARM ASM (for the GP2X version specifically). So I decided to put this here. This is accomplished by setting a flag which forces the color mode in the BLDCNT register to 0 every scanline.

No video: I think video is the real CPU eater most of the time, so running without it focuses on the CPU (but you'll see there's more left than this). Accomplished by turning on manual frameskip and setting it really, really high (1000000).

No CPU: Kinda the inverse of the above. Makes the game frozen w/ horrible noise playing. This is accomplished by putting the CPU in HALT mode and turning off interrupts.

No CPU/Video: Here the residual things are tested. This includes audio timer performance - if the game had audio timers running in the background they'll still be running (hence the horrible noise), but the GBC channels will probably be off so they aren't tested. This also includes the main state transitions, timer updating/triggering, input polling, background stuff running on the platform (in GP2X's case it should hopefully just be the kernel, SSH server, and a few other low priority things).


And now for some actual test runs along with some commentary. All tests are ran on GP2X at 200MHz - with overclocking you'll achieve better results, but the actual numbers don't matter as much as their relativity to each other and how much they can be improved. However, you'd still ideally want one frame to take 16.67ms or less. If testing just to see how close it is to being full speed I'd do so with at least some overclocking (240-260MHz) since that's what most people will use. All tests also use the mmuhack, of course.

I tried to pick areas where the video load was pretty constant and it didn't look like there'd be a lot of fluctuations in CPU usage. Of course I avoided pressing anything while the tests ran.

Initially I put in numbers approximating the CPU speed/video speed by taking differences between full test and the test w/o the component. This ended up being heavily flawed because of my latter two tests, which change the video mode mid-frame, so disabling the CPU completely changes video performance. So what I settled on was the following:

Video speed ~= full test - no video
CPU speed ~= full test - video speed - residual speed
Alpha cost ~= full test - no blending

This means that for now the no CPU test isn't contributing to the results at the bottom, but can still be useful, especially for seeing how lighter video loads cope.


Castlevania: Aria of Sorrow: This test was done in the name entry screen. I chose this area because there is a lot of alpha blending going on onscreen.

Benchmark results (300 frames):
Full test : 7716 ms (25.722994 ms per frame)
No blending : 5398 ms (17.993380 ms per frame)
No video : 2245 ms (7.484780 ms per frame)
No CPU : 5146 ms (17.154354 ms per frame)
No CPU/video: 360 ms (1.201827 ms per frame)

CPU speed : 1884 ms (6.282953 ms per frame)
Video speed : 5471 ms (18.238213 ms per frame)
Alpha cost : 2318 ms (7.729613 ms per frame)

Here we can see the video is quite expensive, with a lot of that going to the alpha blending. The CPU isn't too bad, but of course significant and improvements in that area will hopefully show up on this test. The residual cost is relatively minor, but still takes up a chunk that can't be ignored. What should be researched is why the video cost is so high even without blending. It could be window usage, which can perhaps be optimized.


Mario Kart: Here I ran it ingame some time into the first race. This is a good example of a fairly heavy but still realistic video load (since it uses the affine transformed BG mode which is expensive, but it only uses it for about 2/3rds of the screen.. also uses a bit of alpha blending)

Benchmark results (300 frames):
Full test : 9036 ms (30.121626 ms per frame)
No blending : 8117 ms (27.058737 ms per frame)
No video : 4241 ms (14.137693 ms per frame)
No CPU : 3215 ms (10.719350 ms per frame)
No CPU/video: 505 ms (1.685850 ms per frame)

CPU speed : 3735 ms (12.451843 ms per frame)
Video speed : 4795 ms (15.983933 ms per frame)
Alpha cost : 918 ms (3.062890 ms per frame)

Now we see the CPU speed has gone up a lot. I hope for improvements to the dynarec and memory subsystem code to show up the most here. The video speed is high as well - if you take out the alpha cost then it's quite a bit higher than Aria of Sorrow's because of all the work that has to be done with scaling and rotating the backgrounds. It's possible that both this and the alpha blending can be improved with ARM ASM versions. The residual cost has gone up a little, but is within the same range. The alpha cost is a lot lower because it's not using actual alpha blending, but just color fades to brighten a few scanlines. This is cheaper to both do the base rendering on and perform the actual blending on. It's still there, of course.


Final Fantasy 6 Advance: This game is extremely demanding, especially in battle. So that's where I tested it.

Benchmark results (300 frames):
Full test : 12088 ms (40.295025 ms per frame)
No blending : 9948 ms (33.162895 ms per frame)
No video : 6123 ms (20.412947 ms per frame)
No CPU : 2663 ms (8.878957 ms per frame)
No CPU/video: 1143 ms (3.810030 ms per frame)

CPU speed : 4980 ms (16.602917 ms per frame)
Video speed : 5964 ms (19.882076 ms per frame)
Alpha cost : 2139 ms (7.132130 ms per frame)

Here we can see that the CPU speed is quite high. FF games on GBA are prone to using a lot of high frequency IRQs which incurs a good amount of expensive opcodes frequently - it's possible that switching to HLE BIOS solutions could help here. Again the video is quite expensive as well, with a lot of alpha being used to minor effect but expensively (the gradients on the menus, for instance). Of interest is the high residual cost, which has gone up quite a lot compared to the other games. This may be due to higher bitrate audio (haven't looked at the actual speeds, just a possibility). If this is the case then optimizing the audio code can improve this.


Final remarks: In one of the cases CPU didn't take much time, but for the others it was about even with video rendering. With this in mind it may be possible to gain some performance by parallelizing the two (both on GP2X and PSP) by splitting between CPU and video on the CPUs, and deciding which one gets the rest of things (but which to put where for GP2X? The choice is more obvious for PSP). This isn't at all straightforward to do however, and for framebuffer (3D) games this should be turned off. CPU and video are still heavily interleaved so you wouldn't get anything close to the performance of the slower of the two, but there could still be some improvement.

Optimizing the video code is something I'm more interested in than I have been for a while. Of course, the optimizing the CPU still takes utmost priority because frameskip can diminish the video time dramatically while still keeping the game playable (going from fs0 to fs1 makes a huge difference). And, just a little difference in CPU speed could make the difference between being able to go up a notch in fs, or more importantly, actually getting below the magical 16.7ms (preferably at reasonable clock speeds).