Difference between revisions of "Z80 Optimization"
(→For the sake of size) |
|||
Line 1: | Line 1: | ||
− | |||
== Introduction == | == Introduction == | ||
Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. Examples: consuming graphics/data programs and graphics code of mapping, grayscale and 3D graphics. | Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. Examples: consuming graphics/data programs and graphics code of mapping, grayscale and 3D graphics. | ||
Line 141: | Line 140: | ||
* Make sure the most common checks come first. Or said in other way, the more special and rare cases check in last. | * Make sure the most common checks come first. Or said in other way, the more special and rare cases check in last. | ||
* Get out of the main loop special cases check if they aren't needed there. | * Get out of the main loop special cases check if they aren't needed there. | ||
+ | * Rearrange program flow | ||
* When possible, if you can afford to have a bigger overhead and get code out of the main loop do it. | * When possible, if you can afford to have a bigger overhead and get code out of the main loop do it. | ||
* When your code seems that even with optimization won't be efficient enough, try another approach or algorithm. Search other algorithms in Wikipedia, for instance. | * When your code seems that even with optimization won't be efficient enough, try another approach or algorithm. Search other algorithms in Wikipedia, for instance. | ||
Line 249: | Line 249: | ||
<nowiki> | <nowiki> | ||
xor %11111111 | xor %11111111 | ||
− | > | + | ; > |
cpl | cpl | ||
</nowiki> | </nowiki> | ||
Line 300: | Line 300: | ||
</nowiki> | </nowiki> | ||
+ | <nowiki> | ||
+ | sla l ; I've actually seen this! | ||
+ | rl h | ||
+ | ; > | ||
+ | add hl,hl | ||
+ | ; -> | ||
+ | </nowiki> | ||
==== Conditionals ==== | ==== Conditionals ==== | ||
Line 358: | Line 365: | ||
;only do this if the pushed pc to stack is not passed to the call. Example: some kind of inline vputs. | ;only do this if the pushed pc to stack is not passed to the call. Example: some kind of inline vputs. | ||
; -> save 1 byte and 17 T-states | ; -> save 1 byte and 17 T-states | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Never use: | ||
+ | dec B | ||
+ | jr NZ,loop ;I have seen this... | ||
+ | ;Use: | ||
+ | djnz loop | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | dec bc | ||
+ | ld a,b | ||
+ | or c | ||
+ | ret z | ||
+ | ;try this | ||
+ | cpi ;increments HL | ||
+ | ret po | ||
</nowiki> | </nowiki> | ||
Line 621: | Line 647: | ||
;code or data goes here | ;code or data goes here | ||
− | + | End: | |
.echo "size of the code/data:" | .echo "size of the code/data:" | ||
− | .echo | + | .echo End-SomeCodeorData |
</nowiki> | </nowiki> | ||
* Get a nice IDE of z80 that counts code ([[IDEs|IDE's]]) | * Get a nice IDE of z80 that counts code ([[IDEs|IDE's]]) | ||
− | * Make use of the counting capabilities of an emulator ([[Emulators|Emulators]]) | + | * Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) |
== Related topics == | == Related topics == |
Revision as of 01:46, 9 November 2009
Contents
Introduction
Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. Examples: consuming graphics/data programs and graphics code of mapping, grayscale and 3D graphics.
Registers and Memory
Generally good algorithms on z80 use registers in a appropriate form. It is also a good practise to keep a convention and plan how you are going to use the registers.
General use of registers:
- a - 8-bit accumulator
- b - counter
- hl - 16-bit accumulator/pointer of a address memory
- de - pointer of a destination address memory
- bc - 16-bit counter
- ix - index register/save copy of hl/pointer to memory when hl and de are being used
Stack
When you run out of registers, stack may offer an interesting alternative to fixed RAM location for temporary storage.
Allocation
You can either allocate stack space with repeated push, which allows to initialize the data but restricts the allocated space to multiples of 2. An alternate way is to allocate uninitialized stack space (hl may be replaced with an index register) :
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes ld hl, -7 add hl, sp ld sp, hl
Access
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :
; 4 bytes, 19 T-states ld c, (ix + n) ; n is an immediate value in -128..127 ; 4 bytes, 17 T-states, destroys a ld a, (somelocation) ld c, a
If your needs go beyond simple load/store however, this method start to show its real power since it vastly simplify some operations that are complicated to do with fixed storage location (and generally screw up register in the process).
; 3 bytes, 19 T-states cp (ix + n) sub (ix + n) sbc a, (ix + n) add a, (ix + n) adc a, (ix + n) inc (ix + n) dec (ix + n) and (ix + n) or (ix + n) xor (ix + n) ; 4 bytes, 23 T-states rl (ix + n) rr (ix + n) rlc (ix + n) rrc (ix + n) sla (ix + n) sra (ix + n) sll (ix + n) srl (ix + n) bit k, (ix + n) ; k is an immediate value in 0..7 set k, (ix + n) res k, (ix + n)
Again, choose wisely between hl and an index register depending on the structure of your data the smallest/fastest allocation solution may vary (hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing so may require intermediate inc/dec).
Deallocation
If you want need to pop an entry from the stack but need to preserve all registers remember that sp can be incremented/decremented like any 16bit register :
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation... inc sp inc sp
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop ld hl, 16 add hl, sp ld sp, hl
The larger the space to drop the more T-states you will save, and at some point you'll start saving space as well (beyond 8 bytes)
Shadow registers
In some rare cases, when you run out of registers and cannot to either refactor your algorithm(s) or to rely on RAM storage you may want to use the shadow registers : af', bc', de' and hl'
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets at using the following instructions :
ex af, af' ; swaps af and af' as the mnemonic indicates exx ; swaps bc, de, hl and bc', de', hl'
Shadow registers can be of a great help but they come with two drawbacks :
- they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :
; loads hl' with the contents of hl push hl exx pop hl
- they require interrupts to be disabled since they are originally intended for use in Interrupt Service Routine. There are situations where it is affordable and others where it isn't. Regardless, it is generally a good policy to restore the previous interrupt status (enabled/disabled) upon return instead of letting it up to the caller. Hopefully it s relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point push af ; save flags di ; disable interrupts ; do something with shadow registers here pop af ; get back flags ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called ei ; re-enable interrupts ret
General Algorithms
Registers and Memory use is very important in writing concise and fast z80 code. Then comes the general optimization.
First, try to optimize the more used code in subroutines and large loops. Finding the bottleneck and solving it, is enough to many programs.
A list of things to keep in mind:
- Rework conditionals to be more efficient.
- Make sure the most common checks come first. Or said in other way, the more special and rare cases check in last.
- Get out of the main loop special cases check if they aren't needed there.
- Rearrange program flow
- When possible, if you can afford to have a bigger overhead and get code out of the main loop do it.
- When your code seems that even with optimization won't be efficient enough, try another approach or algorithm. Search other algorithms in Wikipedia, for instance.
- Rewriting code from scratch can bring new ideas (use in desperate situations because of all work needed to write it)
- Remember almost all times is better to leave optimization to the end. Optimization can bring too early headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, it may not be needed further optimization.
- Document wacky optimizations to understand the code later
Small Tricks
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
Also note that near every trick turn the code less understandable and documenting them is a good idea. You can easily forgot after a while without reading parts of the code.
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on its use, comments warn about them. Some tricks apply to other cases, but again you have to be careful.
Optimize size and speed
Loading stuff
;Instead of: ld a,0 ;Try this: xor a ;disadvantages: changes flags ;or sub a ;disadvantages: changes flags ; -> save 1 byte and 3 T-states
;Instead of ld b,$20 ld c,$30 ;try this ld bc,$2030 ;or this ld bc,(b_num * 256) + c_num ;where b_num goes to b register and c_num to c register ; -> save 1 byte and 4 T-states
;Instead of ld a,$42 ld (hl),a ;try this ld (hl),$42 ; -> save 1 byte and 4 T-states
;Instead of xor a ld (data1),a ld (data2),a ld (data3),a ld (data4),a ld (data5),a ;if data1 to data5 are one after the other ;try this ld hl,data1 ld de,data1+1 xor a ld (hl),a ld bc,4 ldir ; -> save 3 bytes for every ld (dataX), after passing the initial overhead
;Instead of ld a,(var) inc a ld (var),a ;try this ;Note: if hl is not tied up, use indirection: ld hl,var inc (hl) ld a,(hl) ;if you don't need (hl) in a, delete this line ; -> save 2 bytes and 2 T-states
; Instead of : ld a, (hl) ld (de), a inc hl inc de ; Use : ldi inc bc ; -> save 1 byte and 4 T-states
Math and Logic tricks
;Instead of: cp 0 ;Use or a ; -> save 1 byte and 3 T-states
cp 1 ; > dec a ;changes a! ; -> save 1 byte and 3 T-states
xor %11111111 ; > cpl
;Instead of ld de,767 or a ;reset carry so sbc works as a sub sbc hl,de ;try this ld de,-767 ;negation of de add hl,de ; -> 2 bytes and 8 T-states !
;Instead of ld de,-767 add hl,de ;try this dec h ; -256 dec h ; -512 dec h ; -768 inc hl ; -767 ;Note that works in many other cases ; -> save 3 T-states
;Instead of srl a srl a srl a ;try this rrca rrca rrca and %00011111 ; -> save 1 byte and 5 T-states
;Instead of neg add a,N ;you want to calculate N-A ;Do it this way: cpl add a,N+1 ;neg is practically equivalent to cpl \ inc a ; -> save 1 byte and 4 T-states
sla l ; I've actually seen this! rl h ; > add hl,hl ; ->
Conditionals
and 1 cp 1 jr z,foo ; > and 1 ;and sets zero flag, no need for cp jr nz,foo ; -> save 2 bytes and 7 T-states
and 1 cp 1 ;a not needed after this jr z,foo ; > rra jr c,foo
bit 0,a call z,foo ; > rra call nc,foo
bit 7,a jr z,foo ; > rla jr nc,foo
bit 2,a ret nz xor a ; > and %100 ret nz
Others
Calling and returning...
;Instead of call xxxx ret ;try this jp xxxx ;only do this if the pushed pc to stack is not passed to the call. Example: some kind of inline vputs. ; -> save 1 byte and 17 T-states
;Never use: dec B jr NZ,loop ;I have seen this... ;Use: djnz loop
;Instead of dec bc ld a,b or c ret z ;try this cpi ;increments HL ret po
;Instead of loop: ld a,2 ;code1 ld a,0 ;code2 djnz loop ;try this ld a,2 loop: ;code1 xor $01 ; the trick is xor logic make a register alternate between two values ;code2 djnz loop ; -> save size and time depending on its use
Size vs. Speed
The classical problem of optimization in computer programming, Z80 is no exception. In ASM most frequently size is what matters because generally ASM is fast enough and it is nice to give a user a smaller program that doesn't use up most RAM memory.
For the sake of size
- Use relative jumps (jr label) whenever possible. When relative jump is out of reach (out of -128 to 127 bytes) and there is a jp near, do a relative jump to the absolute one. Example:
;lots of code (more that 128 bytes worth of code) somelabel2: jp somelabel ;less than 128 bytes jr somelabel2 ;instead of a absolute jump directly to somelabel, jump to a jump to somelabel.
- Relative jumps are 2 bytes and absolute jumps 3. In terms of speed jp is faster when a jump occurs (10 T-states) and jr is faster when it doesn't occur.
Passing inline data
When you call, the pc + 3 (after the call) is pushed. You can pop it and use as a pointer to data. A very nifty use is with strings. To return, pass the data and jp (hl).
Instead of: ld hl,string bcall(_vputs) ret ;Try this: call Disp .db "This is some text",0 ret ;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address. ;It also heavily disturbs disassembly. Disp: pop hl bcall(_vputs) jp (hl) ; -> save 2 bytes for each use, but 4 bytes of overhead (Disp routine)
This routine can be expanded to pass the coordinates where the text should appear.
Wasting time to delay
There are those funny times that you need some delay between operations like reads/writes to ports and there is nothing useful to do. And because nop's are not very size friendly, think of other slower but smaller instructions. Example:
;Instead of ld a,KEY_GROUP out (1),a nop nop in a,(1) ;Try this: ld a,KEY_GROUP out (1),a ld a,(de) ;a doesn't need to be preserved because it will hold what the port has. in a,(1) ; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)
When you need to delay and cannot afford to alter registers or flags there are still ways to delay that waste less size than nop's :
; 2 bytes, 8 T-states nop nop ; 2 bytes, 12 T-states inc hl dec hl ; 2 bytes, 12 T-states jr $+2 ; 2 bytes, 21 T-states push af pop af ; 2 bytes, 38 T-states ex (sp), hl ex (sp), hl
If you need a small adjustable delay:
;4 bytes, b*13+8 T-states (variable) ld b,255 ; initial delay djnz $ ; do it ;b=0 on exit
Notes:
- There are many other instructions that you can use
- Beware that not all instructions preserve registers or flags
- For delay between frames of games or other longer delays, you can use the 'halt' instruction if there are interrupts enabled. It make the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter interrupt mask and interrupt time speed beforehand (and possibly restore their values afterwards).
Unrolling code
General Unrolling You can unroll some loop several times instead of looping, this is used frequently on math routines of multiplication. This means you are wasting memory to gain speed. Most times you are preferring size to speed.
Unroll commands
; "Classic" way : ~21 T-states per byte copied ld hl,src ld de,dest ld bc,size ldir ; Unrolled : (16 * size + 10) / n -> ~18 T-states per byte copied when unrolling 8 times ld hl,src ld de,dest ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration loopldi: ;you can use this entry for a call ldi ldi ldi ldi ldi ldi ldi ldi jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size ; ret if this is a subroutine and use the unrolled ldi's with a call.
This unroll of ldi also works with outi and ldr.
Looping with 16 bit counter
There are two ways to make loops with a 16bit counter :
- the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a
ld bc, ... loop: ; loop body here dec bc ld a, b or c jp nz,loop
- the slightly trickier one, which takes a couple more bytes but has a much lower overhead (12 * n + 14 * (n / 16) T-states)
dec de ld b, e inc b inc d loop2: ; loop body here djnz loop2 dec d jp nz,loop2
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz.
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :
#define inner_counter8(counter16) (((counter16) - 1) & 0xff) + 1 #define outer_counter8(counter16) (((counter16) - 1) >> 8) + 1
Setting flags
In some occasion you might want to selectively set/reset a flag.
Here are the most common uses :
; set Carry flag scf ; reset Carry flag (alters Sign and Zero flags as defined) or a ; alternate reset Carry flag (alters Sign and Zero flags as defined) and a ; set Zero flag (resets Carry flag, alters Sign flag as defined) cp a ; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined) or 1 ; set Sign flag (negative) (alters a, reset Zero and Carry flags) or $80 ; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag) xor a
Other possible uses (much rarer) :
;Set parity/overflow (even): xor a ;Reset parity/overflow (odd): sub a ;Set half carry (hardly ever useful but still...) and a ;Reset half carry (hardly ever useful but still...) or a ;Set bit 5 of f: or %00100000
As you can see these are extremely simple, small and fast ways to alter flags which make them interesting as output of routines to indicate error/success or other status bits that do not require a full register.
Were you to use this, remember that these flag (re)setting tricks frequently overlap so if you need a special combination of flags it might require slightly more elaborate tricks. As a rule of a thumb, always alter the carry last in such cases because the scf and ccf instructions do not have side effects.
More advance ways of manipulating flags follow:
;get the zero flag in carry scf jr z,$+3 ccf ;Put carry flag into zero flag. ccf sbc a, a
Tools of the job
Want to try test your optimization or test new ones? Then you have to check this:
- Keep a z80 instruction set to not forget a useful instruction. (see Z80_Instruction_Set)
- Get a assembler that can echo and use this in the source to count: (see Assemblers)
SomeCodeorData: ;code or data goes here End: .echo "size of the code/data:" .echo End-SomeCodeorData
- Get a nice IDE of z80 that counts code (IDE's)
- Make use of the counting capabilities of an emulator (Emulators)
Related topics
- MaxCodez TI-ASM optimization
- ticalc archives: 1 2
- Balley Alley Z80 Machine Language Documentation
- Fast loops in MSX Assembly Page
- Shiar z80 optimization page
Acknowledgements
- fullmetalcoder
- Galandros