Difference between revisions of "Z80 Optimization"
(→General) |
|||
Line 2: | Line 2: | ||
== Introduction == | == Introduction == | ||
− | Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. | + | Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. Examples: mapping, grayscale and 3D graphics. |
− | == Registers and | + | == Registers and Memory == |
General algorithm improvements and correct use of registers. | General algorithm improvements and correct use of registers. | ||
+ | General use of registers: | ||
+ | * a accumulator | ||
+ | * b counter | ||
+ | |||
+ | * hl 16-bit accumulator/pointer to memory | ||
+ | * de pointer of destination in memory | ||
=== Stack === | === Stack === | ||
Line 126: | Line 132: | ||
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code. | Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code. | ||
+ | === Optimize size and speed === | ||
+ | |||
+ | ==== Loading stuff ==== | ||
<nowiki> | <nowiki> | ||
;Instead of: | ;Instead of: | ||
Line 142: | Line 151: | ||
sub a ;disadvantages: changes flags | sub a ;disadvantages: changes flags | ||
; -> save 1 byte and 3 T-states | ; -> save 1 byte and 3 T-states | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | ld b,$20 | ||
+ | ld c,$30 | ||
+ | ;try this | ||
+ | ld bc,$2030 | ||
+ | ;or this | ||
+ | ld bc,(b_num * 256) + c_num | ||
+ | ; -> save 1 byte and 4 T-states | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | xor a | ||
+ | ld (data1),a | ||
+ | ld (data2),a | ||
+ | ld (data3),a | ||
+ | ld (data4),a | ||
+ | ld (data5),a ;if data1 to data5 are one after the other | ||
+ | ;try this | ||
+ | ld hl,data1 | ||
+ | ld de,data1+1 | ||
+ | xor a | ||
+ | ld (hl),a | ||
+ | ld bc,4 | ||
+ | ldir | ||
+ | ; -> save 3 bytes for every ld (dataX),a | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | ld a,(var) | ||
+ | inc a | ||
+ | ld (var),a | ||
+ | ;try this ;if hl is not tied up and all you do is check flags, use indirection: | ||
+ | ld hl,var | ||
+ | inc (hl) | ||
+ | ld a,(hl) | ||
</nowiki> | </nowiki> | ||
Line 155: | Line 204: | ||
; -> save 1 byte and 4 T-states | ; -> save 1 byte and 4 T-states | ||
</nowiki> | </nowiki> | ||
+ | |||
+ | ==== Math tricks ==== | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | ld de,-767 | ||
+ | add hl,de | ||
+ | ;try this | ||
+ | dec h ; -256 | ||
+ | dec h ; -512 | ||
+ | dec h ; -768 | ||
+ | inc hl ; -767 | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | srl a | ||
+ | srl a | ||
+ | srl a | ||
+ | ;try this | ||
+ | rrca | ||
+ | rrca | ||
+ | rrca | ||
+ | and %00011111 | ||
+ | </nowiki> | ||
+ | |||
+ | <nowiki> | ||
+ | ;Instead of | ||
+ | loop: | ||
+ | ld a,2 | ||
+ | ;code1 | ||
+ | ld a,0 | ||
+ | ;code2 | ||
+ | |||
+ | ;try this | ||
+ | ld a,2 | ||
+ | loop: | ||
+ | ;code1 | ||
+ | xor $01 | ||
+ | ;code2 | ||
+ | </nowiki> | ||
+ | |||
+ | ==== Looping ==== | ||
+ | <nowiki> | ||
+ | ;smaller, destroys a | ||
+ | loop: | ||
+ | ld a, d | ||
+ | or e | ||
+ | jp nz,loop | ||
+ | ;faster, uses b, bigger | ||
+ | dec de | ||
+ | ld b, e | ||
+ | inc b | ||
+ | inc d | ||
+ | loop2: | ||
+ | djnz loop2 | ||
+ | dec d | ||
+ | jp nz,loop2 | ||
+ | </nowiki> | ||
+ | |||
+ | === Size vs. Speed === | ||
+ | <nowiki> | ||
+ | ;Unroll ldir to ldi's and make use of flag po | ||
+ | </nowiki> | ||
+ | |||
== Setting flags == | == Setting flags == | ||
− | In some | + | In some occasion you might want to selectively set/reset a flag. |
Here are the most common uses : | Here are the most common uses : | ||
Line 206: | Line 320: | ||
more elaborate tricks. As a rule of a thumb, always alter the carry last in | more elaborate tricks. As a rule of a thumb, always alter the carry last in | ||
such cases because the scf and ccf instructions do not have side effects. | such cases because the scf and ccf instructions do not have side effects. | ||
+ | |||
+ | == Related topics == | ||
+ | * MaxCodez topic | ||
+ | * ticalc docs | ||
+ | * doc by some TI programmer in Balley Z80 | ||
+ | |||
+ | == Acknowledgements == | ||
+ | * fullmetalcoder and Galandros contributions |
Revision as of 06:43, 4 November 2009
Contents
Introduction
Sometimes it is needed some extra speed in ASM or make your game smaller to fit on the calculator. Examples: mapping, grayscale and 3D graphics.
Registers and Memory
General algorithm improvements and correct use of registers.
General use of registers:
- a accumulator
- b counter
- hl 16-bit accumulator/pointer to memory
- de pointer of destination in memory
Stack
When you run out of registers, stack may offer an interesting alternative to fixed RAM location for temporary storage.
Allocation
You can either allocate stack space with repeated push, which allows to initialize the data but restricts the allocated space to multiples of 2. An alternate way is to allocate uninitialized stack space (hl may be replaced with an index register) :
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes ld hl, -7 add hl, sp ld sp, hl
Access
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :
; 4 bytes, 19 T-states ld c, (ix + n) ; n is an immediate value in -128..127 ; 4 bytes, 17 T-states, destroys a ld a, (somelocation) ld c, a
If your needs go beyond simple load/store however, this method start to show its real power since it vastly simplify some operations that are complicated to do with fixed storage location (and generally screw up register in the process).
; 3 bytes, 19 T-states cp (ix + n) sub (ix + n) sbc a, (ix + n) add a, (ix + n) adc a, (ix + n) inc (ix + n) dec (ix + n) and (ix + n) or (ix + n) xor (ix + n) ; 4 bytes, 23 T-states rl (ix + n) rr (ix + n) rlc (ix + n) rrc (ix + n) sla (ix + n) sra (ix + n) sll (ix + n) srl (ix + n) bit k, (ix + n) ; k is an immediate value in 0..7 set k, (ix + n) res k, (ix + n)
Again, choose wisely between hl and an index register depending on the structure of your data the smallest/fastest allocation solution may vary (hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing so may require intermediate inc/dec).
Deallocation
If you want need to pop an entry from the stack but need to preserve all registers remember that sp can be incremented/decremented like any 16bit register :
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation... inc sp inc sp
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop ld hl, 16 add hl, sp ld sp, hl
The larger the space to drop the more T-states you will save, and at some point you'll start saving space as well (beyond 8 bytes)
Shadow registers
In some rare cases, when you run out of registers and cannot to either refactor your algorithm(s) or to rely on RAM storage you may want to use the shadow registers : af', bc', de' and hl'
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets at using the following instructions :
ex af, af' ; swaps af and af' as the mnemonic indicates exx ; swaps bc, de, hl and bc', de', hl'
Shadow registers can be of a great help but they come with two drawbacks :
- they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :
; loads hl' with the contents of hl push hl exx pop hl
- they require interrupts to be disabled since they are originally intended for use in Interrupt Service Routine. There are situations where it is affordable and others where it isn't. Regardless, it is generally a good policy to restore the previous interrupt status (enabled/disabled) upon return instead of letting it up to the caller. Hopefully it s relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point push af ; save flags di ; disable interrupts ; do something with shadow registers here pop af ; get back flags ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called ei ; re-enable interrupts ret
Small Tricks
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
Optimize size and speed
Loading stuff
;Instead of: cp 0 ;Use or a ; -> save 1 byte and 3 T-states
;Instead of: ld a,0 ;Try this: xor a ;disadvantages: changes flags ;or sub a ;disadvantages: changes flags ; -> save 1 byte and 3 T-states
;Instead of ld b,$20 ld c,$30 ;try this ld bc,$2030 ;or this ld bc,(b_num * 256) + c_num ; -> save 1 byte and 4 T-states
;Instead of xor a ld (data1),a ld (data2),a ld (data3),a ld (data4),a ld (data5),a ;if data1 to data5 are one after the other ;try this ld hl,data1 ld de,data1+1 xor a ld (hl),a ld bc,4 ldir ; -> save 3 bytes for every ld (dataX),a
;Instead of ld a,(var) inc a ld (var),a ;try this ;if hl is not tied up and all you do is check flags, use indirection: ld hl,var inc (hl) ld a,(hl)
; Instead of : ld a, (hl) ld (de), a inc hl inc de ; Use : ldi inc bc ; -> save 1 byte and 4 T-states
Math tricks
;Instead of ld de,-767 add hl,de ;try this dec h ; -256 dec h ; -512 dec h ; -768 inc hl ; -767
;Instead of srl a srl a srl a ;try this rrca rrca rrca and %00011111
;Instead of loop: ld a,2 ;code1 ld a,0 ;code2 ;try this ld a,2 loop: ;code1 xor $01 ;code2
Looping
;smaller, destroys a loop: ld a, d or e jp nz,loop ;faster, uses b, bigger dec de ld b, e inc b inc d loop2: djnz loop2 dec d jp nz,loop2
Size vs. Speed
;Unroll ldir to ldi's and make use of flag po
Setting flags
In some occasion you might want to selectively set/reset a flag.
Here are the most common uses :
; set Carry flag scf ; reset Carry flag (alters Sign and Zero flags as defined) or a ; alternate reset Carry flag (alters Sign and Zero flags as defined) and a ; set Zero flag (resets Carry flag, alters Sign flag as defined) cp a ; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined) or 1 ; set Sign flag (negative) (alters a, reset Zero and Carry flags) or $80 ; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag) xor a
Other possible uses (much rarer) :
; Set parity/overflow (even): xor a Reset parity/overflow (odd): sub a ; set half carry (hardly ever useful but still...) and a ; reset half carry (hardly ever useful but still...) or a
As you can see these are extremely simple, small and fast ways to alter flags which make them interesting as output of routines to indicate error/success or other status bits that do not require a full register.
Were you to use this, remember that these flag (re)setting tricks frequently overlap so if you need a special combination of flags it might require slightly more elaborate tricks. As a rule of a thumb, always alter the carry last in such cases because the scf and ccf instructions do not have side effects.
Related topics
- MaxCodez topic
- ticalc docs
- doc by some TI programmer in Balley Z80
Acknowledgements
- fullmetalcoder and Galandros contributions