<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://wikiti.brandonw.net/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://wikiti.brandonw.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Einar</id>
		<title>WikiTI - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://wikiti.brandonw.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Einar"/>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Special:Contributions/Einar"/>
		<updated>2026-04-05T18:58:12Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.23.5</generator>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Random</id>
		<title>Z80 Routines:Math:Random</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Random"/>
				<updated>2020-10-05T12:44:51Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Added links to Patrik Rak's excellent random number generators&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Z80 Routines:Math|Random]]&lt;br /&gt;
[[Category:Z80 Routines|Random]]&lt;br /&gt;
&lt;br /&gt;
==Ion Random==&lt;br /&gt;
This is based on the tried-and-true [http://en.wikipedia.org/wiki/PRNG pseudorandom number generator] featured in Ion by Joe Wingbermuehle.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;-----&amp;gt; Generate a random number&lt;br /&gt;
; output a=answer 0&amp;lt;=a&amp;lt;=255&lt;br /&gt;
; all registers are preserved except: af&lt;br /&gt;
random:&lt;br /&gt;
        push    hl&lt;br /&gt;
        push    de&lt;br /&gt;
        ld      hl,(randData)&lt;br /&gt;
        ld      a,r&lt;br /&gt;
        ld      d,a&lt;br /&gt;
        ld      e,(hl)&lt;br /&gt;
        add     hl,de&lt;br /&gt;
        add     a,l&lt;br /&gt;
        xor     h&lt;br /&gt;
        ld      (randData),hl&lt;br /&gt;
        pop     de&lt;br /&gt;
        pop     hl&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
randData must be a 2-byte seed located in RAM. While this generator is fast, it is generally not considered very good in terms of randomness.&lt;br /&gt;
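&lt;br /&gt;
A common follow-up is to reduce the byte to a smaller range. The routine below is only a sketch of one way to do that, scaling the result by an upper bound in b; the name randRange and its register usage are assumptions made for this example and are not part of the routine above.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;; randRange (hypothetical helper): a = random integer in 0..b-1&lt;br /&gt;
; input: b = upper bound, 1..255 (b = 0 acts as 256)&lt;br /&gt;
; destroys: af, b; hl and de are preserved&lt;br /&gt;
randRange:&lt;br /&gt;
        call    random          ; a = 0..255&lt;br /&gt;
        push    hl&lt;br /&gt;
        push    de&lt;br /&gt;
        ld      e,a&lt;br /&gt;
        ld      d,0             ; de = random byte&lt;br /&gt;
        ld      h,d&lt;br /&gt;
        ld      l,d             ; hl = 0&lt;br /&gt;
randRangeLoop:&lt;br /&gt;
        add     hl,de           ; add the random byte b times: hl = byte * b&lt;br /&gt;
        djnz    randRangeLoop&lt;br /&gt;
        ld      a,h             ; a = (byte * b) / 256, always below b&lt;br /&gt;
        pop     de&lt;br /&gt;
        pop     hl&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Scaling by multiplication takes the result from the upper bits of the product rather than from the low bits of the random byte. The loop runs b times, so it gets slower as the bound grows.&lt;br /&gt;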
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Linear Feedback Shift Register==&lt;br /&gt;
This PRNG is based on a [http://en.wikipedia.org/wiki/LFSR linear feedback shift register]. It uses a 64-bit seed and generates 8 new bits on every call. LFSRSeed must be an 8-byte seed located in RAM.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;------LFSR------&lt;br /&gt;
;James Montelongo&lt;br /&gt;
;optimized by Spencer Putt&lt;br /&gt;
;out:&lt;br /&gt;
; a = 8 bit random number&lt;br /&gt;
RandLFSR:&lt;br /&gt;
        ld hl,LFSRSeed+4&lt;br /&gt;
        ld e,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld d,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld c,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld a,(hl)&lt;br /&gt;
        ld b,a&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        ld h,a&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        xor b&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        xor h&lt;br /&gt;
        xor c&lt;br /&gt;
        xor d&lt;br /&gt;
        ld hl,LFSRSeed+6&lt;br /&gt;
        ld de,LFSRSeed+7&lt;br /&gt;
        ld bc,7&lt;br /&gt;
        lddr&lt;br /&gt;
        ld (de),a&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
While this may produce better numbers, it is slower, larger and requires a bigger seed than Ion Random. Assuming there is a good seed to start with, it should generate ~2^56 bytes before repeating. However, if the seed is poor (0, for example), the numbers generated will not be adequate. Unlike Ion Random, which mixes in the r register so that the surrounding code influences the output, this generator always produces the same sequence from the same seed. This means this method requires more initialization.&lt;br /&gt;
&lt;br /&gt;
You can initialize it with TI-OS's seeds, stored at seed1 and seed2. Both are TI floats, but they will serve the purpose.&lt;br /&gt;
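&lt;br /&gt;
A minimal sketch of that initialization, assuming seed1 and seed2 are the equates supplied by your TI-OS include file and LFSRSeed is the 8-byte buffer used above. The label InitLFSR and the choice of which bytes to copy are illustrative assumptions, not an established routine.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;; InitLFSR (hypothetical helper): fill LFSRSeed from the TI-OS seeds&lt;br /&gt;
; destroys: bc, de, hl (and the flags changed by ldir)&lt;br /&gt;
InitLFSR:&lt;br /&gt;
        ld hl,seed1+2           ; skip the sign/type byte and the exponent byte of the float&lt;br /&gt;
        ld de,LFSRSeed&lt;br /&gt;
        ld bc,4&lt;br /&gt;
        ldir                    ; first 4 mantissa bytes of seed1&lt;br /&gt;
        ld hl,seed2+2&lt;br /&gt;
        ld c,4                  ; b is already 0 after ldir&lt;br /&gt;
        ldir                    ; 4 mantissa bytes of seed2 go to LFSRSeed+4&lt;br /&gt;
        ld hl,LFSRSeed&lt;br /&gt;
        set 0,(hl)              ; force one bit on so the seed can never be all zeros&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only mantissa bytes are copied because the first two bytes of a TI float (sign/type and exponent) vary little between values and would contribute almost nothing to the seed.&lt;br /&gt;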
&lt;br /&gt;
==Combined LFSR/LCG, 16-bit seeds==&lt;br /&gt;
This is a very fast, quality pseudo-random number generator. It combines a 16-bit Linear Feedback Shift Register and a 16-bit LCG.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
prng16:&lt;br /&gt;
;Inputs:&lt;br /&gt;
;   (seed1) contains a 16-bit seed value&lt;br /&gt;
;   (seed2) contains a NON-ZERO 16-bit seed value&lt;br /&gt;
;Outputs:&lt;br /&gt;
;   HL is the result&lt;br /&gt;
;   BC is the result of the LCG, so not that great of quality&lt;br /&gt;
;   DE is preserved&lt;br /&gt;
;Destroys:&lt;br /&gt;
;   AF&lt;br /&gt;
;cycle: 4,294,901,760 (almost 4.3 billion)&lt;br /&gt;
;160cc&lt;br /&gt;
;26 bytes&lt;br /&gt;
    ld hl,(seed1)&lt;br /&gt;
    ld b,h&lt;br /&gt;
    ld c,l&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    inc l&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ld (seed1),hl&lt;br /&gt;
    ld hl,(seed2)&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    sbc a,a&lt;br /&gt;
    and %00101101&lt;br /&gt;
    xor l&lt;br /&gt;
    ld l,a&lt;br /&gt;
    ld (seed2),hl&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
On their own, LCGs and LFSRs don't produce great results and are generally very cyclical, but they are very fast to compute. The 16-bit LCG in the above example will bounce around and reach each number from 0 to 65535, but the lower bits are far more predictable than the upper bits. The LFSR mixes up the predictability of a given bit's state, but it hits every number except 0, meaning there is a slightly higher chance of any given bit in the result being a 1 instead of a 0. It turns out that by adding together the outputs of these two generators, we can lose the predictability of a bit's state, while ensuring it has a 50% chance of being 0 or 1. Also, since the periods, 65536 and 65535, are coprime, the overall period of the generator is 65535*65536 = 4,294,901,760, which is over 4 billion.&lt;br /&gt;
&lt;br /&gt;
==Combined LFSR/LCG, 32-bit seeds==&lt;br /&gt;
This is similar to the one above, except that it uses 32-bit seeds (and still returns a 16-bit result). An advantage here is that it has been tested and passed randomness tests (all of the ones offered by CAcert labs). It is also still very fast.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
rand32:&lt;br /&gt;
;Inputs:&lt;br /&gt;
;   (seed1_0) holds the lower 16 bits of the first seed&lt;br /&gt;
;   (seed1_1) holds the upper 16 bits of the first seed&lt;br /&gt;
;   (seed2_0) holds the lower 16 bits of the second seed&lt;br /&gt;
;   (seed2_1) holds the upper 16 bits of the second seed&lt;br /&gt;
;   **NOTE: seed2 must be non-zero&lt;br /&gt;
;Outputs:&lt;br /&gt;
;   HL is the result&lt;br /&gt;
;   BC,DE can be used as lower quality values, but are not independent of HL.&lt;br /&gt;
;Destroys:&lt;br /&gt;
;   AF&lt;br /&gt;
;Tested and passes all CAcert tests&lt;br /&gt;
;Uses a very simple 32-bit LCG and 32-bit LFSR&lt;br /&gt;
;it has a period of 18,446,744,069,414,584,320&lt;br /&gt;
;roughly 18.4 quintillion.&lt;br /&gt;
;LFSR taps: 0,2,6,7  = 11000101&lt;br /&gt;
;291cc&lt;br /&gt;
seed1_0=$+1&lt;br /&gt;
    ld hl,12345&lt;br /&gt;
seed1_1=$+1&lt;br /&gt;
    ld de,6789&lt;br /&gt;
    ld b,h&lt;br /&gt;
    ld c,l&lt;br /&gt;
    add hl,hl \ rl e \ rl d&lt;br /&gt;
    add hl,hl \ rl e \ rl d&lt;br /&gt;
    inc l&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ld (seed1_0),hl&lt;br /&gt;
    ld hl,(seed1_1)&lt;br /&gt;
    adc hl,de&lt;br /&gt;
    ld (seed1_1),hl&lt;br /&gt;
    ex de,hl&lt;br /&gt;
seed2_0=$+1&lt;br /&gt;
    ld hl,9876&lt;br /&gt;
seed2_1=$+1&lt;br /&gt;
    ld bc,54321&lt;br /&gt;
    add hl,hl \ rl c \ rl b&lt;br /&gt;
    ld (seed2_1),bc&lt;br /&gt;
    sbc a,a&lt;br /&gt;
    and %11000101&lt;br /&gt;
    xor l&lt;br /&gt;
    ld l,a&lt;br /&gt;
    ld (seed2_0),hl&lt;br /&gt;
    ex de,hl&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Xorshift==&lt;br /&gt;
&lt;br /&gt;
Xorshift is a class of pseudorandom number generators discovered by George Marsaglia and detailed in his 2003 paper, ''Xorshift RNGs''.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 16-bit xorshift pseudorandom number generator by John Metcalf&lt;br /&gt;
; 20 bytes, 86 cycles (excluding ret)&lt;br /&gt;
&lt;br /&gt;
; returns   hl = pseudorandom number&lt;br /&gt;
; corrupts   a&lt;br /&gt;
&lt;br /&gt;
; generates 16-bit pseudorandom numbers with a period of 65535&lt;br /&gt;
; using the xorshift method:&lt;br /&gt;
&lt;br /&gt;
; hl ^= hl &amp;amp;lt;&amp;amp;lt; 7&lt;br /&gt;
; hl ^= hl &amp;amp;gt;&amp;amp;gt; 9&lt;br /&gt;
; hl ^= hl &amp;amp;lt;&amp;amp;lt; 8&lt;br /&gt;
&lt;br /&gt;
; some alternative shift triplets which also perform well are:&lt;br /&gt;
; 6, 7, 13; 7, 9, 13; 9, 7, 13.&lt;br /&gt;
&lt;br /&gt;
  org 32768&lt;br /&gt;
&lt;br /&gt;
xrnd:&lt;br /&gt;
  ld hl,1       ; seed must not be 0&lt;br /&gt;
&lt;br /&gt;
  ld a,h&lt;br /&gt;
  rra&lt;br /&gt;
  ld a,l&lt;br /&gt;
  rra&lt;br /&gt;
  xor h&lt;br /&gt;
  ld h,a&lt;br /&gt;
  ld a,l&lt;br /&gt;
  rra&lt;br /&gt;
  ld a,h&lt;br /&gt;
  rra&lt;br /&gt;
  xor l&lt;br /&gt;
  ld l,a&lt;br /&gt;
  xor h&lt;br /&gt;
  ld h,a&lt;br /&gt;
&lt;br /&gt;
  ld (xrnd+1),hl&lt;br /&gt;
&lt;br /&gt;
  ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related Topics ==&lt;br /&gt;
Better random algorithms are available at:&lt;br /&gt;
* [https://gist.github.com/raxoft/c074743ea3f926db0037 Patrik Rak's Xor-Shift random number generator]&lt;br /&gt;
* [https://gist.github.com/raxoft/2275716fea577b48f7f0 Patrik Rak's CMWC random number generator]&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2020-10-04T22:15:45Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: /* Looping with 16 bit counter */ Better optimization and simpler macro&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need extra speed from your ASM code, or you need to make your game smaller so that it fits on the calculator. Typical cases are graphics- or data-heavy programs and graphics code such as mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut a few bytes, go straight to the Small Tricks section of this page.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the z80 use the registers appropriately.&lt;br /&gt;
It is also good practice to keep to a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c, d, e, h, l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
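&lt;br /&gt;
As a rough illustration of the iy advice above, the pattern looks like this; myTable is a hypothetical label, and the sketch assumes interrupts were enabled on entry.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
        di                      ; the TI-OS interrupt handler expects iy to hold its own value&lt;br /&gt;
        push iy                 ; save the TI-OS value&lt;br /&gt;
        ld iy,myTable           ; iy is now free to use as a table pointer&lt;br /&gt;
        ld a,(iy+0)&lt;br /&gt;
        add a,(iy+1)            ; a = sum of the first two table entries&lt;br /&gt;
        pop iy                  ; restore iy before TI-OS sees it again&lt;br /&gt;
        ei                      ; assumes interrupts were enabled when this code started&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;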
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
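&lt;br /&gt;
A small sketch of that cost difference, reading two consecutive bytes of a structure (the choice of registers a and b is only an example):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; using an index register&lt;br /&gt;
	ld a,(ix+0)&lt;br /&gt;
	ld b,(ix+1)&lt;br /&gt;
; 6 bytes, 38 clocks&lt;br /&gt;
&lt;br /&gt;
; the same reads through hl (hl ends up advanced by one)&lt;br /&gt;
	ld a,(hl)&lt;br /&gt;
	inc hl&lt;br /&gt;
	ld b,(hl)&lt;br /&gt;
; 3 bytes, 20 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;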
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
* you pass a value to a routine through the accumulator&lt;br /&gt;
* the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
* but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if a value is passed in l and l is always lower than 64, you can do &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes and 8 T-states, while add hl,hl is only 1 byte but takes 11 T-states.&lt;br /&gt;
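&lt;br /&gt;
Side by side, the two ways of building hl = l*4 compare as follows (a sketch assuming l holds a value from 0 to 63):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; shift in 8 bits first, then extend to 16 bits&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4, still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
; 6 bytes, 23 clocks&lt;br /&gt;
&lt;br /&gt;
; extend to 16 bits first, then shift&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
; 4 bytes, 29 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;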
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need the speed and the code is bug-free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is simply to load the low-byte register into the high-byte register. This works because, in binary, a multiplication by 256 is the same as shifting left by 8 bits and filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters, instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
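&lt;br /&gt;
For instance, a loop counter kept in the i register could look like the sketch below. The count of 16 and the label iLoop are arbitrary, and it assumes nothing else needs i or interrupts while the loop runs.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	di		; i may only be borrowed while interrupts are off&lt;br /&gt;
	ld a,16&lt;br /&gt;
	ld i,a		; i = loop counter&lt;br /&gt;
iLoop:&lt;br /&gt;
; ... loop body, free to use a, bc, de and hl ...&lt;br /&gt;
	ld a,i		; fetch the counter&lt;br /&gt;
	dec a&lt;br /&gt;
	ld i,a		; store it back; ld i,a does not change the flags set by dec a&lt;br /&gt;
	jr nz,iLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;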
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations, without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
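&lt;br /&gt;
Reading the register usage off the routine above, a call could be set up as in the following sketch. The operand values are arbitrary examples; the multiplicand goes in d'e'de and the multiplier in b'c'bc.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
        ld de,$5678     ; low word of the multiplicand&lt;br /&gt;
        ld bc,$0003     ; low word of the multiplier&lt;br /&gt;
        exx&lt;br /&gt;
        ld de,$1234     ; high word of the multiplicand&lt;br /&gt;
        ld bc,$0000     ; high word of the multiplier&lt;br /&gt;
        exx             ; back to the standard set, as MUL32 expects on entry&lt;br /&gt;
        call MUL32      ; h'l'hl = low 32 bits of $12345678 * 3&lt;br /&gt;
; MUL32 executes di and never ei, so re-enable interrupts here if you need them&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;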
&lt;br /&gt;
Shadow registers can be a great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to assign from a standard register to a shadow register or vice versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in an Interrupt Service Routine. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. It's relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although this method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that, in order to work on all Z80 CPUs (including NMOS Z80), it's necessary to check the interrupt status twice within a short interval. If an interrupt occurs exactly during the first test, it can cause a &amp;quot;false negative&amp;quot;, but testing again quickly, before another interrupt can happen, ensures a reliable result, as follows:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  jp pe,label&lt;br /&gt;
  ld a, i  ; test again, to fix potential &amp;quot;false negative&amp;quot; from interrupt occurring at first test&lt;br /&gt;
label:&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers and no calls are allowed.&lt;br /&gt;
Mind again that this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is to use an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (hl-equivalent instructions are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. when you would need more than five pop instructions).&lt;br /&gt;
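&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with four bytes of local variables might be sketched as below. The label frameDemo and the values stored are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
frameDemo:&lt;br /&gt;
 push ix           ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix         ; reserve 4 bytes; ix points at the first one&lt;br /&gt;
 ld (ix + 0), 10   ; local 0 = 10&lt;br /&gt;
 ld (ix + 1), 0    ; local 1 = 0&lt;br /&gt;
 inc (ix + 1)      ; local 1 = 1&lt;br /&gt;
 ld a, (ix + 0)&lt;br /&gt;
 add a, (ix + 1)   ; a = 11&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl         ; release the 4 bytes&lt;br /&gt;
 pop ix            ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;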
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code, in subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
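&lt;br /&gt;
As a hedged sketch of the vector table idea, the dispatcher below replaces a chain of cp/jp z tests. The labels dispatch, jumpTable, cmdZero, cmdOne and cmdTwo are hypothetical, the .dw directive assumes a TASM-style assembler, and a must already be a valid index.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = command number, 0..2 (not range-checked here)&lt;br /&gt;
dispatch:&lt;br /&gt;
	add a,a			; each table entry is 2 bytes&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld d,0&lt;br /&gt;
	ld hl,jumpTable&lt;br /&gt;
	add hl,de		; hl points to the entry&lt;br /&gt;
	ld a,(hl)&lt;br /&gt;
	inc hl&lt;br /&gt;
	ld h,(hl)&lt;br /&gt;
	ld l,a			; hl = handler address (stored low byte first)&lt;br /&gt;
	jp (hl)			; the handler's ret returns to dispatch's caller&lt;br /&gt;
jumpTable:&lt;br /&gt;
	.dw cmdZero&lt;br /&gt;
	.dw cmdOne&lt;br /&gt;
	.dw cmdTwo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dispatch cost is the same whichever entry is selected, and adding a command only costs two more bytes of table.&lt;br /&gt;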
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the more special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When possible, if you can afford a bigger overhead in exchange for getting code out of the main loop, do it.&lt;br /&gt;
* When it seems that your code won't be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations, because of all the work needed to write it again).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used for loading constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used together with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2   ; assembles with displacement byte 0, which is just a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
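&lt;br /&gt;
One way the displacement could be computed, shown as a sketch: to end up with n rotations, skip the first 8-n of the eight rrca instructions. Each rrca is one byte, so the displacement equals the number of instructions skipped. The use of b for n is an assumption of this example.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: b = number of rotations wanted, 0..8&lt;br /&gt;
 ld a,8&lt;br /&gt;
 sub b           ; a = number of rrca instructions to skip&lt;br /&gt;
 ld (jpmodify),a ; patch the displacement byte of the jr&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;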
&lt;br /&gt;
Another SMC trick is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
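&lt;br /&gt;
A possible shape for this, as a sketch only: ld a,(ix+d) assembles to three bytes with the displacement last, so the byte to patch sits at $+2. The label elemOffset and passing the index in a are assumptions.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = element index, 0..127; ix = start of a list of bytes&lt;br /&gt;
 ld (elemOffset),a&lt;br /&gt;
;... code ...&lt;br /&gt;
elemOffset = $+2&lt;br /&gt;
 ld a,(ix+0)     ; the 0 is just a placeholder; reads element n once patched&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;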
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. After a while away from parts of the code, you can easily forget how they work.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need to keep a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
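    ld de,-767 ;add the negated constant instead&lt;br /&gt;
    add hl,de  ;unlike sbc, add hl,de leaves the zero flag alone and the carry comes out inverted&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states !&lt;br /&gt;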
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this also works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra          ;moves bit 0 into carry (changes a)&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra          ;moves bit 0 into carry (changes a)&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla          ;moves bit 7 into carry (changes a)&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
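  and %100     ;a = 0 when falling through, so no xor a is needed (but a = %100 when returning)&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;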
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call a routine and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call (for example, an inline vputs routine that reads its data from the return address).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor-ing with a constant makes a register alternate between two values: here 1 xor 3 = 2 and 2 xor 3 = 1.&lt;br /&gt;
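&lt;br /&gt;
The same idea extends to any pair of values: xor with the constant (value1 xor value2). A minimal sketch, assuming two made-up values 5 and 9 that are not part of the original example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; toggle a between 5 (%0101) and 9 (%1001)&lt;br /&gt;
; the constant is 5 xor 9 = %1100 = $0C, worked out by hand&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor $0C        ; a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the new value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;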
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2, each entry is 2 bytes (Number must be below 128)&lt;br /&gt;
 ld h,0&lt;br /&gt;
 ld l,a&lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de ; hl points to the table entry&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a    ; hl = address read from the table&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
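; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2, each entry is 2 bytes (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a    ; l = low byte of the entry address, carry = overflow into the high byte&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l     ; a = high byte of the table + carry&lt;br /&gt;
 ld h,a    ; hl points to the table entry, de is left untouched&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a    ; hl = address read from the table&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;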
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment (self-modifying code, so it must run from RAM)&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2, each entry is 2 bytes (Number must be below 128)&lt;br /&gt;
 ld (addr+1),a   ; patch the low byte of the address used by the next instruction&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classical trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump is taken and 7 when it is not; so jp is faster when the jump occurs and jr is faster when it doesn't.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements bc and increments hl; flags reflect the compare with (hl), but a is kept&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call, the address right after the call instruction (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed there. A very nifty use is with strings. To return, move the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
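  pop hl           ;hl = return address = start of the string&lt;br /&gt;
  bcall(_vputs)    ;relies on _vputs leaving hl just past the terminating zero&lt;br /&gt;
  jp (hl)          ;resume at the code that follows the string&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (ld hl,string is gone and call Disp is as big as the bcall), but 5 bytes of overhead (Disp routine)&lt;br /&gt;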
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
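&lt;br /&gt;
A minimal sketch of that extension, assuming the TI-83 Plus small-font variables penCol and penRow from ti83plus.inc (with penRow stored right after penCol) and a made-up routine name DispAt; neither comes from the original routine:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                ;column 10, row 20 (example values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                   ;hl = return address = start of the inline data&lt;br /&gt;
  ld e,(hl)                ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                ;d = row&lt;br /&gt;
  inc hl                   ;hl = start of the string&lt;br /&gt;
  ld (penCol),de           ;writes e to penCol and d to penRow (assumed layout)&lt;br /&gt;
  bcall(_vputs)            ;assumed to leave hl just past the terminating zero&lt;br /&gt;
  jp (hl)                  ;continue with the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;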
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (7 T-states instead of 8 is almost the same delay)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to get a longer delay than nop's out of the same number of bytes:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For the delay between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
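&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming interrupts are already enabled (otherwise halt never returns) and an arbitrary count of 10 interrupts:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld b,10		; number of interrupts to wait for (made-up value)&lt;br /&gt;
waitLoop:&lt;br /&gt;
	halt		; sleep until the next interrupt&lt;br /&gt;
	djnz waitLoop&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;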
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is used frequently in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed, whereas most of the time you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : (16 * n + 10) / n T-states per byte copied -&amp;gt; 17.25 when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if size is not a multiple of the number of unrolled ldi, a small trick is needed to enter the loop part-way through on the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; add a ret here if this is a subroutine, and call loopldi to use the unrolled ldi's.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi counts down in b and signals the end through the zero flag, so loop with jp nz instead).&lt;br /&gt;
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
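  ld  b, e      ; inner counter = low byte of the 16-bit counter held in de&lt;br /&gt;
  dec  de&lt;br /&gt;
  inc  d        ; outer counter = ((counter - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;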
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the proper values of the 8-bit loop counters from a 16-bit counter, in case you want to do the conversion at compile time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) ((counter16) &amp;amp; 0xff)&lt;br /&gt;
  #define outer_counter8(counter16) ((((counter16) - 1) &amp;gt;&amp;gt; 8) + 1)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
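For instance, a loop that runs 1000 times can load both counters as constants. This is only a sketch: COUNT is a made-up name, and the .equ directive and the &amp;amp; and &amp;gt;&amp;gt; operators are assumed to be supported by your assembler:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
COUNT .equ 1000&lt;br /&gt;
  ld  b, COUNT &amp;amp; $FF              ; inner counter: 1000 &amp;amp; 255 = 232&lt;br /&gt;
  ld  d, ((COUNT - 1) &amp;gt;&amp;gt; 8) + 1    ; outer counter: (999 &amp;gt;&amp;gt; 8) + 1 = 4&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2                    ; total: 232 + 3 * 256 = 1000 iterations&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;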
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines test whether b is 0, with the same size and speed, but the second preserves the accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair (a is destroyed)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set/reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
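&lt;br /&gt;
As a sketch of a routine that reports its result in a flag (isDigit and notADigit are made-up names, not from the original text), this one tells the caller whether a holds an ASCII decimal digit:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; out: carry reset if a is '0' to '9', carry set otherwise; a is preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
	cp $30		; '0'&lt;br /&gt;
	ret c		; a &amp;lt; '0': cp already left carry set&lt;br /&gt;
	cp $3A		; '9' + 1&lt;br /&gt;
	ccf		; cp set carry for a &amp;lt;= '9', so invert it&lt;br /&gt;
	ret&lt;br /&gt;
&lt;br /&gt;
; usage, somewhere else in the program&lt;br /&gt;
	call isDigit&lt;br /&gt;
	jr c,notADigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;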
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
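;copy the zero flag into the carry flag&lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3	;zero set: keep carry set and skip the ccf&lt;br /&gt;
	ccf		;zero reset: carry becomes reset&lt;br /&gt;
&lt;br /&gt;
;copy the carry flag into the zero flag (destroys a)&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a	;a = 0 and zero set if carry was set on entry, a = $FF and zero reset otherwise&lt;br /&gt;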
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set summary at hand so you do not forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), you can get to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
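&lt;br /&gt;
A minimal sketch of such a loop, assuming a made-up label buffer that starts on a 256-byte boundary and an arbitrary count of 16 bytes to add up:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, buffer     ; buffer is assumed to be 256-byte aligned&lt;br /&gt;
 ld b, 16          ; number of bytes to process&lt;br /&gt;
 xor a             ; running 8-bit sum&lt;br /&gt;
sumLoop:&lt;br /&gt;
 add a, (hl)&lt;br /&gt;
 inc l             ; 4 clocks instead of 6 for inc hl; h never changes&lt;br /&gt;
 djnz sumLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Because inc l wraps from $FF back to $00 without touching h, the same code also turns a 256-byte aligned table into a circular buffer.&lt;br /&gt;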
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    jr nz,else    ; the IF condition&lt;br /&gt;
    ;some code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the ELSE code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 10 bytes, 33 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
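    .db $C2  ;opcode of jp nz,xxxx: it swallows the 2 bytes of ld a,4 as its address and is never taken, because zero is set here&lt;br /&gt;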
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 9 bytes, 31 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
This only takes 31 T-states for the IF path: a small saving of 2 T-states, which could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reasons not to use this for 1-byte or 2-byte instructions are code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
However, when the ELSE code is a single 2-byte instruction as above, it's usually better to simply execute the ELSE part in all cases and then skip the IF part depending on the condition. This is obviously not always possible, since the ELSE code must not change the flags being tested:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 8 bytes, 28 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this particular example, the code could be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    dec a       ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 7 bytes, 25 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc,$+1 (a displacement of -1). When taken, the jump lands on its own displacement byte ($FF), which is the rst $38 opcode.&lt;br /&gt;
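A sketch of the trick, using nz as an arbitrary example condition (jr only supports z, nz, c and nc):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  call nz,$0038&lt;br /&gt;
;try this&lt;br /&gt;
  jr nz,$+1     ;assembles to $20,$FF: the target is the $FF byte itself, which executes as rst $38&lt;br /&gt;
;the return address pushed by the rst is the instruction right after the jr&lt;br /&gt;
; -&amp;gt; save 1 byte&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;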
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
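Normally the DAA instruction is used for BCD math, but it can also be used to convert a value from 0 to 15 in a into its ASCII hexadecimal digit.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		; carry set if a is 0 to 9&lt;br /&gt;
	ccf		; carry set if a is 10 to 15&lt;br /&gt;
	adc a, 30h	; $30-$39 for 0 to 9, $3B-$40 for 10 to 15&lt;br /&gt;
	daa		; adds 6 in the second case: $41-$46 ('A' to 'F')&lt;br /&gt;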
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot; trick with JP NZ&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* lunarul&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2016-02-15T00:24:39Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved text about restoring interrupt status&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller so it fits on the calculator. Examples: programs heavy on graphics/data, and graphics code for tile mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the Z80 use the registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or you need optimal execution) (disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor operates faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is either slower or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result (up to 63*12 = 756) has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved off a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla l is 2 bytes and 8 T-states, while add hl,hl is only 1 byte but takes 11 T-states.&lt;br /&gt;
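&lt;br /&gt;
As a sketch of this trade-off (assuming, as above, that the value in l is always lower than 64), the two variants compare as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Smaller: extend to 16 bits first, then shift hl&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl	; hl=l*2&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
; 4 bytes, 29 clocks&lt;br /&gt;
&lt;br /&gt;
; Faster: shift in 8 bits first (l*4 still fits in 8 bits), then extend&lt;br /&gt;
	sla l		; l*2&lt;br /&gt;
	sla l		; l*4&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
; 6 bytes, 23 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;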
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because, in binary, multiplying by 256 is the same as shifting left by 8 bits and filling in zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
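&lt;br /&gt;
The same idea works in the other direction. As a minimal sketch (not one of the original examples, and assuming hl holds an unsigned value), dividing by 256 just moves the high byte down and discards the low byte:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; divide hl by 256 (unsigned), result in hl&lt;br /&gt;
	ld l,h		; the high byte becomes the low byte&lt;br /&gt;
	ld h,0		; clear the high byte&lt;br /&gt;
; 3 bytes, 11 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;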
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
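&lt;br /&gt;
A minimal sketch of such a loop, assuming your assembler accepts the undocumented ixl mnemonics (some assemblers spell them differently) and using a hypothetical doWork routine that is free to use a, bc, de and hl:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8	; the loop counter lives in the low half of ix&lt;br /&gt;
loop:&lt;br /&gt;
	call doWork	; hypothetical routine, it must preserve ix&lt;br /&gt;
	dec ixl		; sets the zero flag when the counter reaches 0&lt;br /&gt;
	jr nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;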
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and cannot either refactor your algorithm(s) or rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations, without relying on RAM storage or on pushing to and popping from the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be a great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow register or vice versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. It's relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although this method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that, in order to work on all Z80 CPUs (including NMOS Z80), it's necessary to check interrupt status twice within a short interval. This way, if an interrupt occurred exactly during the first test, it could cause a &amp;quot;false negative&amp;quot;, but testing it again quickly before another interrupt could happen would ensure a reliable result, as follows:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  jp pe,label&lt;br /&gt;
  ld a, i  ; test again, to fix potential &amp;quot;false negative&amp;quot; from interrupt occurring at first test&lt;br /&gt;
label:&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally in an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers and no calls are allowed.&lt;br /&gt;
Mind again that this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without having to use inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7 (bit takes 20 T-states, not 23)&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop, but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (when dropping more than 10 bytes, that is, more than 5 pops).&lt;br /&gt;
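&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch of a routine that keeps two temporary bytes on the stack could look like this (the routine name and the slot layout are only illustrative assumptions):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
useFrame:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -2&lt;br /&gt;
 add ix, sp       ; ix = sp - 2&lt;br /&gt;
 ld sp, ix        ; allocate 2 bytes, ix points to the first one&lt;br /&gt;
 ld (ix + 0), a   ; slot 0 = a&lt;br /&gt;
 ld (ix + 1), 0   ; slot 1 = 0&lt;br /&gt;
&lt;br /&gt;
 ; ... code that works on (ix + 0) and (ix + 1) ...&lt;br /&gt;
&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp           ; drop the 2 allocated bytes&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;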
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. After that comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code, in subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Look through a complete instruction set reference for instructions that can optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the more special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* If you can afford a bigger overhead in exchange for getting code out of the main loop, do it.&lt;br /&gt;
* When it seems that your code won't be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations, because of all the work needed to write it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save any value to be used later (usually seen in masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify load instructions that use (ix+0), changing the 0 to other values to quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
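&lt;br /&gt;
A minimal sketch of this, assuming ix already points to the start of a list of bytes and a holds an index from 0 to 127 (the label name is only illustrative):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (readElem+2),a   ; the displacement is the third byte of the instruction&lt;br /&gt;
;...&lt;br /&gt;
readElem:&lt;br /&gt;
 ld b,(ix+0)         ; the 0 is just a placeholder, b = element number a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;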
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy of DE in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negation of de&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states !&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
You should almost never call and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed on the stack (some kind of inline vputs does, for example).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
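&lt;br /&gt;
The same nesting can be extended further. As a sketch, one more level (the label bar4 is only illustrative) runs the routine eight times at the cost of three more bytes and one more return address on the stack:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar4     ; Run routine eight times&lt;br /&gt;
bar4:&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;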
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor logic makes a register alternate between two values: xor it with the xor of those two values (here 1 xor 2 = 3).&lt;br /&gt;
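&lt;br /&gt;
As a sketch of how this generalizes, assume (hypothetically) that the two values are 5 and 9. The constant to use is 5 xor 9 = %0101 xor %1001 = %1100 = 12:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5, because 5 xor 12 = 9 and 9 xor 12 = 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;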
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Trading size against speed is the classic optimization problem, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is usually what matters most: assembly is generally fast enough already, and it is nice to give the user a smaller program that does not take up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump occurs (a taken jr takes 12 T-states), while jr is faster when it does not occur (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you execute a call, the address of the instruction following it (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; saves 3 bytes for each use (no ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
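&lt;br /&gt;
As a rough sketch of that extension, the routine below reads two coordinate bytes placed before the string. The name DispAt and the layout of the inline data are made up for this example, penCol and penRow are assumed to be the usual TI-OS small-font pen variables, and _vputs is assumed to leave hl just past the terminating zero, exactly as the Disp routine above already relies on:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Usage: column, row, then the zero-terminated text&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20,&amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:                ; hypothetical name&lt;br /&gt;
  pop hl               ; hl -&amp;gt; inline data (return address of the call)&lt;br /&gt;
  ld a,(hl)            ; first byte: pen column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld (penCol),a&lt;br /&gt;
  ld a,(hl)            ; second byte: pen row&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld (penRow),a&lt;br /&gt;
  bcall(_vputs)        ; draws the text; hl assumed to end up after the 0 byte&lt;br /&gt;
  jp (hl)              ; resume execution right after the inline data&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each use then costs 2 extra data bytes, which is still smaller than loading the pen variables with separate instructions before every call.&lt;br /&gt;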
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are times when you need a short delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do in the meantime'''''. Because nop's are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags there are still ways to delay that waste less size than nop's :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For a delay between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
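&lt;br /&gt;
As an illustration of the halt-based delay from the last note, the sketch below waits for a given number of interrupts. The count of 10 and the label name are arbitrary, and the real duration depends on which interrupt sources and timer speed are active, which this example does not set up:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;5 bytes, duration = b interrupt periods&lt;br /&gt;
 ld b,10         ; number of interrupts to wait for (arbitrary)&lt;br /&gt;
waitFrames:&lt;br /&gt;
 ei              ; halt would never wake up with interrupts disabled&lt;br /&gt;
 halt            ; low power mode until the next interrupt&lt;br /&gt;
 djnz waitFrames&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;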
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop, repeating its body several times instead of looping; this is used frequently in multiplication routines.&lt;br /&gt;
This trades memory for speed, whereas most of the time you will prefer size over speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied -&amp;gt; ~17 T-states when unrolling n = 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi.&lt;br /&gt;
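&lt;br /&gt;
A sketch of the outi variant follows. Because outi counts down in b and sets the Zero flag when b reaches zero (instead of counting in bc and reporting through P/V), the loop ends with jp nz. The names src, port and size are placeholders, size is assumed to be a multiple of 8, and the device behind the port is assumed to accept data at this rate:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src       ; placeholder: address of the data to send&lt;br /&gt;
 ld c,port       ; placeholder: port number&lt;br /&gt;
 ld b,size       ; byte count, a multiple of 8 (use 0 for 256)&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi            ; out (c),(hl) / inc hl / dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz, loopouti ; Z is set by the last outi once b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;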
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de         ; de = 16-bit loop count (0 counts as 65536)&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
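&lt;br /&gt;
For example, for a loop that must run 1000 times, the macros give an inner counter of 232 and an outer counter of 4 (232 + 3 * 256 = 1000), so the run-time conversion can be dropped. This sketch assumes an assembler that accepts the #define macros above; the label name is arbitrary:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; ((1000 - 1) &amp;amp; 0xff) + 1 = 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; ((1000 - 1) &amp;gt;&amp;gt; 8) + 1 = 4&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop3      ; first pass runs 232 times, later passes 256 times&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;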
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register (the second version modifies a instead of de)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes flags attractive as routine outputs to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Zero, Sign and Parity/Overflow flags untouched.&lt;br /&gt;
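&lt;br /&gt;
A small sketch of both ideas: a routine that reports an error through the carry flag, and a flag combination built with the carry altered last. The names checkDigit and handleError are invented for the example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; returns Carry set if a is not in 0..9 (error), Carry reset otherwise; a is preserved&lt;br /&gt;
checkDigit:&lt;br /&gt;
 cp 10           ; Carry set if a &amp;lt; 10&lt;br /&gt;
 ccf             ; invert it: Carry set if a &amp;gt;= 10&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; in the caller&lt;br /&gt;
 call checkDigit&lt;br /&gt;
 jr c,handleError&lt;br /&gt;
&lt;br /&gt;
; returning with both Zero and Carry set&lt;br /&gt;
 cp a            ; sets Zero, resets Carry&lt;br /&gt;
 scf             ; Carry altered last, Zero is left untouched&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;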
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try out new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE of z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If a table is placed at an aligned address such as $8000 (theoretical example) and uses at most 256 bytes ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
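&lt;br /&gt;
For instance, clearing a 256-byte buffer could be sketched as follows. The label buffer is a placeholder and is assumed to sit on a 256-byte boundary:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,buffer    ; low byte of the address is 0&lt;br /&gt;
 xor a           ; value to store&lt;br /&gt;
 ld b,a          ; b = 0 makes djnz loop 256 times&lt;br /&gt;
clearLoop:&lt;br /&gt;
 ld (hl),a&lt;br /&gt;
 inc l           ; 4 T-states instead of 6 for inc hl; h never changes&lt;br /&gt;
 djnz clearLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;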
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and even make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    jr nz,else    ; the IF condition&lt;br /&gt;
    ;some code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the ELSE code is a single 2-byte instruction:&lt;br /&gt;
Instead of the &amp;quot;jr endif&amp;quot; line, you place the first byte of a 3-byte instruction with no side effects, which then swallows the ELSE instruction as its operand!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 10 bytes, 33 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    .db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 9 bytes, 31 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
This only takes 31 T-states for if. A small saving of 2 T-states, but could be useful in tight loops, and saves 1 byte!&lt;br /&gt;
The only reasons not to use this with 1-byte or 2-byte instructions are code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
However, when the ELSE code is a single 2-byte instruction as above, it's usually better to simply execute the ELSE part in all cases, then skip the IF part depending on the condition. Obviously this option won't always be possible (the ELSE code must not alter the flags being tested):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 8 bytes, 28 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this particular example, the code could be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    dec a       ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 7 bytes, 25 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc,$+1 (a displacement of -1). This causes a conditional jump onto the jr's own displacement byte ($FF), which is the rst $38 opcode.&lt;br /&gt;
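&lt;br /&gt;
Written out with the Zero condition (picked arbitrarily for this sketch):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 jr z,$+1        ; assembles to $28,$FF&lt;br /&gt;
; Z set:   execution continues at the $FF byte, which is rst $38;&lt;br /&gt;
;          the rst returns to the instruction following these two bytes&lt;br /&gt;
; Z reset: the jr falls through as usual&lt;br /&gt;
; 2 bytes instead of the 3 bytes of jr nz,$+3 / rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;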
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
The DAA instruction is normally used for BCD math, but it can also be used to convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0'-'9', 'A'-'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
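&lt;br /&gt;
To see why this works, here is the same sequence wrapped in a routine, with both cases traced in the comments. The routine name is made up, and the input is assumed to be a value from 0 to 15 in a:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
nibbleToHex:            ; hypothetical name&lt;br /&gt;
	cp 10           ; Carry set if a is 0..9&lt;br /&gt;
	ccf             ; Carry set if a is 10..15&lt;br /&gt;
	adc a, 30h      ; 0..9 -&amp;gt; 30h..39h, 10..15 -&amp;gt; 3Bh..40h&lt;br /&gt;
	daa             ; 30h..39h stay unchanged, 3Bh..40h get 6 added -&amp;gt; 41h..46h&lt;br /&gt;
	ret&lt;br /&gt;
; a = 7  gives 37h ('7')&lt;br /&gt;
; a = 11 gives 3Ch after the adc, then 42h ('B') after the daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;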
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot; trick with JP NZ&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* lunarul&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-09-04T13:20:40Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Credited lunarul for pointing out the original &amp;quot;xor&amp;quot; example was broken (already fixed)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so it fits on the calculator. Examples: programs that use a lot of graphics or data, and graphics code for tile mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good Z80 algorithms generally use the registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor performs operations on 8-bit values faster.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte larger.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and every instruction is 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
* you pass a value to a routine through the accumulator&lt;br /&gt;
* the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
* but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is pass A (the accumulator) right in the start to HL&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you shaved a few clock cycles and, in the last version, saved a byte too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes (8 T-states), while add hl,hl is only 1 byte (11 T-states).&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can produce bugs and code that is somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters, instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Note that ixh/ixl/iyh/iyl are undocumented, and using them will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
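&lt;br /&gt;
A minimal sketch of an index register half used as a loop counter. These are undocumented instructions and their spelling differs between assemblers (ixl is assumed here), so check what yours accepts; the label name is arbitrary:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8        ; loop 8 times&lt;br /&gt;
countLoop:&lt;br /&gt;
	; loop body, free to use a, bc, de and hl&lt;br /&gt;
	dec ixl         ; sets the Zero flag like dec on any 8-bit register&lt;br /&gt;
	jr nz,countLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;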
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat common for doing arithmetic operations on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally a good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. It's relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although this method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are in use (notably used in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers and no calls are allowed. Interrupts must also stay disabled, since accepting one pushes the return address to wherever sp points.&lt;br /&gt;
Mind again that this is only used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry: wastes 1 byte and 2 T-states compared to a pop, but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and at some point you start saving space as well (beyond 10 bytes dropped, since 5 pop instructions already take 5 bytes).&lt;br /&gt;
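&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch could look as follows (the routine name, the 4-byte frame size and the loop count are illustrative assumptions, not taken from a real program):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; hypothetical routine keeping its loop counter in a stack local&lt;br /&gt;
countdown:&lt;br /&gt;
  push ix            ; preserve the caller's ix&lt;br /&gt;
  ld ix, -4&lt;br /&gt;
  add ix, sp&lt;br /&gt;
  ld sp, ix          ; allocate 4 bytes: locals are (ix + 0) .. (ix + 3)&lt;br /&gt;
  ld (ix + 0), 10    ; initialize the counter&lt;br /&gt;
countdownLoop:&lt;br /&gt;
  ; loop body here: any register except ix may be used, pushes and pops must balance&lt;br /&gt;
  dec (ix + 0)&lt;br /&gt;
  jr nz, countdownLoop&lt;br /&gt;
  ld hl, 4&lt;br /&gt;
  add hl, sp&lt;br /&gt;
  ld sp, hl          ; drop the 4 bytes (destroys hl)&lt;br /&gt;
  pop ix&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;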
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or look-up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, driving a task from a chunk of data is better than a more conventional programming method (notably in graphical screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of your code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rare special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead to get code out of the main loop, do it.&lt;br /&gt;
* When your code still isn't efficient enough after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work involved).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early brings headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. An instruction set reference that lists the opcodes is useful here.&lt;br /&gt;
Although any instruction can be modified this way, the technique is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a   ; a = number of rrca to skip&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2            ; the displacement byte (0 here) is just a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
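&lt;br /&gt;
As a sketch of how the patched displacement can be computed, assume b holds a rotation count from 0 to 8 and c the value to rotate (both register choices and the label are illustrative). Each rrca is 1 byte, so skipping 8 - b bytes leaves exactly b rotations to execute:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,8&lt;br /&gt;
 sub b              ; a = number of rrca to skip&lt;br /&gt;
 ld (rotModify),a   ; patch the displacement of the jr below&lt;br /&gt;
 ld a,c             ; value to rotate&lt;br /&gt;
rotModify = $+1&lt;br /&gt;
 jr $+2             ; displacement byte is overwritten at run time&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
; a = c rotated right b times&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;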
&lt;br /&gt;
Another SMC technique is to patch the displacement of load instructions that use (ix+0): changing the 0 to other values lets you quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
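&lt;br /&gt;
A minimal sketch of this, assuming ix already points at a list of bytes and a hypothetical variable named index holds an element number from 0 to 127:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(index)&lt;br /&gt;
 ld (elemOffset),a  ; patch the displacement byte&lt;br /&gt;
elemOffset = $+2    ; ld a,(ix+d) is encoded as $DD, $7E, d&lt;br /&gt;
 ld a,(ix+0)        ; reads the element selected by index&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;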
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have exceptions to their use; comments warn about them. Some tricks apply to other cases as well, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every additional ld (dataX),a once past the initial overhead (the ldir version is slower, though)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a! sets Z like cp 1 but leaves carry untouched&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl     ;unlike xor, does not update the Zero and Sign flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states ! (add hl,de does not update Z, and carry is inverted compared to sbc)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;The same idea works in many other cases (and de is preserved)&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra          ;changes a!&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla          ;changes a!&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
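  and %100     ;a is already 0 when we do not return, but it is changed (to 4) when we do&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte and up to 5 T-states&lt;br /&gt;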
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
A call immediately followed by a ret can almost always be replaced with a jump:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call, as a routine reading inline data does (e.g. some kind of inline vputs).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a constant makes a register alternate between two values: xor 3 works here because 1 xor 2 = 3.&lt;br /&gt;
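&lt;br /&gt;
The same idea extends to any pair of values: xor with the constant obtained by xoring the two values together. A small sketch, using the arbitrary values 5 and 9:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 (%0101) or 9 (%1001)&lt;br /&gt;
 xor %1100      ; %0101 xor %1001 = %1100, so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the new value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;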
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment (self-modifying, so this code must run from RAM)&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classical trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't take up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (beyond -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, whereas jr takes 12 T-states when the jump is taken and 7 when it is not, so jp is faster when the jump occurs and jr is faster when it doesn't.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments hl, alters all flags except carry&lt;br /&gt;
 ret po&lt;br /&gt;
; -&amp;gt; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, the address of the instruction after the call (pc + 3) is pushed. The routine can pop it and use it as a pointer to data; a very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
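&lt;br /&gt;
One possible sketch of that extension, assuming the TI-83 Plus variable penCol with penRow stored immediately after it (as in ti83plus.inc), and assuming, as the Disp routine above does, that _vputs leaves hl just past the terminating 0. DispAt is a hypothetical name and the coordinates are arbitrary:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20              ; pen column, pen row&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                 ; hl points to the inline data&lt;br /&gt;
  ld e,(hl)              ; column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)              ; row&lt;br /&gt;
  inc hl                 ; hl now points to the string&lt;br /&gt;
  ld (penCol),de         ; e goes to penCol, d to penRow&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)                ; resume after the inline data&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;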
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations, such as port reads/writes, '''''and there is nothing useful to do'''''. Because nops are not very size-friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags there are still ways to delay that waste less size than nop's :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low-power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
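&lt;br /&gt;
For instance, a sketch that waits for 10 interrupts (the count is arbitrary, and the real duration depends on the interrupt sources and timer speed in use):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei                 ; halt needs interrupts enabled to wake up&lt;br /&gt;
 ld b,10&lt;br /&gt;
waitLoop:&lt;br /&gt;
 halt               ; sleep until the next interrupt&lt;br /&gt;
 djnz waitLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;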
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop body several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed, whereas most of the time you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied, where n is the number of unrolled ldi -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This way of unrolling ldi also works with ldd, and with outi if you use b as the counter and loop with jp nz instead.&lt;br /&gt;
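&lt;br /&gt;
When size is known at assembly time, the entry-point trick mentioned in the comment above can be sketched like this (assuming size is at least 1 and an assembler that accepts % as the modulo operator, as used elsewhere on this page):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 jp loopldi + (2*((8 - (size % 8)) % 8))   ; each ldi is 2 bytes: the first pass copies size % 8 bytes (or 8 if that is 0)&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi     ; after the first pass bc is a multiple of 8: loop until it reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;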
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the two 8-bit loop counters from a 16-bit counter if you want to do the conversion at assembly time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
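&lt;br /&gt;
As a worked example, assuming an assembler that supports these macros and an arbitrary count of 1000 iterations: 1000 - 1 = $03E7, so the inner counter is $E7 + 1 = 232 and the outer counter is 3 + 1 = 4, which gives 232 + 3 * 256 = 1000 passes through the loop body:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld b, inner_counter8(1000)   ; = 232&lt;br /&gt;
  ld d, outer_counter8(1000)   ; = 4&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Note that a count whose inner counter works out to 256 must be loaded into b as 0.&lt;br /&gt;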
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register (destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you may want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags may require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Zero, Sign and Parity/Overflow flags untouched.&lt;br /&gt;
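&lt;br /&gt;
As an illustration of returning a status in a flag, here is a sketch of a hypothetical routine (isDigit and notADigit are made-up names) that reports through the carry flag whether a holds an ASCII digit, without needing a result register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = character, out: carry set if a is '0'..'9', a preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp '0'&lt;br /&gt;
 ccf                ; carry set if a &amp;gt;= '0'&lt;br /&gt;
 ret nc             ; below '0': return with carry reset&lt;br /&gt;
 cp '9'+1           ; carry set if a &amp;lt;= '9'&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; usage&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr nc,notADigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;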
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a, leaves carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you don't forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has an &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice z80 IDE that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address such as $8000 (a theoretical example). If you only use 256 bytes ($8000 to $80FF), step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
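&lt;br /&gt;
A minimal sketch of this, assuming a hypothetical 256-byte-aligned buffer named alignedBuf (the label and the summing loop are illustrative only):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Sum the first 16 bytes of alignedBuf into a (alignedBuf must be at $xx00)&lt;br /&gt;
	ld hl, alignedBuf	; l = 0 because the buffer is aligned&lt;br /&gt;
	xor a			; running total&lt;br /&gt;
	ld b, 16&lt;br /&gt;
sumLoop:&lt;br /&gt;
	add a, (hl)&lt;br /&gt;
	inc l			; 4 clocks instead of 6 for inc hl, h never changes&lt;br /&gt;
	djnz sumLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Keep in mind that, unlike inc hl, inc l changes the flags.&lt;br /&gt;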
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them confuse disassemblers and even the coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    jr nz,else    ; the IF condition&lt;br /&gt;
    ;some code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the ELSE code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 10 bytes, 33 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    .db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 9 bytes, 31 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
This takes only 31 T-states on the IF path: a small saving of 2 T-states that can still matter in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reasons not to use this for 1-byte or 2-byte instructions are code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
However, when the ELSE code is a single 2-byte instruction as above, it's usually better to simply execute the ELSE part in all cases and then skip the IF part depending on the condition. This is only possible when the ELSE code does not disturb the condition being tested (here ld does not change the flags set by cp):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 8 bytes, 28 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this particular example, the code could be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    dec a       ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 7 bytes, 25 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1 (cc being any jr condition). The conditional jump lands on its own displacement byte ($FF), which is the rst $38 opcode.&lt;br /&gt;
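&lt;br /&gt;
A sketch of the byte layout, using nz as the condition. The label rstTrick is hypothetical, and the raw bytes are written out because assemblers differ in how they accept a displacement of -1:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
rstTrick:&lt;br /&gt;
	.db $20, $FF	; jr nz,rstTrick+1 : opcode $20, displacement -1&lt;br /&gt;
; If nz: jumps to rstTrick+1, where the byte $FF executes as rst $38.&lt;br /&gt;
;        rst $38 pushes rstTrick+2, so its ret resumes right after these 2 bytes.&lt;br /&gt;
; If z:  falls through to rstTrick+2.&lt;br /&gt;
; 2 bytes instead of 3 for jr z,$+3 \ rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;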
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
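Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0'-'9', 'A'-'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		; carry set if a is 0-9&lt;br /&gt;
	ccf		; carry set if a is 10-15&lt;br /&gt;
	adc a, 30h	; $30-$39 for 0-9, $3B-$40 for 10-15&lt;br /&gt;
	daa		; adds 6 to the 10-15 results, giving $41-$46 ('A'-'F')&lt;br /&gt;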
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot; trick with JP NZ&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* lunarul&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-09-04T13:03:03Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;better else&amp;quot; example (thanks to Metalbrain for pointing this out)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller to fit on the calculator. Examples: programs that consume a lot of graphics or data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut a few bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the z80 use registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or you need optimal execution; disable interrupts first and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor operates faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
Operations on the ix or iy registers are slower still and always 1 byte bigger per instruction. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result (up to 63*12 = 756) has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to copy A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte, too.&lt;br /&gt;
You can do this for registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes and 8 clocks, while add hl,hl is only 1 byte but takes 11 clocks.&lt;br /&gt;
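&lt;br /&gt;
The two variants side by side, as a sketch (byte and clock counts tallied from the standard Z80 instruction timings):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; l = value below 64, result hl = l*4&lt;br /&gt;
; Smaller: 4 bytes, 29 clocks&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
; Faster: 6 bytes, 23 clocks (valid only because l*4 still fits in 8 bits)&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l&lt;br /&gt;
	ld h,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;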
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register and clear the low byte. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Note that ixh/ixl/iyh/iyl are undocumented, and using them will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
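&lt;br /&gt;
A minimal sketch of a loop counter kept in ixl. It assumes your assembler accepts the ixl mnemonic (names for the undocumented half registers vary between assemblers), and countLoop is a hypothetical label:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Loop 8 times using ixl as the counter, leaving a, b, c, d, e, h, l free&lt;br /&gt;
	ld ixl, 8		; undocumented: DD 2E 08 (3 bytes, 11 clocks)&lt;br /&gt;
countLoop:&lt;br /&gt;
	; loop body goes here&lt;br /&gt;
	dec ixl			; undocumented: DD 2D (2 bytes, 8 clocks), sets Z like dec l&lt;br /&gt;
	jr nz, countLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;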
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping on the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot be mixed with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow register or vice versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. It's relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although this method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it well to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled, because all of these use sp as the stack pointer.&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
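&lt;br /&gt;
For comparison, a sketch of the push-based variant, which initializes the space at the same time (the value 0 is only an example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 4 bytes of stack space initialized to zero : 5 bytes, 32 T-states&lt;br /&gt;
 ld hl, 0&lt;br /&gt;
 push hl&lt;br /&gt;
 push hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;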
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with a fixed storage location (and that generally clobber a register in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, where more than 5 pop instructions would be needed).&lt;br /&gt;
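&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch could look like this (the values and the layout of the three local variables are purely illustrative):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocate 4 uninitialized bytes on the stack and address them through ix&lt;br /&gt;
 push ix           ; keep the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix         ; sp = ix = start of the 4-byte area&lt;br /&gt;
&lt;br /&gt;
 ld (ix + 0), 10   ; local variable 0&lt;br /&gt;
 ld (ix + 1), 3    ; local variable 1&lt;br /&gt;
 ld a, (ix + 0)&lt;br /&gt;
 sub (ix + 1)      ; a = 10 - 3 = 7&lt;br /&gt;
 ld (ix + 2), a    ; local variable 2&lt;br /&gt;
&lt;br /&gt;
 ; free the area again (destroys hl)&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 pop ix&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Because sp points at the start of the area, anything pushed later (including by an interrupt) lands below it and leaves the variables intact.&lt;br /&gt;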
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently used code in subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look-up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
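&lt;br /&gt;
As a rough sketch of the vector table idea (handler0 to handler3 and handlerTable are hypothetical names, and a is assumed to be range-checked already):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Dispatch on a = 0..3 through a table of handler addresses&lt;br /&gt;
	add a, a		; each table entry is 2 bytes&lt;br /&gt;
	ld e, a&lt;br /&gt;
	ld d, 0&lt;br /&gt;
	ld hl, handlerTable&lt;br /&gt;
	add hl, de&lt;br /&gt;
	ld a, (hl)		; low byte of the handler address&lt;br /&gt;
	inc hl&lt;br /&gt;
	ld h, (hl)		; high byte&lt;br /&gt;
	ld l, a			; hl = handler address&lt;br /&gt;
	jp (hl)&lt;br /&gt;
handlerTable:&lt;br /&gt;
	.dw handler0, handler1, handler2, handler3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The dispatch cost stays the same however many handlers are added, whereas a chain of cp/jr pairs grows with every extra case.&lt;br /&gt;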
&lt;br /&gt;
Browse a complete instruction set for instructions that can optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations because of all the work needed to rewrite it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value to be used later (usually seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00     ; the displacement byte is a placeholder, patched through jpmodify&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC technique is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
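&lt;br /&gt;
A minimal sketch of that idea, in the same style as the examples above (elemOffset is a hypothetical label, and the displacement is a signed byte, so only -128 to 127 can be patched in):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Read element number a (0..127) of the list ix points to&lt;br /&gt;
 ld (elemOffset),a   ; patch the displacement byte&lt;br /&gt;
;...code...&lt;br /&gt;
elemOffset = $+2     ; the displacement is the third byte of the instruction below&lt;br /&gt;
 ld a,(ix+0)         ; 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;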
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first before applying any of them if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the available z80 instructions. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX),a once past the initial overhead (13 bytes instead of 16 here), but ldir is slower&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a! (and, unlike cp 1, leaves the carry untouched)&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl     ;unlike xor, does not update the Z, S and carry flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but the carry comes out inverted and Z is not updated)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases, and that de is left untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra           ;bit 0 goes into carry&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra           ;bit 0 goes into carry, a is modified&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla           ;bit 7 goes into carry, a is modified&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
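  and %100      ;a becomes 0 when falling through, but holds %100 instead of its old value when returning&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte&lt;br /&gt;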
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by call, as an inline-data vputs routine does.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
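&lt;br /&gt;
The same nesting can be stacked further. As a sketch (the label bar4 is made up for this illustration), each extra level costs one more 3-byte call and doubles the number of runs:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar4     ; Run routine eight times&lt;br /&gt;
bar4:&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;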
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a constant makes a register alternate between two values: the constant is the xor of the two values (here 1 xor 2 = 3).&lt;br /&gt;
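&lt;br /&gt;
The same works for any pair of values. A minimal sketch, assuming the two values are 5 and 9 (picked only for illustration), so the constant is 5 xor 9 = 12:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5, because 5 xor 12 = 9 and 9 xor 12 = 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value&lt;br /&gt;
; 4 bytes, 21 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;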
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld (addr+1),a   ; self-modifying code: patch the low byte of the address used below (must run from RAM)&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters: assembly is generally fast enough already, and it is nice to give the user a smaller program that does not use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
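* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, while a conditional jr takes 12 T-states when the jump occurs and 7 when it doesn't, so jp is faster when the jump is taken and jr is faster when it is not.&lt;br /&gt;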
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
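When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. The routine can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;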
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 2 bytes for each use, but 4 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
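&lt;br /&gt;
A possible sketch of that extension follows. The label DispAt is made up, and the code assumes that penRow is stored right after penCol (as in the usual ti83plus.inc) and, like Disp above, that _vputs leaves hl pointing just past the terminating zero:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                         ; pen column, pen row (example values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl              ; hl = address of the inline data&lt;br /&gt;
  ld e,(hl)           ; e = pen column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)           ; d = pen row&lt;br /&gt;
  inc hl              ; hl = start of the string&lt;br /&gt;
  ld (penCol),de      ; assumption: writes penCol (e) and penRow (d) together&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)             ; return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;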
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
Sometimes you need a short delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, consider slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways that waste more time per byte than nop's:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat the body of a loop several times in a row instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
Unrolling trades memory for speed, so use it only where speed matters: most of the time you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : 16 + 10 / n T-states per byte copied, where n is the number of unrolled ldi -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
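This unrolling also works with ldd, and with outi if the loop is closed with jp nz instead (outi counts down b and sets the zero flag when b reaches zero).&lt;br /&gt;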
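&lt;br /&gt;
As a sketch of the outi case (src, port and size are placeholders, and size is assumed to be a multiple of 4 below 256), the loop is closed on the zero flag, since outi counts down b rather than bc:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld b,size         ; number of bytes to send&lt;br /&gt;
 ld c,port&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi              ; out (c),(hl) \ inc hl \ dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz, loopouti   ; zero flag is set once b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;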
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the values of the two 8-bit loop counters from a 16-bit counter, in case you want to do the conversion at assembly time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
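&lt;br /&gt;
For instance, for 1000 iterations (a count chosen only for illustration) the macros give an inner counter of ((1000 - 1) &amp;amp; 0xff) + 1 = 232 and an outer counter of ((1000 - 1) &amp;gt;&amp;gt; 8) + 1 = 4, so the registers can be loaded directly:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 232       ; first pass of the inner loop: 232 iterations&lt;br /&gt;
  ld  d, 4         ; 4 passes in total: 232 + 3 * 256 = 1000&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop3       ; b = 0 after the first pass, so later passes run 256 times&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;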
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
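&lt;br /&gt;
If a constant execution time matters more than the last clock cycle, a branch-free variant is possible. This sketch uses the same adc/sub idea as the lookup table code above and, like the version above, leaves a modified:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	add a,l		; a = low byte of the sum, carry = overflow into h&lt;br /&gt;
	ld l,a&lt;br /&gt;
	adc a,h		; a = l + h + carry&lt;br /&gt;
	sub l		; a = h + carry&lt;br /&gt;
	ld h,a&lt;br /&gt;
;5 bytes, 20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;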
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use these, remember that the flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
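&lt;br /&gt;
For example, a routine that reports its status through both Zero and Carry could end like this (a sketch; which combination means what is up to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; return with Zero set and Carry set&lt;br /&gt;
 cp a           ; sets Zero, resets Carry&lt;br /&gt;
 scf            ; sets Carry, Zero stays set&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; return with Zero reset and Carry set (alters a)&lt;br /&gt;
 or 1           ; resets Zero and Carry&lt;br /&gt;
 scf            ; sets Carry, Zero stays reset&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;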
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If a table sits at an aligned address such as $8000 (theoretical example) and you only use 256 bytes of it ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
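&lt;br /&gt;
A minimal sketch, using the same theoretical $8000 address, that clears such a 256-byte table:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,$8000    ; table aligned to a 256-byte boundary&lt;br /&gt;
 xor a          ; value to store&lt;br /&gt;
 ld b,a         ; b=0 makes djnz loop 256 times&lt;br /&gt;
clearloop:&lt;br /&gt;
 ld (hl),a&lt;br /&gt;
 inc l          ; 4 clocks instead of 6 for inc hl; h never changes inside the table&lt;br /&gt;
 djnz clearloop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;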
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and even make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    jr nz,else    ; the IF condition&lt;br /&gt;
    ;some code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the ELSE code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 10 bytes, 33 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
    .db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
endif:&lt;br /&gt;
; 9 bytes, 31 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
The IF path now takes only 31 T-states. This is a small saving of 2 T-states, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reasons not to use this for 1-byte or 2-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
However, when the ELSE code is a single 2-byte instruction as above, it's usually better to simply execute the ELSE part in all cases, then skip the IF part depending on the condition. This requires that the ELSE instruction does not change the flags being tested (ld does not), so obviously this option won't always be possible:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    ld a,3      ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 8 bytes, 28 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this particular example, the code could be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    ld a,4      ; the ELSE code&lt;br /&gt;
    jr nz,endif&lt;br /&gt;
    dec a       ; the IF code&lt;br /&gt;
endif:&lt;br /&gt;
; 7 bytes, 25 T-states (for IF) or 26 T-states (for ELSE)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1 (a jr whose displacement byte is $FF). When the condition is true, this jumps onto its own displacement byte, $FF, which is the rst $38 opcode.&lt;br /&gt;
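&lt;br /&gt;
Written out as bytes (a sketch for the nz condition; $20 is the jr nz opcode), it looks like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 .db $20,$FF    ; jr nz,-1 : if nz, execution continues at the $FF byte, which is rst $38&lt;br /&gt;
                ; if z, or when the rst $38 handler returns, execution continues here&lt;br /&gt;
; 2 bytes instead of the 3 bytes of jr z,$+3 \ rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;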
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
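Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0'-'9', 'A'-'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		; carry set if a&amp;lt;10&lt;br /&gt;
	ccf		; carry set if a&amp;gt;=10&lt;br /&gt;
	adc a, 30h	; a+$30, plus 1 for 10..15&lt;br /&gt;
	daa		; decimal adjust turns $3B..$40 into $41..$46 ('A'..'F')&lt;br /&gt;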
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
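&lt;br /&gt;
As a usage sketch (the routine names are made up), the trick can turn a whole byte into two hexadecimal characters:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = byte, output: d = high digit, e = low digit (ASCII), destroys a and c&lt;br /&gt;
ByteToHex:&lt;br /&gt;
	ld c,a		; keep a copy of the byte&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca		; high nibble moved to the low nibble&lt;br /&gt;
	call NibbleToHex&lt;br /&gt;
	ld d,a&lt;br /&gt;
	ld a,c&lt;br /&gt;
	call NibbleToHex&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ret&lt;br /&gt;
NibbleToHex:&lt;br /&gt;
	and 0Fh		; keep only the low nibble (0..15)&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;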
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot; trick with JP NZ&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-09-04T12:40:58Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed &amp;quot;better else&amp;quot; timing (thanks to Metalbrain for pointing this out)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller to fit on the calculator. Examples: programs that use a lot of graphics or data, and graphics code for tile mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut a few bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the Z80 use each register in the role it suits best.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/saved copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or you need optimal execution; disable interrupts while using it and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
Operations on the ix or iy registers are even slower, and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply it by, say, 12, so the result (up to 756) has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is pass A (the accumulator) right in the start to HL&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes (8 T-states) while add hl,hl is only 1 byte (11 T-states).&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register and clear the low byte. This works because, in binary, a multiplication by 256 is a shift left by 8 bits with zeros entering from the right. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl  and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register to the stack inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets at using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow one or vice versa. Instead you must use awkward constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. This is relatively easy to do (it adds 5 bytes and 27/35 T-states to the routine), although the method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces code that is ugly and very hard to follow, so comment it thoroughly to help understanding and debugging later.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt routine that demands as much speed as possible when all the normal registers are already in use (a notable example is James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled, because accepting an interrupt pushes the return address at sp.&lt;br /&gt;
Again, keep in mind that this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is not a strict requirement. Beware, though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7; bit takes only 20 T-states&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and at some point you start saving space as well (beyond 10 bytes, i.e. more than 5 pop).&lt;br /&gt;
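&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with a few local variables might look like the following sketch. The label name, the number of locals and the body are purely illustrative, and the routine is assumed to keep its own push/pop balanced:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; hypothetical routine with 4 bytes of local storage addressed through ix&lt;br /&gt;
routine:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix,-4&lt;br /&gt;
 add ix,sp        ; ix = sp - 4&lt;br /&gt;
 ld sp,ix         ; allocate 4 bytes: locals live at ix+0 .. ix+3&lt;br /&gt;
 ld (ix+0),5      ; initialize a local&lt;br /&gt;
 inc (ix+0)       ; operate on it in place&lt;br /&gt;
 ld a,(ix+0)      ; a = 6&lt;br /&gt;
 ld hl,4&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 ld sp,hl         ; drop the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Because sp is moved below the locals before they are used, an interrupt that pushes onto the stack will not overwrite them.&lt;br /&gt;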
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast Z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in Z80 assembly, vector tables (or look-up tables) give smaller and faster code than blocks of comparisons and jumps. In other cases, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Look through a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rarer, more special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* If you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations only, because of all the work needed).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (Z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (usually seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
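&lt;br /&gt;
A minimal sketch of this idea, using the same label = $+offset convention as the examples above. The label name is hypothetical; the index is assumed to be in a and in the range 0..127 (the displacement is a signed byte), and the code must run from RAM:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (readdisp),a   ; patch the displacement byte of the load below&lt;br /&gt;
;...code...&lt;br /&gt;
readdisp = $+2      ; ld a,(ix+d) is encoded as $DD $7E d, so d is the third byte&lt;br /&gt;
 ld a,(ix+0)       ; the 0 is just a placeholder: this now reads the byte at ix+index&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;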
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how they work after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the Z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negation of de&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never use a call immediately followed by a ret:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a constant makes a register alternate between two values: the constant is the xor of the two values (here 1 xor 2 = 3).&lt;br /&gt;
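&lt;br /&gt;
The same idea works for any pair of values, not just 1 and 2. As a small sketch (the values 5 and 9 are arbitrary examples, and hl is assumed to point to the variable), the constant is 5 xor 9 = 12:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5 (%0101 xor %1100 = %1001, %1001 xor %1100 = %0101)&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
; 4 bytes, 21 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;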
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0&lt;br /&gt;
 ld l,a&lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0   ; entry 0 must be present so that Number = 0 reaches A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a    ; l = low byte of the entry address&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l     ; a = high byte of the table plus the carry from the low byte&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment (the code must be in RAM because it modifies itself)&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld (addr+1),a   ; patch the low byte of the address operand below&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic problem of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it doesn't. So jp is faster when the jump is taken and jr is faster when it is not.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;also increments hl and changes flags, but preserves a&lt;br /&gt;
 ret po&lt;br /&gt;
; -&amp;gt; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you execute a call, the address of the instruction after it (pc + 3) is pushed onto the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 2 bytes for each use, but 4 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
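&lt;br /&gt;
One possible way to do that is sketched below. DispAt is a hypothetical name, and the sketch assumes that penCol and penRow are adjacent bytes in RAM (penCol first), as in the usual ti83plus.inc, and that _vputs leaves hl pointing just past the zero terminator, which the Disp routine above already relies on:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20       ; pen column, pen row (example values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl          ; hl = address of the inline data&lt;br /&gt;
  ld e,(hl)       ; e = pen column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)       ; d = pen row&lt;br /&gt;
  inc hl          ; hl = start of the string&lt;br /&gt;
  ld (penCol),de  ; sets penCol and penRow with one store (assumes they are adjacent)&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)         ; return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;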
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (1 T-state less is almost the same delay)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need a delay and cannot afford to alter registers or flags, there are still ways to do it that waste less space than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use.&lt;br /&gt;
* Beware that not all instructions preserve registers or flags.&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
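&lt;br /&gt;
For instance, a delay of a given number of interrupts could be sketched like this (the label name and the count of 10 are arbitrary; how long one interrupt period lasts depends on the interrupt settings mentioned above):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei             ; halt only wakes up on an interrupt, so they must be enabled&lt;br /&gt;
 ld b,10        ; number of interrupts to wait for&lt;br /&gt;
waitloop:&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz waitloop&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;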
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently used in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed, whereas most of the time you prefer size over speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied, where n is the number of unrolled ldi -&amp;gt; ~17 T-states when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
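This unrolling technique also works with ldd, and with outi when the loop is closed with jp nz instead of jp pe (outi counts with b and sets the Zero flag once b reaches zero).&lt;br /&gt;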
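&lt;br /&gt;
As a sketch of the outi variant (src, port and the count of 64 are placeholders, and the byte count is assumed to be a multiple of the unroll factor):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src		; placeholder: address of the data to send&lt;br /&gt;
 ld c,port		; placeholder: output port number&lt;br /&gt;
 ld b,64		; byte count, a multiple of 4 here (0 would mean 256)&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi			; out (c),(hl) then inc hl and dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz,loopouti		; outi sets the Zero flag when b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Hardware that needs a pause between writes may not tolerate back-to-back outi instructions, so check the port's timing requirements before unrolling.&lt;br /&gt;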
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) ((((counter16) - 1) &amp;amp; 0xff) + 1)&lt;br /&gt;
  #define outer_counter8(counter16) ((((counter16) - 1) &amp;gt;&amp;gt; 8) + 1)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
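&lt;br /&gt;
As a worked example, assuming a hypothetical compile-time count of 1000 iterations and an assembler that accepts the macros above, the counters come out as inner = ((1000 - 1) &amp;amp; 0xff) + 1 = 232 and outer = ((1000 - 1) &amp;gt;&amp;gt; 8) + 1 = 4, which gives 232 + 3 * 256 = 1000 passes through the loop body:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
count = 1000				; hypothetical iteration count&lt;br /&gt;
  ld  b, inner_counter8(count)		; b = 232&lt;br /&gt;
  ld  d, outer_counter8(count)		; d = 4&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop3		; 232 passes the first time, then 256 each&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;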
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
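&lt;br /&gt;
A minimal sketch of a routine that reports its status in the carry flag (the routine name and the error label are hypothetical); carry is deliberately the last flag to be altered:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = character&lt;br /&gt;
; output: carry reset if a is an ASCII digit, carry set otherwise&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp '0'&lt;br /&gt;
 ret c		; a below '0': carry is already set&lt;br /&gt;
 cp '9'+1	; carry set if a is '0'..'9'&lt;br /&gt;
 ccf		; invert it: carry set if a is above '9'&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; caller&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,notADigit	; hypothetical error handler&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;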
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a: a=0 if carry was set, a=$FF otherwise).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check the following:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you do not forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice z80 IDE that counts code for you ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If a table is aligned to a 256-byte boundary such as $8000 (theoretical example) and fits within those 256 bytes ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
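&lt;br /&gt;
A sketch of such a loop, assuming a hypothetical 16-byte table placed on a 256-byte boundary so that l never has to carry into h:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,alignedTable	; hypothetical label, low byte of its address is $00&lt;br /&gt;
 ld b,16		; number of entries&lt;br /&gt;
tableLoop:&lt;br /&gt;
 ld a,(hl)		; read the current entry&lt;br /&gt;
 ; use a here&lt;br /&gt;
 inc l			; 4 T-states instead of 6 for inc hl&lt;br /&gt;
 djnz tableLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Note that inc l wraps from $FF to $00 without touching h, so the pointer stays inside the same 256-byte page. Unlike inc hl, inc l also alters the flags.&lt;br /&gt;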
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These tricks are not normally recommended, because some of them confuse disassemblers and even coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    jr nz,else    ;the IF&lt;br /&gt;
    ;some code&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
Instead of the &amp;quot;jr endif&amp;quot; line, you place the first byte of a 3-byte instruction with no side effects, so that the Else instruction becomes its operand!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3&lt;br /&gt;
    jr endif&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
; 10 bytes, 33 or 26 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    ld a,3&lt;br /&gt;
    .db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
    ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
; 9 bytes, 31 or 26 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
The if branch now takes only 31 T-states. A saving of 2 T-states is small but can be useful in tight loops, and it also saves 1 byte!&lt;br /&gt;
The only reasons not to use this for 1- or 2-byte instructions are code readability and bug safety. Watch those flags!&lt;br /&gt;
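&lt;br /&gt;
The same idea can be sketched for a 1-byte Else instruction by using the first byte of a 2-byte instruction. The example below assumes that the flags are not needed afterwards, since the opcode chosen here ($FE, cp n) alters them:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
    cp 7&lt;br /&gt;
    jr nz,else&lt;br /&gt;
    inc b&lt;br /&gt;
    .db $FE  ;cp n, the dec b opcode below becomes its operand&lt;br /&gt;
else:&lt;br /&gt;
    dec b&lt;br /&gt;
endif:&lt;br /&gt;
; 7 bytes instead of 8, and the skip costs 7 T-states instead of 12 for jr endif&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;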
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc,-1 (where cc is the condition and -1 is the raw displacement). This causes a conditional jump to the displacement byte itself ($FF), which is the rst $38 opcode.&lt;br /&gt;
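&lt;br /&gt;
Since assemblers differ in how they accept a raw displacement, a sketch with the bytes written out explicitly may be clearer (the choice of the z condition is arbitrary):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; rst $38 if the Zero flag is set: 2 bytes&lt;br /&gt;
 .db $28,$FF	; jr z,-1 : the jump lands on the $FF byte, which then runs as rst $38&lt;br /&gt;
; execution resumes here when the rst $38 handler returns, or directly if Z is reset&lt;br /&gt;
&lt;br /&gt;
; straightforward equivalent: 3 bytes&lt;br /&gt;
 jr nz,$+3&lt;br /&gt;
 rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;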
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
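The DAA instruction is normally used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0'-'9', 'A'-'F').&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		; carry set if a is 0-9&lt;br /&gt;
	ccf		; carry set if a is 10-15&lt;br /&gt;
	adc a, 30h	; $30-$39 for 0-9, $3B-$40 for 10-15&lt;br /&gt;
	daa		; adds 6 in the second case, giving $41-$46&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;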
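&lt;br /&gt;
As a usage sketch, the conversion can be wrapped to print a byte as two hexadecimal digits; putChar is a hypothetical routine that displays the character held in a:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = byte to display&lt;br /&gt;
dispHexByte:&lt;br /&gt;
	push af&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca			; high nibble moved to bits 0-3&lt;br /&gt;
	call dispHexNibble&lt;br /&gt;
	pop af			; then fall through for the low nibble&lt;br /&gt;
dispHexNibble:&lt;br /&gt;
	and $0F			; keep a value from 0 to 15&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa			; a = '0'-'9' or 'A'-'F'&lt;br /&gt;
	jp putChar		; hypothetical output routine, its ret returns for us&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;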
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-09-04T12:20:22Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed &amp;quot;looping with 16 bit counter&amp;quot; timing (thanks to Metalbrain for pointing this out)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so that it fits on the calculator. Examples: programs that consume a lot of graphics or data, and graphics code for tile mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks section of this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally use the registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep a convention and to plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use it only when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
With the ix or iy registers, operations are slower still and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine that:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- you can however multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now to the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The usual way is to copy A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved off a few clock cycles and saved a byte, too.&lt;br /&gt;
You can do the same with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0	&amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, whereas add hl,hl is only 1 byte.&lt;br /&gt;
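&lt;br /&gt;
Written out with byte and T-state counts, the trade-off looks like this (a sketch, assuming the value in l is below 64 as stated above):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 8-bit shifts first&lt;br /&gt;
	sla l		; l*2&lt;br /&gt;
	sla l		; l*4, still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
; 6 bytes, 23 clocks&lt;br /&gt;
&lt;br /&gt;
; 16-bit shifts&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl	; hl=l*2&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
; 4 bytes, 29 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;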
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros entering from the right. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations without relying on RAM storage or on pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
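; INPUT: MPR (MULTIPLIER) IN B'C'BC, MPD (MULTIPLICAND) IN D'E'DE&lt;br /&gt;
; INTERRUPTS ARE LEFT DISABLED ON EXIT&lt;br /&gt;
MUL32:&lt;br /&gt;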
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to assign from a standard register to a shadow one or vice-versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. This is relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although the method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used as a general-purpose register only in desperate situations, generally in an interrupt loop that demands as much speed as possible when all the normal registers are already in use (a remarkable use is James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not generally known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers and no calls are allowed.&lt;br /&gt;
Mind again that this is only used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally trash registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (when dropping more than 10 bytes, since 5 pop instructions take the same 5 bytes).&lt;br /&gt;
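&lt;br /&gt;
Putting allocation, access and deallocation together, a sketch of a routine that keeps 4 bytes of temporaries on the stack could look like this (the routine name and the layout of the temporaries are hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
withLocals:&lt;br /&gt;
 push ix		; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix		; 4 bytes allocated, ix points to the first one&lt;br /&gt;
&lt;br /&gt;
 ld (ix + 0), 5		; initialize a temporary&lt;br /&gt;
 ld (ix + 1), a		; store a in another one&lt;br /&gt;
 inc (ix + 0)		; operate on it in place&lt;br /&gt;
 add a, (ix + 0)&lt;br /&gt;
&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl		; 4 bytes dropped (destroys hl)&lt;br /&gt;
 pop ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;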
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or look-up tables) often give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
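&lt;br /&gt;
As an illustration of a vector table, here is a sketch of a dispatch through a table of addresses; the handler labels are hypothetical and a is assumed to already hold a valid index from 0 to 3:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 add a,a		; each table entry is 2 bytes&lt;br /&gt;
 ld e,a&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld hl,handlerTable&lt;br /&gt;
 add hl,de		; hl points to the entry&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a			; hl = handler address (stored low byte first)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
handlerTable:&lt;br /&gt;
 .dw handler0, handler1, handler2, handler3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The cost is the same whichever entry is selected, whereas a chain of cp / jr z pairs grows with every case added.&lt;br /&gt;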
&lt;br /&gt;
Browse a complete instruction set reference to look for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the more special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations only, because of all the work needed).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so that you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. An instruction set reference that explains the opcodes is useful here.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (usually seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC trick is to modify the displacement of load instructions that use (ix+0), changing the 0 to other values to very quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
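&lt;br /&gt;
A sketch of that trick, assuming the code runs from RAM, ix points to the start of the list and a holds an index from 0 to 127 (the label is hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemIndex),a	; patch the displacement byte of the load below&lt;br /&gt;
;...&lt;br /&gt;
elemIndex = $+2		; the displacement is the third byte of the instruction&lt;br /&gt;
 ld b,(ix+0)		; 0 is just a placeholder: b = element number a of the list&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;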
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of them if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after a while away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no copy in BC is needed)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;load the negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states! (add hl,de does not update the zero and sign flags like sbc hl,de does)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases, and it leaves de untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never end a routine with a call followed by a ret:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if xxxx does not use the return address pushed by the call. Example: some kind of inline vputs that reads its data from there.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Runs the routine twice, then falls through into bar2 for two more&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Runs the routine once, then falls through into bar&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Routine body&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with the constant (value1 xor value2) makes a register alternate between the two values: here 1 xor 2 = 3.&lt;br /&gt;
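&lt;br /&gt;
This works for any pair of values. As a small worked sketch (the values 5 and 9 are arbitrary, chosen only for illustration): 5 xor 9 = %0101 xor %1001 = %1100 = 12.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5 (12 = 5 xor 9)&lt;br /&gt;
 ld (hl),a      ; store the new value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;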
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 127 or less)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 127 or less)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 127 or less)&lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give users a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to the absolute one. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it doesn't, so jp is faster for taken jumps and jr is faster for jumps that are not taken.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements BC, but also increments HL and sets Z from comparing A with (HL)&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
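When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. The routine can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;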
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), at the cost of 5 bytes of overhead (Disp routine, counting bcall as 3 bytes)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
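&lt;br /&gt;
A minimal sketch of such an extension follows. It assumes the TI-83 Plus equates penCol and penRow from ti83plus.inc and that penRow is stored directly after penCol, so that one 16-bit store sets both; the routine name DispAt is hypothetical.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                 ;penCol, penRow (inline coordinates)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                    ;hl points to the inline data&lt;br /&gt;
  ld e,(hl)                 ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                 ;d = row&lt;br /&gt;
  inc hl                    ;hl points to the string&lt;br /&gt;
  ld (penCol),de            ;penCol = e, penRow = d (assumed adjacent)&lt;br /&gt;
  bcall(_vputs)             ;leaves hl just after the terminating 0&lt;br /&gt;
  jp (hl)                   ;return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;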
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads and writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because the port read overwrites it.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (1 T-state less is almost the same delay)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need a delay and cannot afford to alter registers or flags, there are still ways to get a longer delay per byte than with nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
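&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming interrupts fire at a steady rate and that b is free (the label waitFrames and the count 10 are arbitrary):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld b,10          ; number of interrupts to wait for&lt;br /&gt;
waitFrames:&lt;br /&gt;
 ei               ; halt never wakes up if interrupts are disabled&lt;br /&gt;
 halt             ; sleep until the next interrupt&lt;br /&gt;
 djnz waitFrames&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;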
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently used in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -&amp;gt; ~17 T-states per byte for n = 8&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if size is not a multiple of the number of unrolled ldi, a small trick must be used to jump to the right place inside the unrolled block for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; ldi resets P/V when bc reaches 0; jp is used because it is faster, and when unrolling we assume speed matters more than size&lt;br /&gt;
; add a ret here if this is a subroutine that you call at loopldi.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi counts with b and sets the zero flag when b reaches 0, so loop with jp nz instead).&lt;br /&gt;
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but has a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (approximately 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the 8-bit loop counters from a 16-bit counter if you want to do the conversion at assembly time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
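&lt;br /&gt;
As a worked sketch of those macros (the count 600 is arbitrary): 599 is 2 * 256 + 87, so inner_counter8(600) = 87 + 1 = 88 and outer_counter8(600) = 2 + 1 = 3. With the counters loaded as constants, the four set-up instructions are no longer needed.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 88      ; inner_counter8(600)&lt;br /&gt;
  ld  d, 3       ; outer_counter8(600)&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here, runs 88 + 256 + 256 = 600 times&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;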
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines test whether b is 0; same size and speed, but the second preserves the accumulator&lt;br /&gt;
; remarks: - 8-bit inc/dec do not affect the carry flag&lt;br /&gt;
;          - 16-bit inc/dec do not affect any flags, so this trick does not work on register pairs.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair (destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you may want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them useful as routine outputs that indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags may require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
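&lt;br /&gt;
For instance, a routine that wants to return with both the Zero and Carry flags set could combine two of the tricks listed earlier, as in this small sketch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 cp a     ; Zero set, Carry reset&lt;br /&gt;
 scf      ; Carry set, Zero left untouched&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;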
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;copy the zero flag into carry (the jr skips the ccf when Z is set)&lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;copy the carry flag into the zero flag (destroys a)&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you don't forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in l and the high byte of the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If it uses at most 256 bytes ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
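&lt;br /&gt;
A minimal sketch (the label table is hypothetical and is assumed to be exactly 256 bytes long and aligned to a 256-byte boundary): because 8-bit inc also updates the zero flag, inc l can double as the end-of-table test.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,table     ; low byte of the address is 0&lt;br /&gt;
 xor a           ; running 8-bit sum&lt;br /&gt;
sumLoop:&lt;br /&gt;
 add a,(hl)&lt;br /&gt;
 inc l           ; next entry, 4 clocks instead of 6&lt;br /&gt;
 jr nz,sumLoop   ; loop until l wraps back to 0 (256 entries)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;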
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them disturb disassembly and make the code hard to understand, even for other coders.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the else code is a single 2-byte instruction:&lt;br /&gt;
Instead of the &amp;quot;jr endif&amp;quot; line, you use only the first byte of a 3-byte instruction that has no side effects!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, the processor now executes a jp nz,XXXX instruction whose address XXXX is made of the two bytes of the ld a,4. You already know what the flags will be at this point (Z is set, because the jr nz was not taken and ld a,3 does not change flags), so the jump is never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On the if path the never-taken jp nz takes 10 T-states instead of the 12 of jr endif, and it occupies 1 byte instead of 2. A small saving, but it could be useful in tight loops!&lt;br /&gt;
The same idea works for 1-byte instructions; the only reasons not to use it would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc with a displacement of -1 (jr cc,$+1). This causes a conditional jump to the displacement byte itself ($FF), which is the rst $38 opcode, so the whole construct takes 2 bytes.&lt;br /&gt;
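&lt;br /&gt;
Spelled out as raw bytes, a sketch for the carry condition (the opcode of jr c happens to be $38 as well; other conditions use their own jr opcodes):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 .db $38,$FF    ; jr c,$+1: if carry is set, jump to the $FF byte, which executes as rst $38&lt;br /&gt;
                ; if carry is reset, execution continues after these 2 bytes&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;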
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
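Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into the corresponding ASCII hexadecimal digit (&amp;quot;0&amp;quot; to &amp;quot;9&amp;quot;, &amp;quot;A&amp;quot; to &amp;quot;F&amp;quot;).&lt;br /&gt;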
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
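	cp 10        ; carry set if a is 0 to 9&lt;br /&gt;
	ccf          ; carry set if a is 10 to 15&lt;br /&gt;
	adc a, 30h   ; $30 to $39 for 0 to 9, $3B to $40 for 10 to 15&lt;br /&gt;
	daa          ; decimal adjust adds 6 to $3B-$40, giving $41 to $46 (A to F)&lt;br /&gt;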
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;br /&gt;
* Metalbrain&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-09-04T12:12:29Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed &amp;quot;small adjustable delay&amp;quot; timing&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Examples: programs heavy on graphics or data, and graphics code such as tile mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good algorithms on the Z80 generally use the registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operating on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, either because the equivalent 16-bit instruction is slower or because it does not exist and must be replaced by several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and each instruction is at least 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine that:&lt;br /&gt;
* you pass a value to a routine in the accumulator&lt;br /&gt;
* the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result (up to 756) has to be stored in a 16-bit register pair&lt;br /&gt;
* you can still multiply a by 4 before it overflows (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved off a few clock cycles and saved a byte.&lt;br /&gt;
You can do the same with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can use &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and then use hl for 16-bit operations. Here you are trading size for speed: each sla instruction is 2 bytes (8 T-states), whereas add hl,hl is only 1 byte (11 T-states).&lt;br /&gt;
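&lt;br /&gt;
As a rough side-by-side sketch of that trade-off (the counts are the standard Z80 figures; l is assumed to hold a value in 0..63):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Smaller: 4 bytes, 29 T-states&lt;br /&gt;
	ld h,0		; 2 bytes, 7 T-states&lt;br /&gt;
	add hl,hl	; 1 byte, 11 T-states&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
&lt;br /&gt;
; Faster: 6 bytes, 23 T-states&lt;br /&gt;
	sla l		; 2 bytes, 8 T-states&lt;br /&gt;
	sla l		; l*4, still fits in 8 bits because l&amp;lt;64&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;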
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
Apply them only when you really need the speed and the code is already bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is simply to load the low byte register into the high byte register and clear the low byte. This works because, in binary, a multiplication by 256 is an 8-bit left shift that brings in zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
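&lt;br /&gt;
The same idea works in the other direction. A minimal sketch of an unsigned division by 256, which simply discards the low byte:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; divide hl by 256 (unsigned) and store in hl&lt;br /&gt;
	ld l,h&lt;br /&gt;
	ld h,0&lt;br /&gt;
; divide ade (pseudo 24-bit pair register) by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,d&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;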
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, as loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Note that ixh/ixl/iyh/iyl are undocumented, and using them breaks compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
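&lt;br /&gt;
A minimal sketch of ixl used as a loop counter follows. These are undocumented instructions, and the mnemonic spelling (ixl here) is an assumption: some assemblers use other names or need the opcodes entered with .db.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8	; undocumented, opcode $DD $2E $08&lt;br /&gt;
loop:&lt;br /&gt;
;code using a,bc,de,hl (ix is tied up as the counter)&lt;br /&gt;
	dec ixl		; undocumented, opcode $DD $2D, sets the zero flag when ixl reaches 0&lt;br /&gt;
	jr nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;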
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl), and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
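; 32-BIT MULTIPLY: H'L'HL = B'C'BC * D'E'DE (LOW 32 BITS OF THE PRODUCT)&lt;br /&gt;
; LEAVES INTERRUPTS DISABLED, DESTROYS AF AND BOTH REGISTER SETS&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;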
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
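&lt;br /&gt;
A possible way to call the routine above, as a sketch under the assumption (read from its comments) that the multiplier is passed in B'C'BC and the multiplicand in D'E'DE:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
        DI                      ; KEEP INTERRUPT CODE AWAY FROM THE SHADOW SET&lt;br /&gt;
        LD      BC,$0001        ; HIGH WORD OF MPR&lt;br /&gt;
        LD      DE,$0000        ; HIGH WORD OF MPD&lt;br /&gt;
        EXX                     ; THEY ARE NOW IN BC' AND DE'&lt;br /&gt;
        LD      BC,$2345        ; LOW WORD OF MPR ($00012345)&lt;br /&gt;
        LD      DE,$0010        ; LOW WORD OF MPD ($00000010)&lt;br /&gt;
        CALL    MUL32           ; H'L'HL = $00123450&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;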
&lt;br /&gt;
Shadow registers can be a great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot be mixed with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow register or vice versa. Instead you must use awkward constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
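 ; copies hl into the other register set&lt;br /&gt;
 push hl&lt;br /&gt;
 exx     ; the shadow set is now the active one&lt;br /&gt;
 pop hl  ; the active hl (formerly hl') now holds the old hl value&lt;br /&gt;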
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. This is relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although the method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick: it copies IFF2 into P/V, so P/V is set if and only if interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it thoroughly to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled, because an interrupt pushes its return address at sp.&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated pushes, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since all allocated &amp;quot;variables&amp;quot; can then be reached without inc/dec, but this is not a strict requirement. Beware, though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry: costs 1 byte and 2 T-states more than a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, that is, more than 5 pops).&lt;br /&gt;
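&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch of a routine with 4 bytes of local storage could look like this (the label and the use made of the local variable are purely illustrative):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
routine:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp       ; ix = sp - 4&lt;br /&gt;
 ld sp, ix        ; allocate 4 bytes, reachable as (ix + 0) to (ix + 3)&lt;br /&gt;
&lt;br /&gt;
 ld (ix + 0), a   ; store a local variable&lt;br /&gt;
 inc (ix + 0)     ; operate on it in place&lt;br /&gt;
 ld a, (ix + 0)   ; read it back&lt;br /&gt;
&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; drop the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;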
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast Z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, optimize the most frequently executed code: subroutines that are called often and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in Z80 assembly vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. At other times, driving a task from a chunk of data works better than a more conventional programming method (notably in graphical screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of your code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code looks like it won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work a rewrite takes).&lt;br /&gt;
* Remember that it is almost always better to leave optimization until the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed at all.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (Z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used together with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
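&lt;br /&gt;
The byte written into the jr is its displacement, that is, the number of bytes to skip. A sketch of how it could be computed for a rotation like the one above, assuming b holds the number of rotations wanted (0 to 8) and c the value to rotate (the label rotskip is hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,8&lt;br /&gt;
 sub b            ; a = number of 1-byte rrca instructions to skip&lt;br /&gt;
 ld (rotskip),a   ; patch the displacement of the jr below&lt;br /&gt;
 ld a,c           ; value to rotate&lt;br /&gt;
rotskip = $+1&lt;br /&gt;
 jr $+2           ; displacement 0 as a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca             ; a = c rotated right b times&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;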
&lt;br /&gt;
Another SMC technique is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list really quickly without using any extra registers.&lt;br /&gt;
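&lt;br /&gt;
A minimal sketch of this technique, assuming ix points to the start of the list and a holds the element number (0 to 127); the label element is hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (element),a   ; patch the displacement byte&lt;br /&gt;
;... code ...&lt;br /&gt;
element = $+2     ; the displacement is the third byte of the instruction ($DD $7E d)&lt;br /&gt;
 ld a,(ix+0)      ; reads (ix+n) once patched&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;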
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of them if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about these. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the Z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every additional ld (dataX),a once past the initial overhead (but ldir is slower: up to 21 T-states per byte against 13 for each ld (dataX),a)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need to keep a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
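    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but unlike sbc hl,de, add hl,de does not update the zero or sign flags, and the carry flag comes out inverted)&lt;br /&gt;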
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states and leave de untouched&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states (changes a!)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states (changes a!)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
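  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states on the fall-through path (a is no longer preserved when the routine returns)&lt;br /&gt;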
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never use a call immediately followed by a return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call (some kind of inline vputs does, for example).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
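&lt;br /&gt;
Each extra nesting level doubles the repeat count for 3 more bytes. A sketch of the same routine run eight times (bar4 is a hypothetical label added for illustration):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar4     ; Run routine eight times&lt;br /&gt;
bar4:&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;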
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a constant makes a register alternate between two values: the constant is the xor of the two values (here 1 xor 2 = 3).&lt;br /&gt;
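&lt;br /&gt;
The same pattern works for any pair of values. As a sketch, toggling a variable between 5 and 9 (values chosen only for illustration) and storing it back:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; 5 xor 9 = %0101 xor %1001 = %1100 = 12, so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value&lt;br /&gt;
; 4 bytes, 21 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;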
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment (self-modifying, so this code must run from RAM)&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (beyond -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes' worth)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp is faster when the jump is taken (10 T-states against 12 for jr) and jr is faster when it is not taken (7 T-states against 10).&lt;br /&gt;
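&lt;br /&gt;
As an illustration of that trade-off, here is a sketch of the closing branch of a loop that is taken on almost every pass (the loop body and the use of c as the counter are assumptions):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
loop:&lt;br /&gt;
;loop body&lt;br /&gt;
 dec c&lt;br /&gt;
 jp nz,loop     ; 3 bytes, always 10 T-states: 2 T-states faster than jr on every pass but the last&lt;br /&gt;
&lt;br /&gt;
;the size-optimized alternative for the same branch&lt;br /&gt;
 jr nz,loop     ; 2 bytes, 12 T-states when taken, 7 when not taken&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;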
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
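When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;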
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
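  jp (hl)   ;_vputs left hl just past the zero terminator, i.e. at the code after the string&lt;br /&gt;
; -&amp;gt; saves 3 bytes for each use (call instead of ld hl + bcall), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;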
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
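&lt;br /&gt;
A minimal sketch of that expansion follows. It assumes the TI-83 Plus system equates penCol and penRow (with penRow stored at penCol+1) and that _vputs leaves hl just past the zero terminator, which the Disp routine above already relies on. DispAt is a hypothetical name.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Usage:&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                ;pen column, pen row&lt;br /&gt;
  .db &amp;quot;Text at 10,20&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                   ;hl = address of the inline data&lt;br /&gt;
  ld e,(hl)                ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                ;d = row&lt;br /&gt;
  inc hl                   ;hl = start of the string&lt;br /&gt;
  ld (penCol),de           ;writes e to penCol and d to penRow (assumed to be penCol+1)&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)                  ;resume right after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;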
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;7 T-states in 1 byte; a doesn't need to be preserved because the in below overwrites it.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (1 T-state less is almost the same delay)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take fewer bytes than a string of nop's:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
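&lt;br /&gt;
A sketch of the halt-based delay from the last note. It assumes the default interrupt settings; how long each halt lasts depends on the interrupt rate, so the count of 10 is only illustrative.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei                ; halt never wakes up if interrupts are disabled&lt;br /&gt;
 ld b,10           ; number of interrupts to wait for (illustrative value)&lt;br /&gt;
waitLoop:&lt;br /&gt;
 halt              ; sleep until the next interrupt&lt;br /&gt;
 djnz waitLoop&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;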
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat a loop body several times instead of looping on a single copy; this is frequently done in multiplication routines.&lt;br /&gt;
This trades memory for speed, so only do it where speed really matters: most of the time size is preferred to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : (16 * n + 10) / n T-states per byte copied -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi counts down b and sets the Zero flag instead of using P/V, so loop with jp nz).&lt;br /&gt;
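&lt;br /&gt;
The comment in the unrolled loop mentions a small trick for sizes that are not a multiple of the unroll count. When the size is a constant known at assembly time, one way to sketch it is to enter the loop part-way through, so that the first pass copies only the remainder of size divided by 8. This assumes a size greater than 0 and an assembler whose expressions support bitwise AND and forward references; copysize and loopldi2 are hypothetical names and the value 20 is arbitrary.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
copysize = 20&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,copysize&lt;br /&gt;
 jp loopldi2 + 2*((8 - (copysize &amp;amp; 7)) &amp;amp; 7)   ; each ldi is 2 bytes long&lt;br /&gt;
loopldi2:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi              ; for copysize = 20 the first pass starts here and copies 4 bytes&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi2  ; then 2 full passes copy the remaining 16 bytes&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;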
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a few more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states, since djnz costs 13 T-states when taken)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
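  dec  de          ; de = number of iterations&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b           ; b = inner counter: low byte of (count - 1), plus 1&lt;br /&gt;
  inc  d           ; d = outer counter: high byte of (count - 1), plus 1&lt;br /&gt;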
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((((counter16) - 1) &amp;amp; 0xff) + 1) &amp;amp; 0xff)&lt;br /&gt;
  #define outer_counter8(counter16) (((((counter16) - 1) &amp;gt;&amp;gt; 8) + 1) &amp;amp; 0xff)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
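&lt;br /&gt;
As a worked example (the count of 1000 is arbitrary): 1000 - 1 = 999 = $03E7, so the inner counter is $E7 + 1 = 232 and the outer counter is $03 + 1 = 4. The loop body then runs 232 + 3 * 256 = 1000 times. A sketch, assuming an assembler that accepts the function-style macros above; loop3 is a hypothetical label.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; b = 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; d = 4&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here (must preserve b and d)&lt;br /&gt;
&lt;br /&gt;
  djnz loop3&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;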
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair (the second version destroys a instead of de)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set/reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
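&lt;br /&gt;
For instance, a routine that wants to return with both the Zero flag and the Carry flag set can combine two of the tricks above, altering the carry last (a minimal sketch; the label is hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
returnZandC:&lt;br /&gt;
 cp a      ; Zero set, Carry reset, a unchanged&lt;br /&gt;
 scf       ; Carry set, Zero untouched&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;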
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry (carry = 1 if Z was set, 0 otherwise)&lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (Z set if carry was set; destroys a)&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to measure your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has a &amp;quot;.echo&amp;quot; directive and use it in the source to count size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If a table sits at an aligned address such as $8000 (theoretical example) and fits within 256 bytes ($8000 to $80FF), use inc l instead of inc hl to step to the next byte (2 clocks faster).&lt;br /&gt;
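&lt;br /&gt;
A minimal sketch, assuming a hypothetical 16-byte table whose entries all share the same high address byte, so that l never wraps while stepping through it:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; sums the 16 bytes of table into a (modulo 256)&lt;br /&gt;
 ld hl,table&lt;br /&gt;
 ld b,16&lt;br /&gt;
 xor a&lt;br /&gt;
sumLoop:&lt;br /&gt;
 add a,(hl)&lt;br /&gt;
 inc l          ; 4 T-states instead of 6 for inc hl&lt;br /&gt;
 djnz sumLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;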
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and even make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3-byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where XXXX is the two bytes of the next instruction. You already know what the flags will be here (Z is still set, since the jr nz fell through and ld a,3 does not change flags), so the jump is never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
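With this trick the if path takes 31 T-states instead of 33, because the never-taken jp nz costs 10 T-states where jr endif cost 12. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The same trick works for a 1-byte else instruction if you use the first byte of a 2-byte instruction instead; the only reasons not to use it would be code readability and bug safety. Watch those flags!&lt;br /&gt;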
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
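For a smaller conditional rst $38, use jr cc, -1 (written jr cc,$+1 in assemblers that expect a target address). The displacement byte is $FF, so when the condition holds the jump lands on that byte and executes it as the rst $38 opcode; otherwise execution simply continues after the jr.&lt;br /&gt;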
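&lt;br /&gt;
A sketch of the comparison, using the carry flag as the condition and assuming an assembler where $ is the address of the current instruction:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Usual way: 3 bytes&lt;br /&gt;
 jr nc,skip     ; 2 bytes&lt;br /&gt;
 rst $38        ; 1 byte&lt;br /&gt;
skip:&lt;br /&gt;
&lt;br /&gt;
; Trick: 2 bytes&lt;br /&gt;
 jr c,$+1       ; assembles to $38 $FF; when carry is set, the $FF byte is executed as rst $38&lt;br /&gt;
; the routine at $38 returns to the instruction that follows the jr&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;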
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0' to '9' or 'A' to 'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
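&lt;br /&gt;
Tracing two inputs shows how the four instructions above behave. For a = 7, cp 10 sets the carry, ccf clears it, adc gives $37 and daa leaves it unchanged ('7'). For a = 12, the carry ends up set, adc gives $3D and daa adds 6 to give $43 ('C'). A sketch of a routine built on the trick (hypothetical labels), converting the byte in a into two ASCII hex digits in d (high) and e (low):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
ByteToHex:&lt;br /&gt;
	ld e,a            ; keep a copy of the byte&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca              ; high nibble moved to the low nibble&lt;br /&gt;
	call NibbleToHex&lt;br /&gt;
	ld d,a            ; d = ASCII digit of the high nibble&lt;br /&gt;
	ld a,e&lt;br /&gt;
	call NibbleToHex&lt;br /&gt;
	ld e,a            ; e = ASCII digit of the low nibble&lt;br /&gt;
	ret&lt;br /&gt;
NibbleToHex:&lt;br /&gt;
	and 0Fh           ; keep only the low nibble (0 to 15)&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;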
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T20:42:34Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed counting of how many extra bytes and T-states it takes to check and re-enable interrupts only if required&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller to fit on the calculator. Examples: programs that consume a lot of graphics/data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good Z80 algorithms generally use registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor operates faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte larger.&lt;br /&gt;
Operations on the ix or iy registers are slower still and are always 1 byte bigger per instruction. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass through the accumulator a value to a routine&lt;br /&gt;
- if the only valid values of the accumulator range from 0 to 63 and the routine needs to multiply the accumulator by, say, 12, the result has to be stored in a 16-bit register pair (63*12 = 756 does not fit in 8 bits).&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252 which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes and 8 T-states, while add hl,hl is only 1 byte but takes 11 T-states.&lt;br /&gt;
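&lt;br /&gt;
Written out with byte and T-state counts, the two alternatives compare as follows (a sketch for a value in l below 64, with the result l*4 in hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Faster: 6 bytes, 23 T-states&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4, still fits in 8 bits&lt;br /&gt;
	ld h,0&lt;br /&gt;
&lt;br /&gt;
; Smaller: 4 bytes, 29 T-states&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl	; hl=l*2&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;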
&lt;br /&gt;
Mind that these optimizations can produce bugs and code that is somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to just load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl (the undocumented halves of ix and iy) and even the i register for loop counters, instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. It's relatively easy to do (adding 5 bytes and 27/35 T-states to the routine), although this method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally in an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled (an interrupt would push its return address at whatever address sp currently holds).&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7 (bit takes only 20 T-states)&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop, but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pops).&lt;br /&gt;
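&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with 4 bytes of local storage could look like the following. This is only a sketch with a hypothetical label; it assumes every push inside the routine is matched by a pop before the locals are dropped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
WithLocals:&lt;br /&gt;
 push ix            ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix          ; allocate 4 bytes; ix points at the lowest one&lt;br /&gt;
 ld (ix + 0), a     ; store a in the first local&lt;br /&gt;
 ; ... code that may push/pop and use (ix + 0) to (ix + 3) ...&lt;br /&gt;
 ld a, (ix + 0)     ; read the local back&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl          ; drop the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix             ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;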
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast Z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most heavily used code: subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in Z80 assembly, vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Look up in a complete instruction set for searching some instruction that can optimize somewhere in the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rare, special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code still won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use only in desperate situations because of all the work needed to rewrite it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is already very fast and often smaller than higher-level languages, further optimization may not even be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (Z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used to load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2     ; the displacement byte is a placeholder, overwritten with the number of rrca to skip&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC technique is to patch the displacement of load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
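&lt;br /&gt;
A minimal sketch of this idea follows. The labels index and getElem are hypothetical, ix is assumed to already point at a list of 1-byte elements, and the code must run from RAM. Since ld a,(ix+d) assembles to $DD,$7E,d, the displacement is the third byte of the instruction:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(index)       ; index of the wanted element, must fit in a signed byte (-128 to 127)&lt;br /&gt;
 ld (getElem+2),a   ; patch the displacement byte of the instruction below&lt;br /&gt;
;...code...&lt;br /&gt;
getElem:&lt;br /&gt;
 ld a,(ix+0)        ; the 0 is just a placeholder : a = element at ix+index&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;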
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: if you really want the fastest speed and the smallest code, first optimize your algorithm and register allocation before applying any of them.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how they work after some time away from the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have restrictions on their use; the comments warn about them. Some tricks apply to other cases as well, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the Z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX),a after passing the initial overhead (13 bytes instead of 16 here), but ldir makes it slower&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need to restore BC as well)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but add hl,de does not update the zero and sign flags as sbc hl,de does)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases (and leaves de untouched)&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the pushed pc to stack is not passed to the call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor makes a register alternate between two values: applying the same xor twice gives back the original value.&lt;br /&gt;
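&lt;br /&gt;
The same idea extends to any pair of values, because xoring either value with (v1 xor v2) yields the other one. A sketch, assuming the two values are 5 and 9 (chosen only for illustration):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor $0C        ; $0C = 5 xor 9 (%0101 xor %1001 = %1100), so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;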
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 0-127)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 0-127)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below) and your code runs from RAM (this version is self-modifying), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be 0-127)&lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same destination nearby, do a relative jump to the absolute one. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump is taken and 7 when it is not. So jp is faster when the jump occurs and jr is faster when it doesn't.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call, pc + 3 (the address right after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string is gone), at the cost of 5 bytes of overhead for the Disp routine, counting bcall as 3 bytes&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
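&lt;br /&gt;
A possible sketch of that extension follows. It assumes a TI-83 Plus style include file in which penCol and penRow are adjacent bytes (penCol first), and that _vputs leaves hl just past the zero terminator, as the Disp routine above already relies on. DispAt is a hypothetical name:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                     ; column 10, row 20&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl           ; hl = address of the inline data&lt;br /&gt;
  ld e,(hl)        ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)        ; d = row&lt;br /&gt;
  inc hl           ; hl now points to the string&lt;br /&gt;
  ld (penCol),de   ; writes e to penCol and d to penRow (assumed adjacent)&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)          ; resume execution right after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;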
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable) : 7 for ld b, 13 for each djnz that loops and 8 for the last one&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently used in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time, though, you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : 16 + 10 / n T-states per byte copied, where n is the number of unrolled ldi -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
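This unrolling of ldi also works with ldd, and with outi (which counts down in b and sets the zero flag when b reaches 0, so loop with jp nz instead).&lt;br /&gt;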
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (roughly 13 * n + 14 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to set proper values for the 8-bit loop counters given a 16-bit counter, in case you want to do the conversion at compile time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
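&lt;br /&gt;
As a worked example (the count of 600 is arbitrary): 600 - 1 = 599 = $0257, so the inner counter is $57 + 1 = 88 and the outer counter is $02 + 1 = 3. The first pass runs 88 times and the two remaining passes run 256 times each, for 88 + 512 = 600 iterations. With the count known at assembly time, the setup reduces to two loads:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 88        ; inner_counter8(600)&lt;br /&gt;
  ld  d, 3         ; outer_counter8(600)&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;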
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
Were you to use this, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the zero, sign and parity/overflow flags untouched.&lt;br /&gt;
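&lt;br /&gt;
As an illustration of using a flag as a routine's status output, here is a small sketch (isDigit and notADigit are hypothetical names) that reports through the carry flag whether a holds an ASCII digit, without altering a:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; output: carry reset if a is an ASCII digit, carry set otherwise&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp $30        ; '0'&lt;br /&gt;
 ret c         ; a &amp;lt; '0' : carry is already set&lt;br /&gt;
 cp $3A        ; '9'+1&lt;br /&gt;
 ccf           ; a &amp;lt;= '9' gave carry set, so invert it&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
;in the caller:&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,notADigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;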
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a; carry ends up inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check the following:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you don't forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has &amp;quot;.echo&amp;quot; directive and use this in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Use an aligned address in memory such as $8000 (theoretical example). If you will only use 256 bytes ($8000 to $80FF), use inc l instead of inc hl to get the next byte (2 clocks faster).&lt;br /&gt;
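&lt;br /&gt;
A short sketch of this, assuming a hypothetical 256-byte buffer whose address has a low byte of $00. Because 8-bit inc (unlike 16-bit inc) updates the zero flag, the wrap of l from $FF to $00 can also serve as the loop exit test:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,buffer     ; buffer is assumed to be aligned on a 256-byte boundary&lt;br /&gt;
 xor a&lt;br /&gt;
clearLoop:&lt;br /&gt;
 ld (hl),a        ; clear one byte&lt;br /&gt;
 inc l            ; 4 T-states instead of 6 for inc hl&lt;br /&gt;
 jr nz,clearLoop  ; l wrapped to $00 : all 256 bytes are done&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;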
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them disturb disassembly and even other coders' understanding of the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, the CPU now executes a jp nz,XXXX instruction whose XXXX operand is the two bytes of ld a,4. Because the jr nz above was not taken and ld a,3 does not touch the flags, z is known to be set at this point, so the jump is never taken and the two bytes are simply skipped.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
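Counting from cp 7, the if path now takes 31 T-states instead of 33 (jp nz is 10 T-states whether taken or not, jr is 12), and the code is 1 byte smaller. A small saving, but it could be useful in tight loops.&lt;br /&gt;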
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1. This will cause a conditional jump to the displacement byte ($FF) which is the rst $38 opcode. &lt;br /&gt;
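&lt;br /&gt;
As a sketch, here is the carry case written out as raw bytes, next to the usual form it replaces (the skip label is illustrative only):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Usual form: 3 bytes&lt;br /&gt;
 jr nc, skip&lt;br /&gt;
 rst $38&lt;br /&gt;
skip:&lt;br /&gt;
&lt;br /&gt;
; Trick: 2 bytes&lt;br /&gt;
 .db $38, $FF   ; jr c, -1: $38 is the jr c opcode, $FF the displacement&lt;br /&gt;
; if carry is set, the jump lands on the $FF byte, which executes as rst $38&lt;br /&gt;
; if carry is reset, execution simply continues here&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The other conditions work the same way by swapping the opcode byte: $20 for nz, $28 for z, $30 for nc.&lt;br /&gt;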
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
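The DAA instruction is normally used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit (&amp;quot;0&amp;quot; to &amp;quot;9&amp;quot;, &amp;quot;A&amp;quot; to &amp;quot;F&amp;quot;):&lt;br /&gt;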
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
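	cp 10		; carry set if a is 0 to 9&lt;br /&gt;
	ccf		; carry set if a is 10 to 15&lt;br /&gt;
	adc a, 30h	; a = a + $30, plus 1 more for 10 to 15&lt;br /&gt;
	daa		; decimal adjust adds 6 for 10 to 15, giving $41 to $46&lt;br /&gt;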
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T20:33:06Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Added note about &amp;quot;ld a,i&amp;quot; flags unreliable in NMOS Z80 (thanks to Alcoholics Anonymous for the info)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need extra speed from your ASM code, or you need to make your game smaller so it fits on the calculator. Typical cases are programs heavy on graphics or data, and graphics code such as tile mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to shave off a few bytes, go straight to the Small Tricks section of this page.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally use each register for the job it suits best.&lt;br /&gt;
It is also good practice to settle on a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliaries to the accumulator and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it only when there is no other option or you need optimal execution; TI-OS uses iy, so disable interrupts first and restore the original value on exit)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at 8-bit operations.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
With the ix or iy registers, operations are slower still and every instruction is 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine that:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- you can still multiply a by 4 before it overflows (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now in code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do the same with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. Here you are trading size for speed: each sla l is 2 bytes and 8 clocks, while add hl,hl is only 1 byte but takes 11 clocks.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A common trick for multiplying by 256 is simply to load the low byte register into the high byte register and clear the low byte. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros shifted in. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of keeping a counter in memory or pushing/popping an already used register inside the loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
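&lt;br /&gt;
A sketch of an index-register half used as a loop counter. The ixl mnemonic is an assumption: these instructions are undocumented and assemblers name them differently, so the raw opcode bytes are given in the comments. The rowLoop label is illustrative only:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld ixl, 8      ; DD 2E 08: counter = 8&lt;br /&gt;
rowLoop:&lt;br /&gt;
; loop body goes here, free to use a, bc, de and hl&lt;br /&gt;
 dec ixl        ; DD 2D: 2 bytes, 8 clocks, sets z when the counter reaches 0&lt;br /&gt;
 jr nz, rowLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The dec ixl / jr nz pair costs 20 clocks per iteration, against 34 for push bc and pop bc around the body plus djnz.&lt;br /&gt;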
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl), and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common in arithmetic on big integers (16-bit to 32-bit) and in BCD operations, where they avoid relying on RAM storage or pushing and popping on the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow one or vice versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where that is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) on return instead of leaving it up to the caller. This is relatively easy to do (adding 4 bytes and 29/33 T-states to the routine), although the method is only reliable on CMOS Z80 CPUs (NMOS Z80 CPUs have an issue described at the bottom left of page 3-130 [http://www.z80.info/zip/ZilogProductSpecsDatabook129-143.pdf here]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick: it copies IFF2 into P/V, so P/V is set if and only if interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way you cannot push or pop registers, and no calls are allowed.&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with a fixed storage location (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and at some point you start saving space as well (once more than 5 pops, that is more than 10 bytes of stack, would be needed).&lt;br /&gt;
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most heavily used code: subroutines and big loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look-up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task beats a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of your code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rare special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search Wikipedia for alternative algorithms, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations only, because of all the work needed).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is already very fast and often smaller than higher-level languages, further optimization may not be needed at all.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code runs from RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful here.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
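 jr $00      ; $00 is a placeholder displacement; the patched value is the number of rrca instructions to skip (0 to 8)&lt;br /&gt;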
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to patch the displacement of load instructions that use (ix+0), changing the 0 to other values to read or write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
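&lt;br /&gt;
A sketch of the idea, assuming ix already points at the start of a list of bytes and a holds an element index in 0..127 (the elemOffset name and the list layout are hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemOffset), a   ; patch the displacement byte of the load below&lt;br /&gt;
;... code ...&lt;br /&gt;
elemOffset = $+2      ; ld a,(ix+d) assembles to DD 7E d, so d is the third byte&lt;br /&gt;
 ld a, (ix+0)         ; 0 is just a placeholder; a = element number d of the list&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
As with the other examples in this section, this only works when the code runs from RAM.&lt;br /&gt;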
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of them, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how they work after a while away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have exceptions to their use; the comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want to DE hold pushed BC (no need for a copy of DE in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but the resulting carry flag is inverted compared to sbc)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases, and de is left untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra          ;bit 0 goes into carry&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by call. Example: some kind of inline vputs that reads its string from the return address.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with the value (first xor second), here 1 xor 2 = 3, makes a register alternate between two values.&lt;br /&gt;
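&lt;br /&gt;
The same idea works for any pair of values, not just 1 and 2. As a minimal sketch, assuming a hypothetical game that flips a tile number between 5 and 9 (values chosen only for illustration), the mask is 5 xor 9 = 12:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; toggle (hl) between 5 and 9 (illustrative values)&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5, because 12 = 5 xor 9&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;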
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld (addr+1),a   ; self-modifying code (RAM only): patch the low byte of the address below&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Size versus speed is the classic trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (beyond -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump occurs (jr takes 12), and jr is faster when it doesn't (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements BC (P/V is reset when BC reaches 0), but also increments HL and compares A with (HL)&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
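When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;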
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (no ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
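&lt;br /&gt;
As a rough sketch of that extension, the routine below reads two extra inline bytes (column, then row) before the string. It assumes the penCol and penRow equates from ti83plus.inc, with penRow stored directly after penCol so that a single 16-bit store sets both; DispAt is a name made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                 ; column 10, row 20 (illustrative values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                    ; hl points to the inline data after the call&lt;br /&gt;
  ld e,(hl)                 ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                 ; d = row&lt;br /&gt;
  inc hl                    ; hl points to the start of the string&lt;br /&gt;
  ld (penCol),de            ; assumes penRow = penCol+1&lt;br /&gt;
  bcall(_vputs)             ; leaves hl just past the terminating 0&lt;br /&gt;
  jp (hl)                   ; resume execution after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;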
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nop's:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states including the ld b (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
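&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming interrupts fire at their usual rate and that b is free; the count of 20 is an arbitrary illustrative value:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ei		; halt never wakes up if interrupts are disabled&lt;br /&gt;
	ld b,20		; number of interrupts to wait for (illustrative)&lt;br /&gt;
wait:&lt;br /&gt;
	halt		; sleep until the next interrupt&lt;br /&gt;
	djnz wait&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;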
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat the body of a loop several times in a row instead of looping; this is frequently used in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size to speed, so unroll only where it counts.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : 16 + 10/n T-states per byte copied -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and, using b as the counter and jp nz as the loop test, with outi.&lt;br /&gt;
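&lt;br /&gt;
One possible form of the &amp;quot;small trick&amp;quot; for sizes that are not a multiple of the unroll count is sketched below. It assumes the size is a non-zero constant known at assembly time and that the assembler accepts % as the modulo operator; the names size and skip are made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
size .equ 20                     ; illustrative: 20 = 2*8 + 4&lt;br /&gt;
skip .equ (8 - (size % 8)) % 8   ; number of ldi to skip on the first pass&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 jp loopldi + (2 * skip)         ; every ldi is 2 bytes long&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi                  ; P/V stays set until bc reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
With size = 20 the first pass executes only the last 4 ldi, which leaves bc = 16 for two full passes.&lt;br /&gt;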
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a few more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  de, ...      ; 16-bit loop count (non-zero)&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b is zero on entry, djnz treats it as 256 iterations.&lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the two 8-bit loop counters from a 16-bit count if you want to do the conversion at assembly time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) ((((counter16) - 1) &amp;amp; 0xff) + 1)&lt;br /&gt;
  #define outer_counter8(counter16) ((((counter16) - 1) &amp;gt;&amp;gt; 8) + 1)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
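&lt;br /&gt;
As a worked example with an illustrative count of 1000: inner_counter8(1000) = (999 &amp;amp; 0xff) + 1 = 232 and outer_counter8(1000) = (999 &amp;gt;&amp;gt; 8) + 1 = 4, which gives 232 + 3 * 256 = 1000 iterations. With the counters precomputed this way, the set-up instructions can be replaced by two loads (a sketch, assuming the loop body preserves b and d):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld   b, 232      ; inner_counter8(1000)&lt;br /&gt;
  ld   d, 4        ; outer_counter8(1000)&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2       ; first pass runs 232 times, later passes 256 times&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp   nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;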
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair (destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set/reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry and Sign flags, a is preserved)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Zero, Sign and P/V flags untouched.&lt;br /&gt;
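&lt;br /&gt;
For instance, a routine that has to return a particular Zero/Carry combination could chain two of the tricks above, carry last (a small sketch, not the only possible sequence):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Zero and Carry together, a is preserved&lt;br /&gt;
 cp a    ; Zero set, Carry reset&lt;br /&gt;
 scf     ; Carry set, Zero still set&lt;br /&gt;
&lt;br /&gt;
; reset Zero and set Carry (alters a)&lt;br /&gt;
 or 1    ; Zero reset, Carry reset&lt;br /&gt;
 scf     ; Carry set, Zero still reset&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;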
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you don't forget a useful instruction or the flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE for z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address in memory such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), use inc l instead of inc hl to get to the next byte (2 clocks faster).&lt;br /&gt;
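&lt;br /&gt;
A small sketch of this, assuming a hypothetical 256-byte buffer whose address has a low byte of $00. Because inc l also updates the Zero flag, it doubles as the loop test:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; clear a 256-byte aligned buffer (buffer is a made-up label)&lt;br /&gt;
 ld hl,buffer   ; l = 0 because of the alignment&lt;br /&gt;
 xor a&lt;br /&gt;
clearloop:&lt;br /&gt;
 ld (hl),a&lt;br /&gt;
 inc l          ; 4 clocks, sets Zero when l wraps from $FF to $00&lt;br /&gt;
 jr nz,clearloop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;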
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if branch now takes 31 T-states instead of 33. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1. This will cause a conditional jump to the displacement byte ($FF) which is the rst $38 opcode. &lt;br /&gt;
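&lt;br /&gt;
Spelled out as bytes (a sketch using nz as the example condition; other conditions only change the first byte):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; instead of&lt;br /&gt;
 call nz,$0038   ; 3 bytes, same effect as a conditional rst $38&lt;br /&gt;
; try this&lt;br /&gt;
 .db $20,$FF     ; jr nz,-1 : 2 bytes, lands on the $FF byte, which executes as rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;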
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit (0 to 9, A to F):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
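	cp 10		; carry set if a &amp;lt; 10&lt;br /&gt;
	ccf		; carry set if a &amp;gt;= 10&lt;br /&gt;
	adc a, 30h	; a = a + $30 (+1 for 10..15)&lt;br /&gt;
	daa		; 0..9 stay $30..$39, 10..15 become $41..$46&lt;br /&gt;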
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;br /&gt;
* Alvin (Alcoholics Anonymous)&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T18:15:27Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;Look up Table&amp;quot; example again, this time for table alignment&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller to fit on the calculator. Examples: programs that consume a lot of graphics/data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the z80 use registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or you need optimal execution) (disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor performs operations on 8-bit values faster.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- if the only valid values of the accumulator range from 0 to 63 and the routine needs to multiply the accumulator by, say, 12, the result has to be stored in a 16-bit register pair (63*12 = 756)&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to pass A (the accumulator) to HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed compared to &amp;quot; ld h,0 \ add hl,hl \ add hl,hl &amp;quot;: each sla l is 2 bytes and 8 T-states, while add hl,hl is only 1 byte but takes 11 T-states.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling in zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
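&lt;br /&gt;
A minimal sketch of such a loop counter, assuming an assembler that accepts the undocumented ixl mnemonics (some only take the raw bytes, shown in the comments) and a loop body that leaves ix alone:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8	; $DD,$2E,8 : loop 8 times (illustrative count)&lt;br /&gt;
loop:&lt;br /&gt;
	; loop body here, free to use a, bc, de and hl&lt;br /&gt;
	dec ixl		; $DD,$2D : sets Zero like dec l&lt;br /&gt;
	jr nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;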
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and cannot either refactor your algorithm(s) or rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
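; IN:  B'C'BC = MULTIPLIER (MPR), D'E'DE = MULTIPLICAND (MPD)&lt;br /&gt;
; OUT: H'L'HL = LOW 32 BITS OF THE PRODUCT, INTERRUPTS LEFT DISABLED&lt;br /&gt;
MUL32:&lt;br /&gt;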
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to assign from a standard register to a shadow one or vice-versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately, this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt routine loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way, you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled (an interrupt pushes its return address at sp).&lt;br /&gt;
Mind again that this is only used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since every allocated &amp;quot;variable&amp;quot; can then be reached without inc/dec, but this is not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple loads and stores, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states (read-modify-write)&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction, not accepted by every assembler&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7 (bit takes only 20 T-states)&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest or fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and up to 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and beyond 10 bytes you start saving space as well (5 pops already take as many bytes as this sequence).&lt;br /&gt;
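&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with two stack-allocated bytes could look like the following sketch. The label routine and the use of ix as a frame pointer are illustrative assumptions, not part of the examples above:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
routine:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -2&lt;br /&gt;
 add ix, sp       ; ix = sp - 2&lt;br /&gt;
 ld sp, ix        ; allocate 2 bytes, ix points at them&lt;br /&gt;
 ld (ix + 0), 5   ; first temporary = 5&lt;br /&gt;
 ld (ix + 1), a   ; second temporary = a&lt;br /&gt;
;... code using (ix + 0) and (ix + 1); balanced pushes and calls are still allowed ...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp           ; free the 2 bytes without touching any other register&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;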
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. Sometimes using a chunk of data for a task is also better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rare special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems like it won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work needed).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early brings headaches with crashes and debugging. And because ASM is very fast and often smaller than code from higher-level languages, further optimization may not be needed at all.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
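&lt;br /&gt;
As an illustration of moving work out of the main loop, the sketch below loads an invariant value once instead of on every iteration. The names mask, loop1 and loop2 are hypothetical, and the second version assumes register c is free:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of reloading the same value on every iteration&lt;br /&gt;
 ld b,8&lt;br /&gt;
loop1:&lt;br /&gt;
 ld a,(mask)    ; 13 T-states per iteration&lt;br /&gt;
 and (hl)&lt;br /&gt;
 ld (hl),a&lt;br /&gt;
 inc hl&lt;br /&gt;
 djnz loop1&lt;br /&gt;
;load it once before the loop&lt;br /&gt;
 ld a,(mask)&lt;br /&gt;
 ld c,a         ; keep the invariant value in c&lt;br /&gt;
 ld b,8&lt;br /&gt;
loop2:&lt;br /&gt;
 ld a,c         ; 4 T-states per iteration&lt;br /&gt;
 and (hl)&lt;br /&gt;
 ld (hl),a&lt;br /&gt;
 inc hl&lt;br /&gt;
 djnz loop2&lt;br /&gt;
; costs 2 bytes and register c, saves 9 T-states per iteration&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;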
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code runs from RAM, it can write to itself to change its own instructions. An instruction set reference that lists the opcodes is useful here.&lt;br /&gt;
Although any instruction can be modified, self-modifying code is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
It is generally used to save a value for later use (often a mask). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2   ; placeholder offset of 0, overwritten with the number of rrca to skip&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify the offset of a load instruction that uses (ix+0), changing the 0 to another value to read or write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
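&lt;br /&gt;
A minimal sketch of that idea, assuming the code runs from RAM, ix points to the start of the list and a holds an index in 0..127 (the offset byte is signed). The label elemoffset is hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemoffset),a   ; patch the offset byte of the load below&lt;br /&gt;
;... code ...&lt;br /&gt;
elemoffset = $+2     ; (ix+n) instructions store n in their third byte&lt;br /&gt;
 ld a,(ix+0)         ; 0 is just a placeholder: reads the element selected earlier&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;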
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: if you really want the fastest and smallest code, first optimize your algorithm and register allocation before applying any of them.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how part of the code works after not reading it for a while.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and come with exceptions to their use; the comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need to keep a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a, and only the zero flag matches cp 1&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but the zero flag is no longer set by the result)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works for many other constants&lt;br /&gt;
; -&amp;gt; save 3 T-states and preserve de&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra           ;bit 0 goes to carry (destroys a)&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra           ;bit 0 goes to carry (destroys a)&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla           ;bit 7 goes to carry (destroys a)&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100      ;a is no longer preserved when returning&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte and up to 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
A call immediately followed by a ret can almost always be replaced by a jump:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xoring a register with (value1 xor value2) makes it alternate between the two values.&lt;br /&gt;
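&lt;br /&gt;
The same idea works for any pair of values, not just 1 and 2. As a worked example with the arbitrarily chosen values 5 and 9, the constant to use is 5 xor 9 = 12:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; 5 xor 12 = 9 and 9 xor 12 = 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;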
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127) &lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use an aligned table (see section &amp;quot;Table Alignment&amp;quot; below), this code can be optimized even further:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Using 256-byte table alignment&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127) &lt;br /&gt;
 ld (addr+1),a&lt;br /&gt;
addr:&lt;br /&gt;
 ld hl,(VectorTable)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Size versus speed is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give users a smaller program that doesn't take up most of their RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) but a jp to it lies within reach, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp is faster when the jump is taken (10 T-states against 12) and jr is faster when it is not (7 T-states against 10).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements BC and increments HL; P/V is reset when BC reaches 0&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, pc + 3 (the address right after the call) is pushed on the stack. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (no ld hl), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
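&lt;br /&gt;
A possible sketch of such an extension is shown below. The routine name DispAt is hypothetical, and the code assumes that penCol is immediately followed by penRow in memory (as in the usual TI-83 Plus include file), that de may be destroyed, and that _vputs leaves hl just past the terminating zero as in the example above:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                     ; column, row&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl             ; hl points to the inline data&lt;br /&gt;
  ld e,(hl)          ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)          ; d = row&lt;br /&gt;
  inc hl             ; hl now points to the string&lt;br /&gt;
  ld (penCol),de     ; e goes to penCol, d to penRow&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)            ; return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;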
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
Sometimes you need a short delay between operations, such as reads and writes to ports, '''''and there is nothing useful to do'''''. Because nops are not very size-friendly, consider other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nops for the same time:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use.&lt;br /&gt;
* Beware that not all instructions preserve registers or flags.&lt;br /&gt;
* For delays between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
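&lt;br /&gt;
For instance, a delay of a few interrupt periods can be sketched as follows (the label waitloop and the count of 4 are arbitrary choices):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei             ; halt would wait forever with interrupts disabled&lt;br /&gt;
 ld b,4         ; number of interrupts to wait for&lt;br /&gt;
waitloop:&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz waitloop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;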
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This trades memory for speed. Most of the time size matters more than speed, so unroll only where the gain is worth it.&lt;br /&gt;
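&lt;br /&gt;
As a sketch of what this looks like for multiplication, here is a common 8-bit by 8-bit shift-and-add routine (h * e into hl), first as a loop and then unrolled. The labels are illustrative, and the T-state estimate only counts the removed loop counter handling:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; looped: 11 bytes, uses b as counter&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 ld b,8&lt;br /&gt;
mulloop:&lt;br /&gt;
 add hl,hl      ; shift the result and the next multiplier bit&lt;br /&gt;
 jr nc,mulskip&lt;br /&gt;
 add hl,de      ; bit was set: add the multiplicand&lt;br /&gt;
mulskip:&lt;br /&gt;
 djnz mulloop&lt;br /&gt;
&lt;br /&gt;
; unrolled: 35 bytes, b stays free and about 100 T-states are saved&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 add hl,hl      ; bit 7&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 6&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 5&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 4&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 3&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 2&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 1&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; bit 0&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;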
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : 16 + 10 / n T-states per byte copied with n unrolled ldi -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd, and with outi (which counts in b and sets the zero flag instead, so loop with jp nz).&lt;br /&gt;
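&lt;br /&gt;
A sketch of the outi variant, assuming the byte count fits in b and is a multiple of the unroll factor (src, port and count are placeholder names):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld c,port&lt;br /&gt;
 ld b,count     ; must be a multiple of 4 here&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi           ; writes (hl) to port c, increments hl, decrements b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz,loopouti ; outi sets the zero flag when b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;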
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de          ; de holds the 16-bit count on entry&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the 8-bit loop counter values from a 16-bit counter if you want to do the conversion at assembly time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
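&lt;br /&gt;
As a worked example, take a constant count of 1000. Since 1000 - 1 = 999 = $03E7, the inner counter is $E7 + 1 = 232 and the outer counter is $03 + 1 = 4, which runs the body 232 + 3 * 256 = 1000 times. This sketch assumes an assembler that supports such #define macros with arguments; the label loop3 is arbitrary:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; 4&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop3&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;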
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register (destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
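&lt;br /&gt;
For example, a routine can report success or failure through the carry flag alone. In this sketch the routine isdigit and the label notdigit are hypothetical; carry is returned set when a does not hold an ASCII digit, and a is preserved:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;caller&lt;br /&gt;
 call isdigit&lt;br /&gt;
 jr c,notdigit  ; taken when a is not a digit&lt;br /&gt;
;...&lt;br /&gt;
&lt;br /&gt;
;routine&lt;br /&gt;
isdigit:&lt;br /&gt;
 cp '0'&lt;br /&gt;
 ret c          ; a is below '0': carry is already set&lt;br /&gt;
 cp '9'+1       ; carry set if a is '9' or below&lt;br /&gt;
 ccf            ; invert it: carry set if a is above '9'&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;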
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a, inverts carry).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE for z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the high byte of the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), get the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
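&lt;br /&gt;
A short sketch of this, assuming a hypothetical 16-byte table at the aligned label alignedTable (low address byte $00) and a hypothetical loop label sumLoop:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; sum the first 16 entries of an aligned table into a (8-bit sum, overflow ignored)&lt;br /&gt;
 ld hl, alignedTable   ; assumed to sit on a 256-byte boundary&lt;br /&gt;
 ld b, 16&lt;br /&gt;
 xor a&lt;br /&gt;
sumLoop:&lt;br /&gt;
 add a, (hl)&lt;br /&gt;
 inc l                 ; 4 clocks instead of 6 for inc hl; l never wraps past $FF here&lt;br /&gt;
 djnz sumLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;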
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot; hacks and obscure optimization tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them confuse disassemblers and even coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the else code is a single 2-byte instruction:&lt;br /&gt;
Instead of the &amp;quot;jr endif&amp;quot; line, you emit the first byte of a 3-byte instruction with no side effects!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where XXXX is the two bytes of the next instruction. You already know what the flags will be here (ld does not change them, so Zero is still set), so the jump is never taken and simply skips the next two bytes of execution. Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A jp nz that is not taken costs 10 T-states instead of 12 for jr, and the single .db byte replaces a 2-byte jr, saving 1 byte. A small saving, but it could be useful in tight loops!&lt;br /&gt;
The trick also works for 1-byte instructions (using the first byte of a 2-byte instruction); the only reasons not to use it are code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc with a displacement of -1 (jr cc,$+1 in most assemblers). This causes a conditional jump to the displacement byte itself ($FF), which is the rst $38 opcode.&lt;br /&gt;
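&lt;br /&gt;
A sketch for the z condition, with the bytes written out so it does not depend on how a particular assembler evaluates $ (skipRst is a hypothetical label):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; usual form: 3 bytes&lt;br /&gt;
 jr nz, skipRst&lt;br /&gt;
 rst $38&lt;br /&gt;
skipRst:&lt;br /&gt;
&lt;br /&gt;
; trick: 2 bytes&lt;br /&gt;
 .db $28, $FF   ; jr z with displacement -1; when taken, the $FF byte runs as rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;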
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T18:06:50Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Moved &amp;quot;table alignment&amp;quot; to its own section, and removed incorrect remark &amp;quot;hardly practical&amp;quot; (table alignment is very practical, it's used all the time in Z80 graphic engines where performance is critical!)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller to fit on the calculator. Examples: graphics- or data-heavy programs, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the z80 use registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use when there is no other option or you need optimal execution; disable interrupts first and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and needs to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is at least 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply it by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to pass A (the accumulator) to HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved some bytes, too.&lt;br /&gt;
You can do this for registers other than the accumulator A.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, while add hl,hl is only 1 byte.&lt;br /&gt;
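&lt;br /&gt;
Written out, with T-states taken from the standard z80 timings, the trade-off looks roughly like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; l = value in 0..63, shifts done in 8 bits&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4, still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
; 6 bytes, 23 clocks&lt;br /&gt;
&lt;br /&gt;
; same result with 16-bit adds&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
; 4 bytes, 29 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;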
&lt;br /&gt;
Mind that these optimizations can produce bugs and code that is somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to assign from a standard register to a shadow one or vice-versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they are originally intended for use in an Interrupt Service Routine. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it well to ease understanding and debugging later.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally during an interrupt loop that demands as much speed as possible while all the normal registers are in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers and no calls are allowed; interrupts must also stay disabled, because an interrupt pushes its return address at sp.&lt;br /&gt;
Mind again that this is only used as a last resort. Don't forget to save and restore sp like the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to a fixed RAM location for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is to use an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with a fixed storage location (and generally trash a register in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 inc (ix + n)     ; 23 T-states&lt;br /&gt;
 dec (ix + n)     ; 23 T-states&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)     ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7 (bit takes 20 T-states)&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry: wastes 1 byte and 2 T-states compared to pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, index, or de that you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pops).&lt;br /&gt;
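&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with 4 bytes of local storage could be sketched as follows (localsDemo and the stored values are hypothetical, and ix is assumed to be free for use as a frame pointer):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
localsDemo:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix        ; 4 bytes allocated, at (ix+0) to (ix+3)&lt;br /&gt;
&lt;br /&gt;
 ld (ix+0), 10    ; initialize a local&lt;br /&gt;
 inc (ix+0)       ; modify it in place, no other register needed&lt;br /&gt;
 ld a, (ix+0)     ; a = 11&lt;br /&gt;
&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; drop the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;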
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most used code in subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look up tables) give smaller and faster code than blocks of comparisons and jumps. Other times, using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
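&lt;br /&gt;
As an illustration of a vector table, a dispatcher for a small command number might look like this sketch (jumpTable and the handler labels are hypothetical, and a is assumed to hold 0 to 3):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 add a, a            ; each table entry is 2 bytes&lt;br /&gt;
 ld e, a&lt;br /&gt;
 ld d, 0&lt;br /&gt;
 ld hl, jumpTable&lt;br /&gt;
 add hl, de          ; hl points to the entry&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h, (hl)&lt;br /&gt;
 ld l, a             ; hl = handler address&lt;br /&gt;
 jp (hl)&lt;br /&gt;
jumpTable:&lt;br /&gt;
 .dw handler0&lt;br /&gt;
 .dw handler1&lt;br /&gt;
 .dw handler2&lt;br /&gt;
 .dw handler3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;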
&lt;br /&gt;
Look through a complete instruction set for an instruction that can optimize something in the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the more special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow&lt;br /&gt;
* When you can afford a bigger overhead to get code out of the main loop, do it.&lt;br /&gt;
* When your code seems like it won't be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations because of all the work needed to write it)&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations to understand the code later (z80 optimization leads to code that is very hard to understand)&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value to be used later (usually seen in masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2     ; displacement byte (0 here) is a placeholder, overwritten above&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC trick is to modify load instructions that use (ix+0), changing the 0 to other values to quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
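&lt;br /&gt;
A sketch of this, assuming ix already points to the start of a list of bytes and a holds an index from 0 to 127 (elemIndex is a hypothetical label):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemIndex),a&lt;br /&gt;
;... code ...&lt;br /&gt;
elemIndex = $+2&lt;br /&gt;
 ld b,(ix+0)  ; the displacement is the third byte of the instruction; 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;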
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of them if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget them after a while away from those parts of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the available instructions on the z80. Keeping an instruction set summary at hand helps you visualize what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a! (and, unlike cp, leaves carry untouched)&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negation of de&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states!&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
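  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states, but a is rotated, so only use this when a is not needed afterwards&lt;br /&gt;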
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
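  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states, but a is rotated, so only use this when a is not needed afterwards&lt;br /&gt;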
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
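  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states (a is 0 when falling through, but a is no longer preserved when returning)&lt;br /&gt;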
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call (for example to read inline data, as some inline vputs routines do).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with the constant (value1 xor value2) makes a register alternate between the two values; here 1 xor 2 = 3.&lt;br /&gt;
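&lt;br /&gt;
The same idea can be sketched for any pair of values. The values 5 and 9 below are only an illustration; the constant is simply the xor of the two:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; toggle (hl) between 5 and 9 (hypothetical values)&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; 12 = 5 xor 9, so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the new value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;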
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Size versus speed is the classic trade-off of optimization in computer programming, and Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters: ASM is generally fast enough, and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to the absolute one. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp is faster when the jump is taken (10 T-states versus 12 for jr) and jr is faster when it is not taken (7 T-states versus 10 for jp).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, pc + 3 (the address right after the call) is pushed onto the stack. You can pop it and use it as a pointer to data; a very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (ld hl,string is no longer needed), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
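&lt;br /&gt;
As a rough sketch of that expansion: DispAt is a made-up name, and the code assumes the ti83plus.inc layout in which penRow immediately follows penCol in memory, and that _vputs leaves hl just past the terminating zero, as the Disp routine above already relies on.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                ; penCol, penRow (example coordinates)&lt;br /&gt;
  .db &amp;quot;Some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                   ; hl = address of the inline data&lt;br /&gt;
  ld e,(hl)                ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                ; d = row&lt;br /&gt;
  inc hl                   ; hl = start of the string&lt;br /&gt;
  ld (penCol),de           ; assumed layout: sets penCol and penRow in one store&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)                  ; return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;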
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that waste less space than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable, includes the ld b)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
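&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming that waiting for ten interrupts is long enough (the count and the label name are arbitrary):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei             ; halt never wakes up if interrupts are disabled&lt;br /&gt;
 ld b,10        ; number of interrupts to wait for&lt;br /&gt;
waitLoop:&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz waitLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;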
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently done in math routines such as multiplication.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size to speed, so use it sparingly.&lt;br /&gt;
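&lt;br /&gt;
As an illustration, here is a common 8-bit by 8-bit multiply (hl = h * e) written as a loop, followed by the start of its unrolled form. It is a sketch rather than a routine taken from a particular library, and the label names are made up:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; looped: hl = h * e&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 ld b,8&lt;br /&gt;
mulLoop:&lt;br /&gt;
 add hl,hl      ; shift the next multiplier bit into carry&lt;br /&gt;
 jr nc,mulSkip&lt;br /&gt;
 add hl,de      ; bit was set: add the multiplicand&lt;br /&gt;
mulSkip:&lt;br /&gt;
 djnz mulLoop&lt;br /&gt;
&lt;br /&gt;
; unrolled: no counter in b and no djnz on any step&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
; ... repeat the last three lines until there are 8 copies in total&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;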
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied, where n is the number of unrolled ldi -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi counts in b and sets the zero flag when b reaches 0, so loop with jp nz instead of jp pe).&lt;br /&gt;
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de          ; de = 16-bit loop count (0 means 65536)&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the 8-bit loop counters from a 16-bit counter if you want to do the conversion at assembly time (an inner counter of 256 has to be loaded into b as 0):&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
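&lt;br /&gt;
For instance, a loop that must run 600 times works out as follows (600 and the label name are just example values):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; inner_counter8(600) = ((600 - 1) &amp;amp; 0xff) + 1 = 88&lt;br /&gt;
; outer_counter8(600) = ((600 - 1) &amp;gt;&amp;gt; 8) + 1 = 3&lt;br /&gt;
  ld  b, 88&lt;br /&gt;
  ld  d, 3&lt;br /&gt;
loop3:&lt;br /&gt;
  ; loop body here, runs 88 + 256 + 256 = 600 times&lt;br /&gt;
  djnz loop3&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;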
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair (de is preserved, a is destroyed)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions do not affect the Zero, Sign or Parity/Overflow flags.&lt;br /&gt;
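&lt;br /&gt;
For example, a routine could report two independent status bits by combining two of the tricks above and touching the carry last (a small sketch; what each flag means is up to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; return with Zero set and Carry set&lt;br /&gt;
 cp a           ; Zero set, Carry reset&lt;br /&gt;
 scf            ; Carry set, Zero untouched&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; return with Zero reset and Carry set (alters a)&lt;br /&gt;
 or 1           ; Zero reset, Carry reset&lt;br /&gt;
 scf            ; Carry set, Zero untouched&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;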
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a, leaves carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check the following:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has an &amp;quot;.echo&amp;quot; directive and use it in the source to count size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator such as wabbitemu ([[:Category:Emulators|Emulators]])&lt;br /&gt;
&lt;br /&gt;
== Table alignment ==&lt;br /&gt;
&lt;br /&gt;
=== Indexing aligned tables ===&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Incrementing within aligned tables ===&lt;br /&gt;
&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (2 clocks faster).&lt;br /&gt;
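&lt;br /&gt;
A short sketch of such a loop; the buffer address $8000, the length of 16 bytes and the label name are only placeholders:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,$8000    ; 256-byte aligned buffer (hypothetical address)&lt;br /&gt;
 ld b,16        ; number of bytes to clear&lt;br /&gt;
fillLoop:&lt;br /&gt;
 ld (hl),0      ; clear one byte&lt;br /&gt;
 inc l          ; 4 T-states instead of 6 for inc hl; h never changes inside the page&lt;br /&gt;
 djnz fillLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;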
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if path now ends with a jp nz that is never taken (10 T-states) instead of a jr (12 T-states). A small saving of 2 T-states, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc,$+1 (a displacement of -1). This causes a conditional jump to the displacement byte ($FF), which is also the rst $38 opcode.&lt;br /&gt;
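&lt;br /&gt;
Spelled out for the carry condition (the label skip is only for illustration, and $ is assumed to evaluate to the address of the current instruction, as in the jr $+2 delay earlier on this page):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 jr nc,skip&lt;br /&gt;
 rst $38&lt;br /&gt;
skip:&lt;br /&gt;
; 3 bytes&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 jr c,$+1       ; assembles to $38,$FF: the displacement byte $FF doubles as rst $38&lt;br /&gt;
; 2 bytes&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;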
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
The DAA instruction is normally used for BCD math, but it can also be used to convert a value from 0 to 15 in a to its ASCII hexadecimal digit ('0'-'9', 'A'-'F').&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
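&lt;br /&gt;
To see how the sequence behaves, here it is traced for a = 11, which comes out as the ASCII character 'B' (the input value is only an example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; a = $0B on entry&lt;br /&gt;
	cp 10		; a is not below 10, so carry is reset&lt;br /&gt;
	ccf		; carry is now set&lt;br /&gt;
	adc a, 30h	; a = $0B + $30 + 1 = $3C&lt;br /&gt;
	daa		; low nibble is above 9, so 6 is added: a = $42 = 'B'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;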
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:58:30Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;table alignment&amp;quot; example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so it fits on the calculator. Typical examples are programs that are heavy on graphics or data, and graphics code for tile mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut a few bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good Z80 algorithms generally use the registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep to a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or you need optimal execution; disable interrupts and restore the original value on exit because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and in that routine you need to multiply the accumulator by, say, 12; the result (up to 63*12 = 756) does not fit in 8 bits, so it has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which does not exceed 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is pass A (the accumulator) right in the start to HL&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved off a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes while add hl,hl is only 1 byte.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
It is best to apply them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters instead of keeping a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl), and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping on the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be a great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy a value from a standard register to a shadow register or vice versa. Instead you must use clumsy constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines and a handler that uses them would overwrite your values. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it thoroughly to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
The following instructions are valid but not widely known:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
While sp is used in this way you cannot push or pop registers, and calls are not allowed.&lt;br /&gt;
Keep in mind that this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is to use an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction, not supported by every assembler&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pop instructions).&lt;br /&gt;
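&lt;br /&gt;
As a minimal sketch of the three steps together (allocation, access, deallocation), assuming that ix is free to use and that hl may be destroyed, a routine could keep one temporary on the stack as follows. The routine name sumLocal is purely hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; hypothetical routine : returns a = b + c using one stack &amp;quot;variable&amp;quot;&lt;br /&gt;
; destroys hl and flags&lt;br /&gt;
sumLocal:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix        ; allocate 4 bytes : (ix + 0) to (ix + 3)&lt;br /&gt;
 ld (ix + 0), b   ; store b in the first local&lt;br /&gt;
 ld a, c&lt;br /&gt;
 add a, (ix + 0)  ; operate on the local directly&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; drop the 4 bytes&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Because the locals sit above sp, an interrupt that pushes onto the stack in the meantime will not overwrite them.&lt;br /&gt;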
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory usage is very important for writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code in subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or look up tables) often give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference to look for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check for special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (reserve this for desperate situations because of all the work needed to write it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so that you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a   ; a = number of rrca to skip (each rrca is 1 byte)&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2            ; displacement 0 is just a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to patch the displacement of load instructions that use (ix+0), changing the 0 to another value in order to read or write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
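&lt;br /&gt;
A minimal sketch of this idea, assuming that the code runs from RAM, that ix points to the start of the list and that the element index (0 to 127) is in c. The label elemOffset is hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,c&lt;br /&gt;
 ld (elemOffset),a   ; patch the displacement byte of the load below&lt;br /&gt;
;...code...&lt;br /&gt;
elemOffset = $+2     ; the displacement is the third byte of ld a,(ix+d)&lt;br /&gt;
 ld a,(ix+0)         ; 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Note that the displacement is a signed byte, so only offsets from -128 to 127 can be reached this way.&lt;br /&gt;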
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how a trick works after some time away from the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have exceptions to their use; the comments warn about them. Some tricks apply to other cases as well, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a! Only the zero flag matches cp 1 (carry is left untouched)&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;load the negated constant instead&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases and leaves de untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100     ;unlike bit, this changes a: a=%100 when returning, a=0 otherwise&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
A call immediately followed by a ret can almost always be replaced by a jump:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use its return address (for example to read inline data, like some kind of inline vputs).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127) &lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a suitable constant makes a register alternate between two values.&lt;br /&gt;
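&lt;br /&gt;
The constant to use is simply the xor of the two values (3 = 1 xor 2 in the board game case). As a small sketch with the hypothetical pair of values 5 and 9:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; 12 = 5 xor 9 (%0101 xor %1001 = %1100), so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;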
&lt;br /&gt;
==== Table alignment ====&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; With 256-byte table alignment&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 7 bytes, 31 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Without 256-byte table alignment, simpler version&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 52 clocks&lt;br /&gt;
&lt;br /&gt;
; Without 256-byte table alignment, optimized version&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 add a, sineTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a, sineTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a                          ; Add address of table to index &lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
; 11 bytes, 46 clocks&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump is taken (jr then takes 12 T-states), while jr is faster when the jump is not taken (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements BC, but also increments HL and changes flags&lt;br /&gt;
 ret po&lt;br /&gt;
; -&amp;gt; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, pc + 3 (the address right after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the 3-byte call replaces ld hl,string and the bcall), but 5 bytes of overhead (Disp routine, counting bcall as 3 bytes)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
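&lt;br /&gt;
A possible sketch of such an expansion is shown below. It assumes the TI-83 Plus memory layout, where penRow is the byte located right after penCol, so that both can be written with a single 16-bit store. The routine name DispAt and the coordinates are hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                 ;pen column, pen row (example values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                    ;hl points to the inline data&lt;br /&gt;
  ld e,(hl)                 ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                 ;d = row&lt;br /&gt;
  inc hl                    ;hl now points to the string&lt;br /&gt;
  ld (penCol),de            ;assumes penRow = penCol + 1&lt;br /&gt;
  bcall(_vputs)             ;leaves hl just past the terminating 0&lt;br /&gt;
  jp (hl)                   ;return to the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Like Disp, this relies on the display routine leaving hl just past the end of the string. It also destroys de.&lt;br /&gt;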
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nops are not very size friendly, consider other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable, including the ld b)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size over speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : 16 + 10 / n T-states per byte copied -&amp;gt; ~17 T-states per byte copied when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi decrements b and sets the zero flag when b reaches 0, so the loop test becomes jp nz).&lt;br /&gt;
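&lt;br /&gt;
One possible form of the small trick mentioned in the comment of the unrolled loop, for sizes that are not a multiple of 8, is to skip some of the ldi on the first pass. The sketch below assumes that the code runs from RAM (it is self modifying), that size is at least 1 and that a may be destroyed. The label firstPass is hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ld a,c&lt;br /&gt;
 neg&lt;br /&gt;
 and 7            ; a = (-size) mod 8 = number of ldi to skip on the first pass&lt;br /&gt;
 add a,a          ; each ldi is 2 bytes long&lt;br /&gt;
 ld (firstPass),a ; patch the displacement of the jr below&lt;br /&gt;
firstPass = $+1&lt;br /&gt;
 jr $+2           ; placeholder displacement, lands inside the ldi block&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi   ; loop until bc reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
For example, with size = 10 the first pass skips 6 ldi and copies 2 bytes, then one full pass copies the remaining 8. Only the low byte of the size matters for the entry point since 256 is a multiple of 8.&lt;br /&gt;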
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
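* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de       ; de = 16-bit counter (0 counts as 65536)&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b        ; b = inner counter = ((counter - 1) &amp;amp; $ff) + 1&lt;br /&gt;
  inc  d        ; d = outer counter = ((counter - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;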
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
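&lt;br /&gt;
As a worked example, for a constant count of 1000 iterations the macros give an inner counter of ((1000 - 1) &amp;amp; 0xff) + 1 = 232 and an outer counter of ((1000 - 1) &amp;gt;&amp;gt; 8) + 1 = 4, so the loop body runs 232 + 3 * 256 = 1000 times. A sketch, assuming an assembler that accepts the macros above (loop1000 is a hypothetical label):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; b = 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; d = 4&lt;br /&gt;
loop1000:&lt;br /&gt;
  ; loop body here (must preserve b and d)&lt;br /&gt;
&lt;br /&gt;
  djnz loop1000&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop1000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;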
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair (destroys a instead of de)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and parity/overflow flags untouched.&lt;br /&gt;
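&lt;br /&gt;
A sketch of a routine that reports its result through the carry flag (isDigit and notDigit are hypothetical labels):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = character&lt;br /&gt;
; out: carry reset if a is a decimal digit ('0'..'9'), carry set otherwise&lt;br /&gt;
; a is preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp '0'&lt;br /&gt;
 ret c         ; a is below '0' : carry is already set&lt;br /&gt;
 cp '9'+1      ; carry set if a is at most '9'&lt;br /&gt;
 ccf           ; invert it : carry set if a is above '9'&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; usage&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,notDigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;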
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a : a = 0 if carry was set, $FF otherwise).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to measure your optimizations or try new ones? Then the following will help:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has an &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
If a table of at most 256 bytes is placed at an aligned address such as $8000 (theoretical example), so that it spans $8000 to $80FF, only the low byte of the pointer changes while walking through it: use inc l instead of inc hl to get the next byte.&lt;br /&gt;
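&lt;br /&gt;
A sketch of what this buys, assuming a hypothetical 256-byte table called alignedTable whose address has a low byte of $00 (how to force that alignment depends on your assembler and memory layout). Indexing no longer needs a 16-bit addition, and stepping to the next entry costs 4 T-states instead of 6:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = index (0..255), out: a = alignedTable[index], destroys hl&lt;br /&gt;
 ld h,alignedTable &amp;gt;&amp;gt; 8   ; high byte of the table address&lt;br /&gt;
 ld l,a                     ; the low byte is the index itself&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc l                      ; next entry, wraps around inside the table&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;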
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them confuse disassemblers and even programmers trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
Instead of the &amp;quot;jr endif&amp;quot; line, you emit only the first byte of a 3-byte instruction with no side effects, which then swallows the Else instruction as its operand!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
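The if path now takes 31 T-states instead of 33, because jp cc costs 10 T-states whether or not the jump is taken while jr costs 12. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The same idea works when the Else code is a 1-byte instruction, using the first byte of a 2-byte instruction instead. The only reasons not to use this would be code readability and bug safety. Watch those flags!&lt;br /&gt;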
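&lt;br /&gt;
For a 1-byte Else instruction, a sketch of the same idea uses the opcode of cp n ($FE), a 2-byte instruction that only alters flags, to swallow the next byte. This assumes the flags are not needed after the block:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 or a&lt;br /&gt;
 jr nz,else&lt;br /&gt;
 inc b         ; the IF code&lt;br /&gt;
 .db $FE       ; cp n : the next byte (dec b) becomes its operand&lt;br /&gt;
else:&lt;br /&gt;
 dec b         ; the ELSE code, 1 byte&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;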
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc with a displacement of -1 (written jr cc,$+1 in assemblers that take a target address). If the condition is true, this jumps to its own displacement byte ($FF), which is the rst $38 opcode.&lt;br /&gt;
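&lt;br /&gt;
Spelled out as bytes, a sketch for the carry condition (the .db form avoids depending on how a given assembler writes relative jumps):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes instead of the 3 bytes of : jr nc,$+3 \ rst $38&lt;br /&gt;
 .db $38, $FF   ; $38 is the jr c opcode, $FF is a displacement of -1&lt;br /&gt;
                ; if carry is set, the jump lands on the $FF byte, which executes as rst $38&lt;br /&gt;
; if carry is reset, execution simply continues here&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;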
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
The DAA instruction is normally used for BCD math, but it can also be used to convert a value from 0 to 15 in a into its ASCII hexadecimal digit:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
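&lt;br /&gt;
Wrapped as a routine with two traced values, as a sketch (nibbleToHex is a hypothetical label; the input is assumed to be in the range 0 to 15):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = 0..15, out: a = ASCII hexadecimal digit&lt;br /&gt;
nibbleToHex:&lt;br /&gt;
	cp 10		; carry set if a is below 10 (a = 7: carry, a = 12: no carry)&lt;br /&gt;
	ccf		; carry set if a is 10 or more&lt;br /&gt;
	adc a, 30h	; a = 7: $37, a = 12: $0C + $30 + 1 = $3D&lt;br /&gt;
	daa		; a = 7: stays $37 (digit 7), a = 12: low nibble above 9 so 6 is added, $43 (letter C)&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;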
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:39:54Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Minor improvement (saved 1 clock, same number of bytes)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples are programs that use a lot of graphics or data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the Small Tricks section of this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally use the registers in an appropriate way.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or when you need optimal execution) (disable interrupts, and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and each instruction is 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine that:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply it by, say, 12, so the result (up to 756) has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now for the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte, too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes (8 T-states), while add hl,hl is only 1 byte (11 T-states).&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
Only proceed with this kind of optimization when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register and clear the low byte. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros entering from the right. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters, instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat common for doing arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or on pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to assign from a standard register to a shadow one or vice-versa. Instead you must use nasty constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they are originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick : it copies IFF2 into P/V, so P/V is set if and only if interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use. (It is notably used in James Montelongo's 4-level grayscale interlace in graylib2.inc.)&lt;br /&gt;
You need to know these valid but not generally known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers and no calls are allowed. Interrupts must also stay disabled, because an interrupt pushes its return address at sp.&lt;br /&gt;
Mind again that this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with a fixed storage location (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 8 to 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
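&lt;br /&gt;
Putting allocation, indexed access and cleanup together, a sketch of a routine with 4 bytes of local storage (withLocals is a hypothetical label, and the elided code is assumed to leave sp balanced):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; destroys hl, preserves ix&lt;br /&gt;
withLocals:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix,-4&lt;br /&gt;
 add ix,sp&lt;br /&gt;
 ld sp,ix         ; allocate 4 bytes : the locals are (ix+0) to (ix+3)&lt;br /&gt;
 ld (ix+0),a      ; store a in the first local&lt;br /&gt;
 ld (ix+1),0      ; clear the second local&lt;br /&gt;
;... code ...&lt;br /&gt;
 inc (ix+1)       ; operate on a local without going through a register&lt;br /&gt;
 add a,(ix+0)     ; use a local directly as an operand&lt;br /&gt;
;... code ...&lt;br /&gt;
 ld hl,4&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 ld sp,hl         ; drop the 4 bytes&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;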
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16bit register :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pop).&lt;br /&gt;
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
The use of registers and memory is very important in writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the code that runs most often: subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
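&lt;br /&gt;
As a small illustration of the table approach, here is a sketch of a dispatch through a table of addresses. The labels dispatch, jumpTable and cmd0 to cmd3 are hypothetical, the routines are assumed to be defined elsewhere, and no range check is done on a. Unlike a chain of cp / jr z, the cost does not grow with the number of cases:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in: a = command number (0..3), destroys a, de, hl&lt;br /&gt;
dispatch:&lt;br /&gt;
 add a,a          ; each table entry is 2 bytes&lt;br /&gt;
 ld e,a&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld hl,jumpTable&lt;br /&gt;
 add hl,de        ; hl = address of the entry&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a           ; hl = address of the routine (stored low byte first)&lt;br /&gt;
 jp (hl)&lt;br /&gt;
jumpTable:&lt;br /&gt;
 .dw cmd0, cmd1, cmd2, cmd3&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;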
&lt;br /&gt;
Look through a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check for the special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When possible, accept a bigger setup overhead if it lets you move code out of the main loop.&lt;br /&gt;
* When your code seems like it won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations, because of all the work needed to write it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value to be used later (usually seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used together with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a   ; a = number of bytes (here, of rrca instructions) to skip&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2            ; displacement 0 is just a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC technique is to modify the displacement of load instructions that use (ix+0): changing the 0 to other values lets you very quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
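&lt;br /&gt;
A sketch of that technique, assuming the code runs from RAM, ix already points to the start of a hypothetical list, and a holds an index from 0 to 127 (the displacement is a signed byte). In ld a,(ix+d) the displacement is the third byte of the instruction, hence the $+2 (listIndex is a hypothetical label):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (listIndex),a   ; patch the displacement byte of the load below&lt;br /&gt;
;... code ...&lt;br /&gt;
listIndex = $+2&lt;br /&gt;
 ld a,(ix+0)        ; 0 is just a placeholder : executes as ld a,(ix+index)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;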
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest speed and the smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. After some time away from a piece of code you can easily forget how it works.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every additional ld (dataX),a once the initial overhead is passed (but ldir is slower)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states ! (add hl,de leaves the zero flag alone and its carry is the inverse of the sbc one)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if xxxx does not use the return address pushed by the call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor with a constant equal to (value1 xor value2) makes a register alternate between the two values.&lt;br /&gt;
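&lt;br /&gt;
The same idea should carry over to any pair of values. A minimal sketch, assuming a hypothetical variable at (hl) that alternates between 5 and 9 (these two values are only an illustration):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; 12 = 5 xor 9, so a=9 or 5&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;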
&lt;br /&gt;
==== Table alignment ====&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 ld d, 0                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Size versus speed is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When a relative jump is out of reach (outside -128 to 127 bytes) but there is a jp nearby, do a relative jump to the absolute one. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump is taken and 7 when it is not; so jp is faster when the jump occurs and jr is faster when it doesn't.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call, pc + 3 (the address after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, step the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
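&lt;br /&gt;
A minimal sketch of that extension, assuming the penCol and penRow equates of the TI-83 Plus include file (two adjacent bytes, penCol first) and a hypothetical routine name DispAt; the two bytes after the call hold the column and the row:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20         ;column 10, row 20 (illustrative values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl            ;hl -&amp;gt; coordinates stored after the call&lt;br /&gt;
  ld e,(hl)         ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)         ;d = row&lt;br /&gt;
  inc hl            ;hl -&amp;gt; string&lt;br /&gt;
  ld (penCol),de    ;e goes to penCol, d to penRow (assumes penRow = penCol+1)&lt;br /&gt;
  bcall(_vputs)     ;leaves hl after the zero terminator&lt;br /&gt;
  jp (hl)           ;resume at the code after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;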
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to get more delay out of the same two bytes than with nop's :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states including the ld b (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
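&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming interrupts are firing at their usual rate; the label and the count of 20 are only illustrative:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei             ; halt never wakes up if interrupts are disabled&lt;br /&gt;
 ld b,20        ; number of interrupts to wait for&lt;br /&gt;
waitloop:&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz waitloop&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;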
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can repeat the body of a loop several times instead of looping; this is used frequently in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied -&amp;gt; ~17.25 T-states per byte when unrolling 8 times (n = 8)&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
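This unrolling of ldi also works with ldd, and with outi if the loop ends with jp nz instead (outi counts down in b and sets the zero flag when b reaches zero).&lt;br /&gt;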
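&lt;br /&gt;
As a sketch of the outi case, assuming a hypothetical port constant PORT and a byte count that is a multiple of the unroll factor. outi decrements b and sets the zero flag once b reaches zero, so the loop tests nz rather than pe:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld b,16          ; byte count, must be a multiple of 4 here&lt;br /&gt;
 ld c,PORT        ; PORT is a placeholder for the port number&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz, loopouti  ; zero flag is set by the last outi when b=0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;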
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
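&lt;br /&gt;
For instance, a 1000-iteration loop works out as inner_counter8(1000) = (999 &amp;amp; 255) + 1 = 232 and outer_counter8(1000) = (999 &amp;gt;&amp;gt; 8) + 1 = 4, which checks out as 232 + 3 * 256 = 1000. A sketch with those constants written out by hand (the count of 1000 is only an example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 232        ; inner_counter8(1000)&lt;br /&gt;
  ld  d, 4          ; outer_counter8(1000)&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here (runs 232 + 256 + 256 + 256 = 1000 times)&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;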
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set/reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use these, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions do not affect the Sign, Zero or Parity/Overflow flags.&lt;br /&gt;
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you don't forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has &amp;quot;.echo&amp;quot; directive and use this in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE of z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Use an aligned address on memory such as $8000 (theoretical example) and if you will only use 256 bytes ($8000 to $80FF), to get the next byte use inc l instead of inc hl.&lt;br /&gt;
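&lt;br /&gt;
A minimal sketch, assuming a hypothetical 16-byte table whose entries all share the same high address byte:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,table       ; table is a placeholder label, placed so it does not cross a 256-byte boundary&lt;br /&gt;
 ld b,16&lt;br /&gt;
readloop:&lt;br /&gt;
 ld a,(hl)         ; use the entry here&lt;br /&gt;
 inc l             ; 4 T-states instead of 6 for inc hl&lt;br /&gt;
 djnz readloop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;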
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On the if path this takes 31 T-states instead of 33, because the never-taken jp nz costs 10 T-states where the jr endif cost 12. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
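&lt;br /&gt;
For a one-byte else instruction the same idea needs the first byte of a two-byte instruction instead. A sketch, assuming the flags are not needed afterwards, that uses the opcode of cp n ($FE): it swallows the next byte as its operand and leaves a untouched. The inc a / dec a pair is only an illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
inc a&lt;br /&gt;
.db $FE  ;cp n: the dec a opcode below becomes its operand&lt;br /&gt;
else:&lt;br /&gt;
dec a&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;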
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1. This will cause a conditional jump to the displacement byte ($FF) which is the rst $38 opcode. &lt;br /&gt;
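&lt;br /&gt;
A sketch of how this might be written, assuming an assembler where $ is the address of the current instruction (as in the jr $+2 delay shown earlier) and using carry as the example condition:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 jr c,$+1       ; assembles to $38,$FF: if carry is set, jump onto the $FF byte, which runs as rst $38&lt;br /&gt;
;execution continues here when carry is clear, and also after the rst $38 handler returns&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This takes 2 bytes, against 3 bytes for a conditional call to $0038.&lt;br /&gt;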
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
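Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit (0-9, A-F):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		; carry set if a &amp;lt; 10&lt;br /&gt;
	ccf		; carry set if a &amp;gt;= 10&lt;br /&gt;
	adc a, 30h	; a + $30, plus 1 for 10-15&lt;br /&gt;
	daa		; decimal adjust turns $3B-$40 into $41-$46 (A-F)&lt;br /&gt;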
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:37:20Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Replaced &amp;quot;toggling values in loops&amp;quot; example (original example didn't work and couldn't be improved)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so it fits on the calculator. Examples: programs heavy on graphics or data, and graphics code for tile mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good Z80 algorithms generally use the registers in an appropriate way.&lt;br /&gt;
It is also a good practise to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l auxiliary to accumulator and copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or you need optimal execution) (disable interrupts first and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and each instruction is at least 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
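&lt;br /&gt;
As a small illustration of that cost, a sketch of reading one byte and advancing the pointer, assuming hl is free to take over the role of ix:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(ix+0)	; 3 bytes, 19 T-states&lt;br /&gt;
	inc ix		; 2 bytes, 10 T-states&lt;br /&gt;
;try this&lt;br /&gt;
	ld a,(hl)	; 1 byte, 7 T-states&lt;br /&gt;
	inc hl		; 1 byte, 6 T-states&lt;br /&gt;
; -&amp;gt; save 3 bytes and 16 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;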
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass through the accumulator a value to a routine&lt;br /&gt;
- if the only valid values of the accumulator range from 0 to 63 and if in that routine you need to multiply the accumulator by, say 12, it has to be stored in a 16-bit pair register.&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252 which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is pass A (the accumulator) right in the start to HL&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you shaved off a few clock cycles and, in the last version, saved a byte too.&lt;br /&gt;
The same approach works for registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, while add hl,hl is only 1 byte.&lt;br /&gt;
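&lt;br /&gt;
Side by side, the two variants might look like this (a sketch assuming l holds a value below 64 and h can be overwritten):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Faster: 6 bytes, 23 clocks&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4 still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
&lt;br /&gt;
; Smaller: 4 bytes, 29 clocks&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl	; hl=l*2&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;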
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is already bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl  and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register to the stack inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
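&lt;br /&gt;
A minimal sketch of a loop counter kept in ixl, assuming your assembler accepts the undocumented ixl mnemonics (otherwise the opcode bytes in the comments can be emitted with .db); the count of 8 is only an example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8	; undocumented, opcode $DD,$2E,$08&lt;br /&gt;
loop:&lt;br /&gt;
	; loop body here, free to use a, bc, de and hl&lt;br /&gt;
	dec ixl		; undocumented, opcode $DD,$2D, sets the zero flag like dec l&lt;br /&gt;
	jr nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;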
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and cannot either refactor your algorithm(s) or rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets at using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common in arithmetic on big integers (16-bit to 32-bit) or BCD operations, where they avoid relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to assign from a standard register to a shadow register or vice versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces code that is ugly and very hard to follow, so comment it thoroughly to make later understanding and debugging easier.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally during an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers and no calls are allowed.&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is to use an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without having to use inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (beyond 10 bytes, since 5 pops already take 5 bytes).&lt;br /&gt;
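&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with 4 bytes of local storage might look like the sketch below. The label and the frame size are hypothetical, and it assumes the body leaves ix untouched and sp balanced:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
routine:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix,-4&lt;br /&gt;
 add ix,sp&lt;br /&gt;
 ld sp,ix         ; allocate 4 bytes : locals are (ix+0)..(ix+3)&lt;br /&gt;
 ld (ix+0),a      ; store a in the first local&lt;br /&gt;
 ; ... rest of the body ...&lt;br /&gt;
 add a,(ix+0)     ; use the local directly as an operand&lt;br /&gt;
 ld hl,4          ; destroys hl&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 ld sp,hl         ; drop the 4 bytes of locals&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;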
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast Z80 code. After that comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in Z80 assembly, vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Look through a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the more special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When possible, if you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When it seems that your code won't be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations only, because of all the work needed to rewrite it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher-level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (Z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that explains the opcodes is useful.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00      ; the offset byte is a placeholder, patched with the number of rrca to skip&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify load instructions of the form (ix+0), changing the 0 to other values to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
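&lt;br /&gt;
A minimal sketch of this idea, assuming ix points to the start of the list and a hypothetical variable named index holds the element number (it must fit the signed displacement range -128..127):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(index)&lt;br /&gt;
 ld (elem_off),a   ; patch the displacement byte of the load below&lt;br /&gt;
;... code ...&lt;br /&gt;
elem_off = $+2     ; ld a,(ix+d) is encoded as DD 7E d&lt;br /&gt;
 ld a,(ix+0)       ; 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;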
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how a piece of code works after not reading it for a while.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the Z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but the resulting flags differ)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
You should almost never call and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed on the stack. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127)&lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but shorter) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
Consider a board game that needs to alternate between players 1 and 2 at every turn:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 inc a          ; a=2 or 3&lt;br /&gt;
 cp 3&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,1         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 8 bytes, 30 or 32 clocks&lt;br /&gt;
&lt;br /&gt;
;Better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 dec a          ; a=0 or 1&lt;br /&gt;
 jr nz,label&lt;br /&gt;
 ld a,2         ; a=2 or 1&lt;br /&gt;
label:&lt;br /&gt;
; 6 bytes, 23 or 25 clocks&lt;br /&gt;
&lt;br /&gt;
;Even better&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 cpl            ; a=-2 or -3&lt;br /&gt;
 add a,4        ; a=2 or 1, same as calculating 3-a&lt;br /&gt;
; 4 bytes, 18 clocks&lt;br /&gt;
&lt;br /&gt;
;Best&lt;br /&gt;
 ld a,(hl)      ; a=1 or 2&lt;br /&gt;
 xor 3          ; a=2 or 1&lt;br /&gt;
; 3 bytes, 14 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The trick is that xor-ing with a constant makes a register alternate between two values: here 1 xor 3 = 2 and 2 xor 3 = 1.&lt;br /&gt;
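&lt;br /&gt;
The same idea works for any pair of values if you xor with the xor of the two. A minimal sketch, assuming the hypothetical values 5 and 9:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(hl)      ; a=5 or 9&lt;br /&gt;
 xor 12         ; a=9 or 5, because 5 xor 9 = 12&lt;br /&gt;
 ld (hl),a      ; store the toggled value back&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;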
&lt;br /&gt;
==== Table alignment ====&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic problem of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to the absolute one. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp is faster when the jump is taken (10 T-states versus 12 for jr) and jr is faster when it is not taken (7 T-states versus 10).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;decrements bc, but also increments hl and changes the flags&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, pc + 3 (the address right after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (ld hl,string is gone), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be extended to also pass the coordinates where the text should appear.&lt;br /&gt;
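&lt;br /&gt;
One possible way to do so is sketched below. DispAt is a hypothetical name, the coordinates are illustrative, and the code assumes that penRow is the byte right after penCol (as in the usual ti83plus.inc) and that _vputs leaves hl pointing past the terminating 0, which the Disp routine above already relies on:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 20,10                   ; column, row&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl            ; hl points to the inline data&lt;br /&gt;
  ld e,(hl)         ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)         ; d = row&lt;br /&gt;
  inc hl            ; hl points to the string&lt;br /&gt;
  ld (penCol),de    ; e goes to penCol, d to penRow&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)           ; resume right after the terminating 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;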
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will be overwritten by the port read.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to get a longer delay out of the same number of bytes than with nop's:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use.&lt;br /&gt;
* Beware that not all instructions preserve registers or flags.&lt;br /&gt;
* For delays between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
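&lt;br /&gt;
A minimal sketch of a halt-based delay, assuming interrupts can safely be enabled at this point. The count of 20 is an arbitrary illustrative value, and the real duration depends on the interrupt rate:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei             ; halt never returns if interrupts are disabled&lt;br /&gt;
 ld b,20        ; number of interrupts to wait for&lt;br /&gt;
wait:&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz wait&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;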
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop, repeating its body several times instead of looping; this is used frequently in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time you will prefer size over speed.&lt;br /&gt;
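&lt;br /&gt;
As an illustration, here is a sketch of the classic 8-bit by 8-bit shift-and-add multiplication with its body written out 8 times. The label is hypothetical. Compared with a djnz loop it needs no counter register and avoids the 13 T-states that djnz costs per iteration, at the price of a larger routine:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; hl = h * e, destroys d&lt;br /&gt;
mul_h_e:&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 add hl,hl      ; top bit of the multiplier goes to carry&lt;br /&gt;
 jr nc,$+3      ; skip the 1-byte add hl,de if that bit was 0&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl      ; same three instructions, 7 more times&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;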
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied with n ldi per pass -&amp;gt; 17.25 when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling also works with ldd and, looping on nz instead of pe, with outi.&lt;br /&gt;
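&lt;br /&gt;
When size is a non-zero constant that is not a multiple of the unroll factor, the entry point for the first pass can be computed at assembly time. A sketch, assuming your assembler accepts this expression syntax (skip is a name made up for the example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
skip = (8 - (size &amp;amp; 7)) &amp;amp; 7   ; number of ldi to skip on the first pass&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 jp loopldi + (2 * skip)       ; each ldi is 2 bytes long&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi                ; P/V stays set while bc is not zero&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
For example, with size = 10 skip is 6: the first pass copies 2 bytes, which leaves bc = 8 for exactly one full pass.&lt;br /&gt;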
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a few more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
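&lt;br /&gt;
As an illustration, here is how the macros could feed the second loop for a fixed count of 1000 iterations (a sketch, assuming an assembler that expands #define macros with arguments):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; (999 &amp;amp; 0xff) + 1 = 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; (999 &amp;gt;&amp;gt; 8) + 1 = 4&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2    ; 232 iterations on the first pass, 256 on each of the next three&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;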
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair (the second version destroys a instead of de)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you may want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use these, remember that the flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags may require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and P/V flags untouched.&lt;br /&gt;
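&lt;br /&gt;
A minimal sketch of a routine that reports its status through the carry flag (isDigit and notADigit are hypothetical names, not routines from any library):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; returns carry reset if a holds an ASCII digit, carry set otherwise&lt;br /&gt;
; a is preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp $30        ; '0'&lt;br /&gt;
 ret c         ; below '0': carry is already set&lt;br /&gt;
 cp $3A        ; '9' + 1&lt;br /&gt;
 ccf           ; carry was set for a &amp;lt;= '9', so invert it&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; caller&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,notADigit   ; hypothetical error handler&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;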
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then keep the following at hand:&lt;br /&gt;
* Keep a z80 instruction set reference so you do not forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to measure size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator such as wabbitemu ([[:Category:Emulators|Emulators]])&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If it is at most 256 bytes long ($8000 to $80FF), you can step to the next byte with inc l (4 T-states) instead of inc hl (6 T-states).&lt;br /&gt;
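&lt;br /&gt;
Alignment also makes indexed lookups cheap, since the index can be dropped straight into l. A sketch, assuming a hypothetical label table that starts on a 256-byte boundary:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; a = index (0..255), table starts at $xx00&lt;br /&gt;
 ld h,table &amp;gt;&amp;gt; 8   ; high byte of the table address&lt;br /&gt;
 ld l,a                   ; the low byte is simply the index&lt;br /&gt;
 ld a,(hl)                ; a = entry number index&lt;br /&gt;
; walking on from there&lt;br /&gt;
 inc l                    ; next entry, h never changes inside the table&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;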
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them confuse disassemblers and even coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if path now takes 31 T-states instead of 33 (jp nz costs 10 T-states where jr cost 12). A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1. This will cause a conditional jump to the displacement byte ($FF) which is the rst $38 opcode. &lt;br /&gt;
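&lt;br /&gt;
Written out next to the usual form, with the trick emitted as raw bytes in case the assembler refuses a jump into the middle of an instruction (a sketch; nz is just one possible condition):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; usual way: 3 bytes&lt;br /&gt;
 jr z,$+3&lt;br /&gt;
 rst $38&lt;br /&gt;
; trick: 2 bytes&lt;br /&gt;
 .db $20,$FF   ; jr nz,$+1 : when nz, execution resumes at the $FF byte, which is rst $38&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;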
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
The DAA instruction is normally used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
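&lt;br /&gt;
One possible use is turning a whole byte into two hex characters. A sketch with hypothetical routine names (byteToHex and nibbleToHex are not from any library):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = byte, output: d = ASCII high digit, e = ASCII low digit, destroys a&lt;br /&gt;
byteToHex:&lt;br /&gt;
	ld e,a			; keep a copy for the low nibble&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca			; high nibble is now in bits 0..3&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld d,a&lt;br /&gt;
	ld a,e&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ret&lt;br /&gt;
nibbleToHex:&lt;br /&gt;
	and $0F			; keep only the low nibble&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf			; carry set for 10..15&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa			; 3Bh..40h become 41h..46h ('A'..'F')&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;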
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:23:08Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Created sections for &amp;quot;fallthrough looping&amp;quot;, &amp;quot;toggling values in loops&amp;quot;, and &amp;quot;table alignment&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so it fits on the calculator. Examples: programs heavy on graphics or data, and graphics code for mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks section of this page.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally make appropriate use of the registers.&lt;br /&gt;
It is also good practice to keep to a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or a copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower because the equivalent 16-bit instruction is slower, or does not exist and must be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
Operations on the ix or iy registers are slower still and are always 1 byte bigger per instruction. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved some bytes, too.&lt;br /&gt;
You can do this for registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes while add hl,hl is only 1 byte.&lt;br /&gt;
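&lt;br /&gt;
Spelled out with the usual byte and T-state counts, the two ways of getting l*4 into hl compare as follows (a sketch, assuming l is below 64 on entry):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; smaller: 4 bytes, 29 T-states&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
&lt;br /&gt;
; faster: 6 bytes, 23 T-states&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4 still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;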
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
Only proceed with this kind of optimization when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, shifting in zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of keeping a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
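&lt;br /&gt;
A sketch of ixl as the outer counter of a nested loop, assuming an assembler that accepts the undocumented ixl mnemonics (the labels and counts are made up for the example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8		; undocumented: DD 2E n&lt;br /&gt;
outerLoop:&lt;br /&gt;
	ld b,16			; b stays free for the inner djnz loop&lt;br /&gt;
innerLoop:&lt;br /&gt;
	; inner loop body here&lt;br /&gt;
	djnz innerLoop&lt;br /&gt;
	dec ixl			; undocumented: DD 2D, sets the Zero flag like any 8-bit dec&lt;br /&gt;
	jr nz,outerLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;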
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common in arithmetic on big integers (16-bit to 32-bit) and in BCD operations, where they avoid relying on RAM storage or on pushing and popping the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
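; INPUT (AS READ FROM THE CODE): MPR IN B'C'BC, MPD IN D'E'DE&lt;br /&gt;
; INTERRUPTS ARE LEFT DISABLED ON EXIT&lt;br /&gt;
MUL32:&lt;br /&gt;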
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
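&lt;br /&gt;
A possible way to call it, assuming the register layout read from the code above (multiplier in b'c'bc, multiplicand in d'e'de) and a caller that wants interrupts back on afterwards:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; computes 100000 * 1000 = 100000000 ($05F5E100)&lt;br /&gt;
 di            ; keep interrupt code away from the shadow registers while they are set up&lt;br /&gt;
 ld bc,$0001   ; high word of the multiplier ($000186A0 = 100000)&lt;br /&gt;
 ld de,$0000   ; high word of the multiplicand ($000003E8 = 1000)&lt;br /&gt;
 exx           ; park the high words in bc' and de'&lt;br /&gt;
 ld bc,$86A0   ; low word of the multiplier&lt;br /&gt;
 ld de,$03E8   ; low word of the multiplicand&lt;br /&gt;
 call MUL32    ; returns h'l' = $05F5 and hl = $E100&lt;br /&gt;
 ei            ; MUL32 leaves interrupts disabled&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;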
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow one or vice-versa. Instead you must use nasty constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it is not. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally in an interrupt loop that demands as much speed as possible while all the normal registers are in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way you cannot push or pop registers, no calls are allowed, and interrupts must stay disabled because an interrupt pushes its return address through sp.&lt;br /&gt;
Again, only use this as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies operations that are complicated to do with a fixed storage location (and that generally clobber a register in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 inc (ix + n)     ; 23 T-states&lt;br /&gt;
 dec (ix + n)     ; 23 T-states&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)     ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7; bit only takes 20 T-states&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and at some point you start saving space as well (beyond 10 bytes, where the 5 pops needed would match the 5 bytes used here).&lt;br /&gt;
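&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with a few bytes of local storage could look like this (a sketch; the routine name and the 4-byte size are made up for the example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
routine:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix,-4&lt;br /&gt;
 add ix,sp&lt;br /&gt;
 ld sp,ix         ; allocate: locals live at (ix+0)..(ix+3)&lt;br /&gt;
 ld (ix+0),5      ; initialize a local counter&lt;br /&gt;
 ; ...&lt;br /&gt;
 dec (ix+0)       ; work on it in place, no register needed&lt;br /&gt;
 ; ...&lt;br /&gt;
 ld hl,4&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 ld sp,hl         ; deallocate the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Any push inside the routine must be matched by a pop before the deallocation, otherwise sp will not point at the saved ix.&lt;br /&gt;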
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most used code in subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or lookup tables) give smaller and faster code than blocks of comparisons and jumps. Other times, using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
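&lt;br /&gt;
For instance, a vector table can replace a chain of cp/jr tests. A minimal sketch, assuming three hypothetical handlers and an index that is already known to be in range:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; a = command number (0..2), destroys a, de, hl&lt;br /&gt;
dispatch:&lt;br /&gt;
 add a,a          ; 2 bytes per table entry&lt;br /&gt;
 ld e,a&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld hl,handlers&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a           ; hl = address of the handler&lt;br /&gt;
 jp (hl)&lt;br /&gt;
handlers:&lt;br /&gt;
 .dw cmdMove, cmdFire, cmdQuit   ; hypothetical handler labels&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The cost stays the same however many commands there are, whereas a comparison chain grows with every new case.&lt;br /&gt;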
&lt;br /&gt;
Look through a complete instruction set for an instruction that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rarer special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in order to get code out of the main loop, do it.&lt;br /&gt;
* When your code looks like it won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work involved).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2     ; displacement 0 is just a placeholder, the stored value is the number of rrca to skip (0..8)&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to patch the 0 in load instructions that use (ix+0), which gives very quick reads and writes of the nth element of a list without using any extra registers.&lt;br /&gt;
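&lt;br /&gt;
A sketch of that idea, following the naming style of the examples above (elemOffset is a made-up label, and ix is assumed to point at the start of the list):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemOffset),a   ; a = offset of the element, 0..127&lt;br /&gt;
;...code...&lt;br /&gt;
elemOffset = $+2     ; the displacement is the third byte of ld a,(ix+d)&lt;br /&gt;
 ld a,(ix+0)         ; the 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;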
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of them, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget them after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every extra ld (dataX),a once past the initial overhead (smaller, but slower)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC as well)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a! Only the zero flag matches cp 1, the carry flag is not updated&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states !&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
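&lt;br /&gt;
The same idea works in the other direction. A minimal sketch, assuming you want to add the constant 513 to hl and do not need the carry flag that add hl,de would produce:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,513&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    inc h   ; +256&lt;br /&gt;
    inc h   ; +512&lt;br /&gt;
    inc hl  ; +513&lt;br /&gt;
; -&amp;gt; save 1 byte and 7 T-states, and de is not destroyed&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;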
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra          ;destroys a&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla          ;destroys a&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100     ;a is left as %100 when ret nz is taken&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
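&lt;br /&gt;
The opposite test follows the same pattern. A small sketch for jumping when a is greater than 9 (label is a placeholder):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;gt;9 then goto label&lt;br /&gt;
  jr z,$+5    ;skip the jp when a=9&lt;br /&gt;
  jp nc,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;gt;=10 then goto label&lt;br /&gt;
  jp nc,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 2 bytes&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;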
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
A call immediately followed by a ret can almost always be replaced by a jump:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call as a data pointer. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to 0..127) &lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
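&lt;br /&gt;
The table versions jump to an arbitrary address if Number is larger than the last table entry. A hedged sketch of a guard, assuming the table holds one entry for each value from 0 to 5 and that OutOfRange is a hypothetical label of your own:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 6              ; carry is reset when a&amp;gt;=6&lt;br /&gt;
 jr nc,OutOfRange  ; index beyond the table: do not look it up&lt;br /&gt;
 add a,a           ; safe: a is 0..5 here&lt;br /&gt;
;continue with the table lookup as shown above&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;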
&lt;br /&gt;
==== Fallthrough looping ====&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed, but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
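&lt;br /&gt;
The pattern extends to any power of two by stacking one more call per doubling. A sketch for eight runs, reusing the bar routine from above (bar4 is a label introduced only for this example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar4     ; Run routine eight times&lt;br /&gt;
bar4:&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;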
&lt;br /&gt;
==== Toggling values in loops ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor $02   ; the trick: xor with (2 xor 0) makes a alternate between 2 and 0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use; a toggles once per pass and must not be changed by code1 or code2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
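&lt;br /&gt;
To alternate between any two values, xor with the xor of both. A small sketch, assuming the two values are 3 and 5 (3 xor 5 = 6) and that the loop body leaves a alone:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld a,3&lt;br /&gt;
loop:&lt;br /&gt;
;code using a (3 on odd passes, 5 on even passes)&lt;br /&gt;
 xor 6     ; 3 xor 6 = 5, 5 xor 6 = 3&lt;br /&gt;
 djnz loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;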
&lt;br /&gt;
==== Table alignment ====&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the high byte of the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
Trading size against speed is the classic optimization problem in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters: ASM is generally fast enough, and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) but there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp is faster when the jump is taken (10 T-states against 12) and jr is faster when it is not (7 T-states against 10).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; -&amp;gt; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
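When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. The routine can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, step the pointer past the data and jp (hl).&lt;br /&gt;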
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (ld hl + bcall is 6 bytes, call Disp is 3), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
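&lt;br /&gt;
A hedged sketch of such an expansion. It assumes the TI-83 Plus equates penCol and penRow are adjacent bytes (penCol first), and that _vputs leaves hl just past the terminating zero, as the Disp routine above already relies on; DispAt is a name made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                       ;pen column, pen row&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl            ;hl points to the inline data&lt;br /&gt;
  ld e,(hl)         ;pen column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)         ;pen row&lt;br /&gt;
  inc hl            ;hl now points to the string&lt;br /&gt;
  ld (penCol),de    ;sets penCol and penRow in one store&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)           ;resume right after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;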
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
Sometimes you need a delay between operations, such as reads and writes to ports, '''''and there is nothing useful to do''''' in the meantime. Because nops are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that cost fewer bytes per T-state than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable), counting the ld b&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop, repeating its body several times instead of looping; this is used frequently in multiplication routines.&lt;br /&gt;
This spends memory to gain speed, while most of the time you will prefer size over speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -&amp;gt; ~17 T-states per byte when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
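This unrolling of ldi also works with ldd, and with outi (which counts in b and sets the zero flag when b reaches 0, so the loop ends with jp nz instead of jp pe).&lt;br /&gt;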
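&lt;br /&gt;
A sketch of the same unrolling for port output with outi. Here src, count and port are placeholders, and count is assumed to be a multiple of 4 (b = 0 means 256 bytes):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld b,count&lt;br /&gt;
 ld c,port&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi              ; write (hl) to port c, inc hl, dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz, loopouti   ; outi sets the zero flag when b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;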
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
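&lt;br /&gt;
As a worked example of these macros, here is a sketch for a loop of 1000 iterations, with the two counters computed by hand (1000 - 1 = 999 = 3 * 256 + 231):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 232      ; inner_counter8(1000) = (999 &amp;amp; 0xff) + 1 = 231 + 1&lt;br /&gt;
  ld  d, 4        ; outer_counter8(1000) = (999 &amp;gt;&amp;gt; 8) + 1 = 3 + 1&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here (must preserve b and d)&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
; the body runs 232 + 3 * 256 = 1000 times&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;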
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
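&lt;br /&gt;
For a 16-bit register pair the inc/dec trick does not work, so the usual test goes through the accumulator. A minimal sketch for de (label is a placeholder):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; test whether de is 0 (destroys a)&lt;br /&gt;
	ld a,d&lt;br /&gt;
	or e        ; zero flag is set only if both d and e are 0&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;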
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using a 16-bit register&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
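&lt;br /&gt;
A branch-free variant uses the same carry trick as the jump table lookup. A sketch with constant timing; like the version above it destroys a:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	add a,l     ; a = a + l, carry set on overflow&lt;br /&gt;
	ld l,a&lt;br /&gt;
	adc a,h     ; a = l + h + carry&lt;br /&gt;
	sub l       ; a = h + carry&lt;br /&gt;
	ld h,a&lt;br /&gt;
;5 bytes, 20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;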
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
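&lt;br /&gt;
A sketch of a routine that reports its result in the carry flag. The routine name isDigit and the label wasDigit are made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; returns carry set if a holds an ASCII digit, a is preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp '9'+1    ; carry set if a&amp;lt;='9'&lt;br /&gt;
 ret nc      ; above '9': return with carry reset&lt;br /&gt;
 cp '0'      ; carry set if a&amp;lt;'0'&lt;br /&gt;
 ccf         ; invert: carry set if a&amp;gt;='0'&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; usage&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,wasDigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;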
&lt;br /&gt;
Were you to use this, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (alters a and leaves the carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you don't forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE of z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Use an aligned address in memory such as $8000 (theoretical example). If you will only use 256 bytes ($8000 to $80FF), use inc l instead of inc hl to get the next byte (4 T-states instead of 6).&lt;br /&gt;
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if path now takes 31 T-states instead of 33. A small saving, but could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
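The same trick works for a 1-byte else instruction, using the first byte of a 2-byte instruction whose side effects you can tolerate. The only reasons not to use it are code readability and bug safety. Watch those flags!&lt;br /&gt;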
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1 (that is, a jr cc with a displacement byte of $FF). The jump lands on the displacement byte itself, and $FF is the rst $38 opcode. This takes 2 bytes instead of the 3 bytes of call cc,$0038.&lt;br /&gt;
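&lt;br /&gt;
Written out as raw bytes, a sketch of the carry version (useful if your assembler does not accept the jr c, -1 form directly):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 .db $38,$FF   ; jr c,-1 : $38 is the jr c opcode, $FF is both the displacement and rst $38&lt;br /&gt;
;execution continues here, either directly or after the rst $38 handler returns&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;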
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
The DAA instruction is normally used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0' to '9', 'A' to 'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
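&lt;br /&gt;
A sketch of how the trick might be used to turn a whole byte into two hexadecimal characters. The routine names and the choice of d and e for the results are assumptions made for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = byte; output: d = ASCII of high nibble, e = ASCII of low nibble&lt;br /&gt;
byteToHex:&lt;br /&gt;
	ld e,a              ; keep a copy for the low nibble&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca                ; high nibble is now in bits 0-3&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld d,a&lt;br /&gt;
	ld a,e&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ret&lt;br /&gt;
nibbleToHex:&lt;br /&gt;
	and $0F             ; keep only the low nibble&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf                 ; carry set if the nibble is 10 or more&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa                 ; $3B..$40 become $41..$46 ('A'..'F')&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;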
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:10:10Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;look up table&amp;quot; example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so it fits on the calculator. Examples: programs that are heavy on graphics or data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Generally, good algorithms on the Z80 use the registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l auxiliary to accumulator and copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer of a address memory&lt;br /&gt;
* de - pointer of a destination address memory&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or when you need optimal execution) (disable interrupts, and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before it overflows (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved some bytes, too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot;sla l \ sla l \ ld h,0&amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, while add hl,hl is only 1 byte.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need the speed and the code is bug free.&lt;br /&gt;
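&lt;br /&gt;
The same reasoning applies to other constants. A sketch for multiplying by 10, assuming a is at most 63 so that a*4 still fits in 8 bits:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld e,a		; e=a&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4 (at most 252)&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a&lt;br /&gt;
	add hl,de	; hl=a*5&lt;br /&gt;
	add hl,hl	; hl=a*10&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;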
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift of 8 bits to the left, with zeros shifted in. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of keeping a counter in memory or pushing/popping an already used register inside a loop. Note that using ixh/ixl/iyh/iyl breaks compatibility with the TI-84+SE emulated by the Nspire, and that the i register can only be used for other purposes if you disable interrupts first (di).&lt;br /&gt;
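&lt;br /&gt;
A minimal sketch of a loop counted in ixl, assuming your assembler accepts the undocumented ixl mnemonics (some spell them lx or xl, others need the opcodes entered with .db); the label name is only illustrative:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ld ixl,8	; undocumented: low half of ix used as the counter&lt;br /&gt;
rowLoop:&lt;br /&gt;
	; loop body, free to use a, bc, de and hl&lt;br /&gt;
	dec ixl		; undocumented: sets Z when the counter reaches 0&lt;br /&gt;
	jr nz,rowLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;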
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations without relying on RAM storage or on pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to copy from a standard register to a shadow one or vice versa. Instead you must use awkward constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl (the shadow set stays active afterwards)&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately, this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it thoroughly to help later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (a notable example is James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way, you cannot push/pop registers, make calls, or let an interrupt fire, since all of these read or write the memory that sp points to.&lt;br /&gt;
Mind again that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated pushes, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states (4 pushes would take 4 bytes and 44 T-states, and would force the allocation of 8 bytes)&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since every allocated &amp;quot;variable&amp;quot; can then be reached without inc/dec, but this is not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple loads and stores, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7; bit takes only 20 T-states&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared to a pop but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pops)&lt;br /&gt;
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most heavily used code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or look-up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably for graphical screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference to look for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check for the rarer, more special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange the program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even with optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work involved).&lt;br /&gt;
* Remember that it is almost always better to leave optimization for the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher level languages, further optimization may not be needed at all.&lt;br /&gt;
* Document wacky optimizations so the code can be understood later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code runs from RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful here.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used on instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC trick is to modify load instructions that use (ix+0), changing the 0 to other values in order to read or write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
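&lt;br /&gt;
A minimal sketch of that last idea, assuming the code runs from RAM, ix points to the start of a list of bytes and a holds an element index in 0..127 (elemOffset is just an illustrative label). The displacement is the third byte of ld a,(ix+d), hence the $+2:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemOffset),a   ; patch the displacement byte of the load below&lt;br /&gt;
;... code ...&lt;br /&gt;
elemOffset = $+2&lt;br /&gt;
 ld a,(ix+0)         ; the 0 is just a placeholder, this now reads element n&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The same layout applies to the store form ld (ix+0),a, whose displacement is also its third byte.&lt;br /&gt;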
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after a while away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every extra ld (dataX),a once past the initial overhead, at the cost of speed (ldir takes 21 T-states per byte, ld (dataX),a only 13)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this (if hl is not tied up, use indirection):&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;DE receives the pushed BC directly (BC itself is not restored)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negated constant&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but the flags differ: add hl,de does not update Z and the carry is inverted)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases, and de is left untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100      ;a=0 when bit 2 is clear, but a=%100 (not its original value) when the ret is taken&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
A call immediately followed by a ret can almost always be replaced by a jump:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call (for example as a pointer to inline data, as some kinds of inline vputs do).&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
&lt;br /&gt;
; Best&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128) &lt;br /&gt;
 add a,VectorTable%256&lt;br /&gt;
 ld l,a&lt;br /&gt;
 adc a,VectorTable/256&lt;br /&gt;
 sub l&lt;br /&gt;
 ld h,a&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
Fallthrough looping&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Others ====&lt;br /&gt;
&lt;br /&gt;
Toggling values in loops.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor 2     ; the trick: xoring with (2 xor 0) = 2 makes a alternate between 2 and 0 each time this runs (code1 and code2 must preserve a)&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:Table alignment&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic optimization dilemma in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give users a smaller program that doesn't take up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps take 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump is taken (jr then takes 12), whereas jr is faster when a conditional jump is not taken (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you make a call, pc + 3 (the address right after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, move the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
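&lt;br /&gt;
One possible way to do so is sketched below. DispAt is a hypothetical name, and the sketch assumes that penRow is stored right after penCol (as in the usual TI-83 Plus include file) and that _vputs leaves hl just past the terminating 0, which the Disp routine above already relies on:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                   ; pen column, pen row (arbitrary values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl             ; hl points to the inline data&lt;br /&gt;
  ld e,(hl)          ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)          ; d = row&lt;br /&gt;
  inc hl             ; hl now points to the string&lt;br /&gt;
  ld (penCol),de     ; e goes to penCol, d to penRow (assumed adjacent)&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)            ; resume right after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;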
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, consider other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags there are still ways to delay that waste less size than nop's :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable, ld b included)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
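&lt;br /&gt;
A minimal sketch of such a halt-based delay, assuming the default interrupt setup is in place (the count of 10 and the label name are arbitrary):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ei		; halt never wakes up if interrupts are disabled&lt;br /&gt;
	ld b,10		; number of interrupts to wait for&lt;br /&gt;
waitInts:&lt;br /&gt;
	halt		; sleep until the next interrupt&lt;br /&gt;
	djnz waitInts&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;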
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop, repeating its body several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This trades memory for speed, so only do it where speed really matters: most of the time size is the scarcer resource.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -&amp;gt; ~17 T-states when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
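This unrolling of ldi also works with ldd, and with outi if the loop is closed with jp nz instead (outi counts in b and sets the Zero flag when b reaches zero).&lt;br /&gt;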
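&lt;br /&gt;
As a minimal sketch of the entry trick mentioned in the comments above, assume the size is known at assembly time and is not a multiple of 8 (here 20, so 20 mod 8 = 4). The constants are illustrative only. Each ldi is 2 bytes long, so jumping 2 * (8 - 4) bytes past loopldi makes the first pass copy just 4 bytes, after which bc is a multiple of 8:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,20              ; total size, not a multiple of 8&lt;br /&gt;
 jp loopldi+(2*(8-4))  ; enter at the 5th ldi: the first pass copies 4 bytes&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi        ; bc goes 20, 16, 8, 0 at this point, then falls through&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;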
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (13 * n + 9 * ceil(n / 256) T-states)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) ((((counter16) - 1) &amp;amp; 0xff) + 1)&lt;br /&gt;
  #define outer_counter8(counter16) ((((counter16) - 1) &amp;gt;&amp;gt; 8) + 1)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
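&lt;br /&gt;
For instance, assuming a loop that must run 1000 times (an arbitrary figure chosen for illustration), 1000 - 1 = 999 = $03E7, so the two counters can be loaded directly instead of running the four conversion instructions at runtime:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld b, inner_counter8(1000)   ; $E7 + 1 = $E8 = 232 iterations for the first inner pass&lt;br /&gt;
  ld d, outer_counter8(1000)   ; $03 + 1 = 4 outer passes&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
; total: 232 + 3 * 256 = 1000 iterations&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;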
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register pair (the second way destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry and Sign flags)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use this, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
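&lt;br /&gt;
As a sketch of the status-flag idea described above, a hypothetical routine (the names isDigit and notADigit are only illustrative) can report failure through the carry flag while leaving every register untouched:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; isDigit: carry reset if a holds an ASCII digit, carry set otherwise&lt;br /&gt;
; preserves all registers&lt;br /&gt;
isDigit:&lt;br /&gt;
 cp '0'&lt;br /&gt;
 ret c        ; a is below '0': cp already set the carry&lt;br /&gt;
 cp '9'+1     ; carry set if a is at most '9'&lt;br /&gt;
 ccf          ; invert it so that carry reset means success&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; caller side&lt;br /&gt;
 call isDigit&lt;br /&gt;
 jr c,notADigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;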
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a, leaves the carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then keep these at hand:&lt;br /&gt;
* Keep a z80 instruction set reference nearby so you do not forget a useful instruction or which flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If the table is at most 256 bytes long ($8000 to $80FF), only the low byte of the pointer changes, so you can use inc l instead of inc hl to get to the next byte.&lt;br /&gt;
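&lt;br /&gt;
A minimal sketch, assuming a 256-byte table placed at the $8000 address of the example above (the label name is illustrative). Since the low byte of the address is the index itself, a lookup needs no 16-bit addition:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
table = $8000     ; hypothetical 256-byte aligned table&lt;br /&gt;
 ld h,$80         ; high byte of the table address&lt;br /&gt;
 ld l,a           ; a = index, 0 to 255&lt;br /&gt;
 ld a,(hl)        ; a = entry number index&lt;br /&gt;
 inc l            ; next entry: 4 T-states instead of 6 for inc hl, wraps inside the table, alters flags&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;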
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them confuse disassemblers and even the coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if path now takes 31 T-states instead of 33. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc with a displacement of -1 (jr cc,$+1 in assemblers that expect a target address). This causes a conditional jump to the displacement byte itself ($FF), which is the rst $38 opcode.&lt;br /&gt;
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also be used to convert a value from 0 to 15 in a into its ASCII hexadecimal digit (0 to 9, A to F).&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
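&lt;br /&gt;
To see why this works, the following sketch traces the four instructions for two sample inputs, 9 and 10. It assumes a holds a value from 0 to 15 on entry:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; a = 9&lt;br /&gt;
	cp 10		; 9 is below 10: carry set&lt;br /&gt;
	ccf		; carry reset&lt;br /&gt;
	adc a, 30h	; a = $09 + $30 + 0 = $39&lt;br /&gt;
	daa		; low nibble is a valid BCD digit, no adjustment: a = $39 = '9'&lt;br /&gt;
; a = 10&lt;br /&gt;
	cp 10		; 10 is not below 10: carry reset&lt;br /&gt;
	ccf		; carry set&lt;br /&gt;
	adc a, 30h	; a = $0A + $30 + 1 = $3B&lt;br /&gt;
	daa		; low nibble is above 9, so 6 is added: a = $41 = 'A'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;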
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T17:07:05Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed &amp;quot;look up table&amp;quot; example (original and optimized versions now have exactly the same behavior)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so that it fits on the calculator. Typical examples are programs with a lot of graphics or data, and graphics code for mapping, grayscale and 3D.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut a few bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally make appropriate use of the registers.&lt;br /&gt;
It is also good practice to keep to a convention and to plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use when there is no other option or when you need optimal execution; disable interrupts and restore the original value on exit because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor operates faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced by several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and each instruction is 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63 and the routine needs to multiply it by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 without overflowing (63*4 = 252, which does not exceed 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to copy A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved some bytes, too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes while add hl,hl is only 1 byte.&lt;br /&gt;
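&lt;br /&gt;
Spelled out with sizes and timings, and assuming l holds a value below 64 on entry, the two variants compare as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 8-bit shifts: 6 bytes, 23 clocks&lt;br /&gt;
	sla l&lt;br /&gt;
	sla l		; l*4 still fits in 8 bits&lt;br /&gt;
	ld h,0		; hl=l*4&lt;br /&gt;
&lt;br /&gt;
; 16-bit adds: 4 bytes, 29 clocks&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=l*4&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;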
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros entering from the right. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl, or even the i register, for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly commonly used for arithmetic on big integers (16-bit to 32-bit) or for BCD operations, without relying on RAM storage or on pushing and popping on the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where that is affordable and others where it is not. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately it is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it very well to ease understanding and debugging later.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible while all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp in this way you cannot push/pop registers and no calls are allowed.&lt;br /&gt;
Mind again that this is only to be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally trash registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).&lt;br /&gt;
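&lt;br /&gt;
Putting allocation, access and deallocation together, a routine with stack-allocated locals could look like the following sketch. The routine name and the choice of 4 bytes are assumptions made for illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
withLocals:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix,-4&lt;br /&gt;
 add ix,sp&lt;br /&gt;
 ld sp,ix         ; allocate 4 bytes, reachable as (ix+0) to (ix+3)&lt;br /&gt;
&lt;br /&gt;
 ld (ix+0),a      ; store a in the first local&lt;br /&gt;
 inc (ix+0)       ; work on it in place&lt;br /&gt;
&lt;br /&gt;
 ld hl,4&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 ld sp,hl         ; drop the 4 bytes (destroys hl)&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;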
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look up tables) give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check for the rarer special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they are not needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code does not seem efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations because of all the work needed to write it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so that you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. An instruction set reference that explains the opcodes is useful here.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2    ; placeholder: the displacement byte is overwritten at runtime&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list really quickly without using any extra registers.&lt;br /&gt;
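&lt;br /&gt;
A minimal sketch of that idea, following the label convention of the examples above (the label name is illustrative). It assumes the code runs from RAM and that ix points to the start of the list. ld b,(ix+0) is encoded as $DD, $46 followed by the displacement byte, so the byte to patch sits 2 bytes after the start of the instruction:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemindex),a   ; a = index of the element to read, 0 to 127&lt;br /&gt;
;...code...&lt;br /&gt;
elemindex = $+2&lt;br /&gt;
 ld b,(ix+0)        ; the 0 is only a placeholder, patched with the index above&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;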
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of them if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how parts of the code work after a while without reading them.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (BC itself need not be restored)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;load the negated constant instead&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (but unlike sbc, add hl,de leaves the Zero and Sign flags unchanged)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that the same idea works for many other constants, and de is left untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
Almost never call and return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call. Example: some kind of inline vputs that pops it.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Look up Table ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
Fallthrough looping&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Others ====&lt;br /&gt;
&lt;br /&gt;
Toggling values in loops.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor 2     ; the trick is that xor with (2 xor 0) makes a alternate between 2 and 0; a must survive code1 and code2, and it changes once per pass&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Table alignment&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classical trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters: ASM is generally fast enough, and it is nice to give the user a smaller program that does not use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump is taken (jr takes 12) and jr is faster when the jump is not taken (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call, pc + 3 (the address right after the call) is pushed. You can pop it and use it as a pointer to data placed after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), but 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
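&lt;br /&gt;
A minimal sketch of such an expansion, assuming the TI-83 Plus include file (where penCol and penRow are adjacent bytes, penCol first) and, like Disp above, that _vputs leaves hl just past the zero terminator. The name DispAt and the data layout (column byte, row byte, then the string) are only illustrative:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                  ;assumed layout: pen column, then pen row&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:                      ;hypothetical name&lt;br /&gt;
  pop hl                     ;hl = address of the inline data&lt;br /&gt;
  ld e,(hl)                  ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                  ;d = row&lt;br /&gt;
  inc hl                     ;hl = start of the string&lt;br /&gt;
  ld (penCol),de             ;stores e to penCol and d to penRow (destroys de)&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)                    ;continue right after the terminator&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;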
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
Sometimes you need a short delay between operations such as reads/writes to ports '''''and there is nothing useful to do in the meantime'''''. Because nop's are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nop's for the same time:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states including the ld b (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
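&lt;br /&gt;
A short sketch of such a halt-based delay. The loop count of 10 and the label name are arbitrary, and the real duration depends on the interrupt rate of the calculator, which is not assumed here:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ei                ;halt never wakes up if interrupts are disabled&lt;br /&gt;
  ld b,10           ;number of interrupts to wait for (example value)&lt;br /&gt;
waitLoop:&lt;br /&gt;
  halt              ;sleep until the next interrupt&lt;br /&gt;
  djnz waitLoop&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;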
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This trades memory for speed, so only do it where speed matters, because most of the time size is preferable to speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : 16 + 10 / n T-states per byte copied when unrolling n times -&amp;gt; ~17 T-states per byte copied when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi (outi counts down in b and sets the Zero flag when b reaches zero, so loop with jp nz instead).&lt;br /&gt;
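&lt;br /&gt;
For a size known at assembly time, the jump into the middle of the unrolled ldi loop can be computed by the assembler. A sketch, assuming size is a nonzero constant and that the assembler accepts the &amp;amp; operator and forward label references in expressions (ldi is 2 bytes long, so each skipped ldi moves the entry point by 2):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 jp loopldi + 2*((8 - (size &amp;amp; 7)) &amp;amp; 7)   ;first pass copies size mod 8 bytes (or 8 if that is 0)&lt;br /&gt;
loopldi:&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ;bc is a multiple of 8 from here on, so it reaches 0 exactly at the end of a pass&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
With size = 10, for example, the entry point skips six ldi, the first pass copies 2 bytes and leaves bc = 8, and the second pass copies the rest.&lt;br /&gt;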
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes the counter in de, costs a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states: 13 for each djnz taken, and 8 + 4 + 10 each time b reaches zero)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
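&lt;br /&gt;
A usage sketch for a fixed count of 1000 iterations, assuming the assembler expands these macros inside operands. Since 999 = $03E7, the inner counter is $E7 + 1 = 232 and the outer counter is 3 + 1 = 4, which gives 232 + 3 * 256 = 1000 iterations:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; 4&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
When the count is a multiple of 256 the inner macro evaluates to 256, which has to be loaded into b as 0.&lt;br /&gt;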
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another 16-bit register (de is preserved)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
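&lt;br /&gt;
If a constant execution time matters more than the last T-state, a well-known branch-free variant also leaves de alone. It is shown here as an alternative sketch, not as part of the routines above:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a		; l = low byte of the sum&lt;br /&gt;
	adc a,h		; a = l + h + carry&lt;br /&gt;
	sub l		; a = h + carry&lt;br /&gt;
	ld h,a&lt;br /&gt;
;5 bytes, always 20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;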
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use these, remember that the flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Zero, Sign and Parity/Overflow flags untouched.&lt;br /&gt;
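&lt;br /&gt;
As an illustration of returning a status in a flag, here is a sketch of a search routine that reports failure through the carry. The routine name, labels and calling convention are made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in:  a = value to find, hl = table, b = length (0 means 256)&lt;br /&gt;
; out: carry reset and hl pointing at the match, or carry set if not found&lt;br /&gt;
FindByte:&lt;br /&gt;
  cp (hl)&lt;br /&gt;
  ret z            ;equal values: cp leaves the carry reset&lt;br /&gt;
  inc hl           ;16-bit inc does not touch the flags&lt;br /&gt;
  djnz FindByte    ;djnz does not touch the flags either&lt;br /&gt;
  scf              ;not found: set the carry last&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
;caller&lt;br /&gt;
  call FindByte&lt;br /&gt;
  jr c,notFound&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;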
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a and leaves the carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code size ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at an aligned address in memory such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), you can get the next byte with inc l instead of inc hl.&lt;br /&gt;
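&lt;br /&gt;
A small sketch of this, clearing a 256-byte buffer. The labels buffer and clearLoop are hypothetical, and buffer is assumed to start on a 256-byte boundary:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld hl,buffer     ;low byte of the address is assumed to be $00&lt;br /&gt;
  xor a&lt;br /&gt;
  ld b,a           ;b = 0 makes djnz loop 256 times&lt;br /&gt;
clearLoop:&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
  inc l            ;4 T-states instead of 6 for inc hl, but it changes flags&lt;br /&gt;
  djnz clearLoop&lt;br /&gt;
;l has wrapped back to $00 on exit, h is unchanged&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;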
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them disturb disassembly and make the code harder for other coders to understand.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
Replace the &amp;quot;jr endif&amp;quot; line with the first byte of a 3-byte instruction that has no side effects!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This only takes 31 T-states for the if path instead of 33. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reason not to use this for 1-byte instructions would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1. This will cause a conditional jump to the displacement byte ($FF) which is the rst $38 opcode. &lt;br /&gt;
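&lt;br /&gt;
A sketch of how this can be written, assuming $ evaluates to the address of the jr instruction itself (as in the jr $+2 delay elsewhere on this page):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  or a&lt;br /&gt;
  jr z,$+1         ;assembles to $28,$FF; when taken, the $FF byte runs as rst $38&lt;br /&gt;
;execution continues here either way (after rst $38 returns, if it was called)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This is 2 bytes, where a conditional jump around a separate rst $38 would take 3.&lt;br /&gt;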
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
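Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit (&amp;quot;0&amp;quot; to &amp;quot;9&amp;quot;, &amp;quot;A&amp;quot; to &amp;quot;F&amp;quot;):&lt;br /&gt;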
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
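&lt;br /&gt;
As a usage sketch, the same four instructions can convert a whole byte into two hex digits. The routine names and the register convention (d = high digit, a = low digit) are made up for this example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; in:  a = byte&lt;br /&gt;
; out: d = ASCII high digit, a = ASCII low digit, e = original byte&lt;br /&gt;
ByteToHex:&lt;br /&gt;
	ld e,a		; keep the original byte&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca		; swap the nibbles&lt;br /&gt;
	call NibbleToHex&lt;br /&gt;
	ld d,a		; high digit&lt;br /&gt;
	ld a,e		; fall through for the low digit&lt;br /&gt;
NibbleToHex:&lt;br /&gt;
	and $0F&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf		; carry set when a is 10 or more&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
	ret&lt;br /&gt;
; a = $05: adc gives $35, daa leaves it alone, result &amp;quot;5&amp;quot;&lt;br /&gt;
; a = $0C: adc gives $0C+$30+1 = $3D, daa adds 6, result $43 = &amp;quot;C&amp;quot;&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;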
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T16:58:35Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;a*12&amp;quot; example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller to fit on the calculator. Examples: programs that use a lot of graphics or data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good Z80 algorithms generally use the registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator, or copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use it when there is no other option or when you need optimal execution; disable interrupts first and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is either slower or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
Operations on the ix or iy registers are even slower, and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
As a practical example, imagine that:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply it by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to copy A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; Although this specific case could be even better as follows:&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a		; a*2&lt;br /&gt;
	add a,l		; a*3&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a		; hl=a*3&lt;br /&gt;
	add hl,hl	; hl=a*6&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 8 bytes, 45 clocks&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you both shaved a few clock cycles and saved a byte.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, while add hl,hl is only 1 byte.&lt;br /&gt;
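&lt;br /&gt;
Side by side, with the standard Z80 instruction sizes and timings, the two ways of getting l*4 into hl compare as follows (a worked sketch for l between 0 and 63):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 8-bit shifts first: 6 bytes, 23 T-states&lt;br /&gt;
	sla l		; 2 bytes, 8 T-states&lt;br /&gt;
	sla l&lt;br /&gt;
	ld h,0		; 2 bytes, 7 T-states&lt;br /&gt;
&lt;br /&gt;
; 16-bit shifts: 4 bytes, 29 T-states&lt;br /&gt;
	ld h,0&lt;br /&gt;
	add hl,hl	; 1 byte, 11 T-states&lt;br /&gt;
	add hl,hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;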
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying them only when you really need the speed and the code is already bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to just load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros shifted in. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common for arithmetic on large integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping on the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of great help but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy between a standard and a shadow register. Instead you must use awkward constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it is not. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it thoroughly to ease understanding and debugging later.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt loop that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but little-known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled, because an interrupt would push its return address through sp.&lt;br /&gt;
Again, this should only be used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is through an index register, since all allocated &amp;quot;variables&amp;quot; can then be reached without inc/dec, but this is not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : wastes 1 byte and 2 T-states compared with a pop, but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you save, and at some point you start saving space as well (beyond 10 bytes dropped, where five pops would also take 5 bytes).&lt;br /&gt;
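&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch of a routine with 4 bytes of stack-based locals might look like this (withLocals is a hypothetical name, and hl is destroyed by the cleanup):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
withLocals:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp       ; ix = sp - 4&lt;br /&gt;
 ld sp, ix        ; allocate 4 bytes : locals are (ix + 0) to (ix + 3)&lt;br /&gt;
&lt;br /&gt;
 ld (ix + 0), a   ; store a local&lt;br /&gt;
 inc (ix + 0)     ; operate on it in place&lt;br /&gt;
 ld a, (ix + 0)   ; read it back&lt;br /&gt;
&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; drop the 4 bytes&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;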
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important in writing concise and fast z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly, vector tables (or lookup tables) often give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference to find instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the special and rare cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations only, because of all the work needed).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful.&lt;br /&gt;
Although self modifying code can be applied to any instruction, it is most commonly used with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a   ; a = number of rrca to skip (0..8)&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2            ; the offset byte (0 here) is just a placeholder&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC technique is to patch the displacement of load instructions that use (ix+0), changing the 0 to other values to quickly read and write the nth element of a list without using any extra registers.&lt;br /&gt;
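&lt;br /&gt;
A minimal sketch of this, assuming the code runs from RAM, ix already points at a list of bytes, and the index in a stays within 0..127 because the displacement is signed (elemOffset is a hypothetical label):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemOffset),a   ; patch the displacement byte with the index&lt;br /&gt;
;...code...&lt;br /&gt;
elemOffset = $+2&lt;br /&gt;
 ld a,(ix+0)         ; encoded as DD 7E dd, so the displacement is the third byte&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;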
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after a while away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have exceptions to their use; the comments warn about them. Some tricks apply to other cases too, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every additional ld (dataX),a once the initial overhead is paid, but it runs slower (ldir takes 21 T-states per byte copied, ld (dataX),a takes 13)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;negation of de&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states !&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states, but changes a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states, but changes a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
; -&amp;gt; save 1 byte, but a is not preserved when returning&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
You should almost never call and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not rely on the return address pushed by the call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (limits Number to values below 128)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
Fallthrough looping&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine four times&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine twice&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; Run routine once&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Others ====&lt;br /&gt;
&lt;br /&gt;
Toggling values in loops.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor $02   ; the trick: xor makes a register alternate between two values (here 2 and 0) each time this line runs&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:Table alignment&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classical problem of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give users a smaller program that doesn't take up most of their RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) but there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump is taken (jr takes 12), while jr is faster when it is not taken (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you call a routine, the address of the instruction after the call (pc + 3) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
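&lt;br /&gt;
One possible sketch of that extension, assuming the two bytes before the string hold the pen row and column, that penRow and penCol are the usual TI-83 Plus include file equates, and that _vputs leaves hl pointing just past the terminating zero, as the Disp routine above already relies on (DispAt is a hypothetical name):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                     ;row, column (example values)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl            ;hl points to the inline data&lt;br /&gt;
  ld a,(hl)&lt;br /&gt;
  ld (penRow),a&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld a,(hl)&lt;br /&gt;
  ld (penCol),a&lt;br /&gt;
  inc hl            ;hl now points to the string&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)           ;resume execution after the terminating zero&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;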
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other instructions that are slower but smaller. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nops:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
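&lt;br /&gt;
A small sketch of such a halt-based delay, with an arbitrary count of 10 interrupts and a hypothetical label; the actual duration depends on the interrupt rate configured on the calculator, and the interrupt routine is assumed to preserve b:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	ei		; with interrupts disabled, halt would wait forever&lt;br /&gt;
	ld b,10		; number of interrupts to wait for&lt;br /&gt;
waitLoop:&lt;br /&gt;
	halt		; sleep until the next interrupt&lt;br /&gt;
	djnz waitLoop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;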
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This means you are spending memory to gain speed. Most of the time, though, you will prefer size over speed.&lt;br /&gt;
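&lt;br /&gt;
As a small illustration (a sketch using standard instruction timings and a hypothetical label), here is hl multiplied by 64 with a loop and then unrolled:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; looped: 5 bytes, 146 T-states, destroys b&lt;br /&gt;
	ld b,6&lt;br /&gt;
shiftLoop:&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	djnz shiftLoop&lt;br /&gt;
; unrolled: 6 bytes, 66 T-states, b untouched&lt;br /&gt;
	add hl,hl	; hl*2&lt;br /&gt;
	add hl,hl	; hl*4&lt;br /&gt;
	add hl,hl	; hl*8&lt;br /&gt;
	add hl,hl	; hl*16&lt;br /&gt;
	add hl,hl	; hl*32&lt;br /&gt;
	add hl,hl	; hl*64&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;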
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled : (16 * n + 10) / n T-states per byte copied -&amp;gt; ~17 T-states when unrolling n = 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unrolling of ldi also works with ldd and outi. With outi the counter is b and the Zero flag replaces P/V, so the loop ends with jp nz.&lt;br /&gt;
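&lt;br /&gt;
As an illustration, here is a sketch of the same idea with outi. The names src, port and count are placeholders; it assumes count is a multiple of 4 between 4 and 252 and that the device behind the port needs no delay between writes:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld c,port&lt;br /&gt;
 ld b,count   ; outi counts with b, not bc&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi         ; (hl) -&amp;gt; port c, inc hl, dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz, loopouti   ; outi sets the Zero flag when b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;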
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a few more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states); the 16-bit count is passed in de&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
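&lt;br /&gt;
As a worked example (the count of 600 is arbitrary): 600 - 1 = 599 = $0257, so the inner counter is $57 + 1 = 88 and the outer counter is $02 + 1 = 3. The loop then runs 88 + 2 * 256 = 600 times:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, 88    ; inner_counter8(600)&lt;br /&gt;
  ld  d, 3     ; outer_counter8(600)&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2   ; first pass: 88 iterations, later passes: 256 each&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;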
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using a 16-bit register&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see these are extremely simple, small and fast ways to alter flags&lt;br /&gt;
which make them interesting as output of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
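&lt;br /&gt;
A small sketch of a routine that reports its status in the carry flag; the routine name isDigit and the label notDigit are made up for this illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = character&lt;br /&gt;
; output: carry reset if a is in '0'..'9', carry set otherwise; a is preserved&lt;br /&gt;
isDigit:&lt;br /&gt;
	cp '0'&lt;br /&gt;
	ret c		; a &amp;lt; '0' : carry is already set&lt;br /&gt;
	cp '9'+1	; carry set if a &amp;lt;= '9'&lt;br /&gt;
	ccf		; invert it : carry set only if a &amp;gt; '9'&lt;br /&gt;
	ret&lt;br /&gt;
&lt;br /&gt;
; the caller tests the flag instead of a register&lt;br /&gt;
	call isDigit&lt;br /&gt;
	jr c,notDigit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;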
&lt;br /&gt;
If you use these, remember that the flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a special combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched (besides the carry they only change H and N).&lt;br /&gt;
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (alters a, and the carry ends up inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to measure your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a z80 instruction set reference at hand so you do not forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]]).&lt;br /&gt;
* Use an assembler that has an &amp;quot;.echo&amp;quot; directive and use it in the source to count size (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice IDE of z80 that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If the table is at most 256 bytes long ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl (4 T-states instead of 6), because the high byte never changes.&lt;br /&gt;
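&lt;br /&gt;
The same alignment also makes indexed lookups cheap. A sketch, assuming a hypothetical 256-byte table named table whose address has a low byte of $00 (the syntax for taking the high byte of an address differs between assemblers):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = index 0..255&lt;br /&gt;
	ld h,table &amp;gt;&amp;gt; 8	; high byte of the table address&lt;br /&gt;
	ld l,a			; the index is the low byte, no 16-bit add needed&lt;br /&gt;
	ld a,(hl)		; a = table[index]&lt;br /&gt;
	inc l			; next entry, valid as long as you stay inside the table&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;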
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended because some of them confuse disassemblers and even coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the else code is a single 2-byte instruction:&lt;br /&gt;
instead of the &amp;quot;jr endif&amp;quot; line, you emit only the first byte of a 3-byte instruction with no side effects, which swallows the else instruction as its operand!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
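The never-taken jp costs 10 T-states instead of 12 for jr endif, and the code is 1 byte smaller. A small saving, but it could be useful in tight loops!&lt;br /&gt;
The same idea works when the else code is a 1-byte instruction, by using the first byte of a 2-byte instruction whose side effects you can tolerate; the only reasons not to use it are code readability and bug safety. Watch those flags!&lt;br /&gt;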
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc with a displacement of -1 (cc being any condition jr accepts). This causes a conditional jump onto the displacement byte itself ($FF), which is the rst $38 opcode.&lt;br /&gt;
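&lt;br /&gt;
For instance, with the carry condition (jr c is opcode $38), the whole conditional rst is just two bytes. This is a sketch; some assemblers may want the bytes spelled out with .db instead:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	jr c,$+1	; assembles to $38,$FF : if carry, execution lands on the $FF byte = rst $38&lt;br /&gt;
; equivalent raw bytes:&lt;br /&gt;
;	.db $38,$FF&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;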
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into the corresponding ASCII hexadecimal digit.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
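&lt;br /&gt;
To see why it works: for a = 12 the cp/ccf pair sets the carry, adc gives $3D and daa adjusts it to $43, which is 'C'; for a = 7 the carry is clear, adc gives $37 and daa changes nothing. A sketch of how the trick could be used to convert a whole byte; the routine names and the choice of d and e for the result are invented for the illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; input: a = byte&lt;br /&gt;
; output: d = ASCII digit of the high nibble, e = ASCII digit of the low nibble&lt;br /&gt;
byteToHex:&lt;br /&gt;
	ld e,a		; keep a copy of the byte&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca		; high nibble moved to the low 4 bits&lt;br /&gt;
	and $0F&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld d,a&lt;br /&gt;
	ld a,e&lt;br /&gt;
	and $0F&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ret&lt;br /&gt;
nibbleToHex:&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;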
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T16:52:11Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Improved &amp;quot;fallthrough looping&amp;quot; example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or you need to make your game smaller so it fits on the calculator. Examples: programs that are heavy on graphics or data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to shave off a few bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good z80 algorithms generally use each register for what it does best.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliary to the accumulator and copy of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to a table in memory/saved copy of hl/pointer to memory when hl and de are in use&lt;br /&gt;
* iy - index register/pointer to a table in memory (use it when there is no other option or you need optimal execution; disable interrupts and restore the original value on exit because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The z80 processor is faster at operations on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte longer.&lt;br /&gt;
If you use the ix or iy registers, operations are slower still and every instruction is at least 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values range from 0 to 63, and the routine needs to multiply the accumulator by, say, 12, so the result (up to 756) has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now on the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is pass A (the accumulator) right in the start to HL&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; hey, minus 7 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you only shaved a few clock cycles, but sometimes you can save some bytes, too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example if the value is passed in l and l is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0 &amp;quot; to multiply l by four and then use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes and 8 T-states, while add hl,hl is only 1 byte but 11 T-states.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code hard to follow, so comment them.&lt;br /&gt;
I recommend applying this optimization only when you really need the speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplying by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, with zeros entering on the right. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using the undocumented ixh/ixl/iyh/iyl registers and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use the i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are fairly common for arithmetic on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be of a great help but they come with two drawbacks :&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally a good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately it is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces ugly code that is very hard to follow, so comment it well for later understanding and debugging.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is used in desperate situations, generally inside an interrupt routine that demands as much speed as possible when all the normal registers are already in use (notably in James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled because accepting one pushes a return address at sp.&lt;br /&gt;
Mind again that this is only used as a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on stack is to use an index register since all allocated &amp;quot;variables&amp;quot; can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated with fixed storage locations (and generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 bit k, (ix + n)   ; k is an immediate value in 0..7 (bit only takes 20 T-states)&lt;br /&gt;
 set k, (ix + n)&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (hl equivalent instructions are generally 2 bytes smaller and 12 T-states faster but do not allow indexing, so they may require intermediate inc/dec).&lt;br /&gt;
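&lt;br /&gt;
Putting allocation and indexed access together, here is a sketch of a routine that keeps 4 bytes of local variables on the stack; the routine name and the use made of the first byte are invented for the illustration, and hl is destroyed:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
withLocals:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp&lt;br /&gt;
 ld sp, ix        ; 4 bytes allocated, reachable as (ix + 0) .. (ix + 3)&lt;br /&gt;
 ld (ix + 0), a   ; store a local&lt;br /&gt;
 inc (ix + 0)     ; work on it in place&lt;br /&gt;
 ld a, (ix + 0)&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; release the 4 bytes&lt;br /&gt;
 pop ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;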
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, i.e. more than 5 pop)&lt;br /&gt;
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast z80 code. Then comes general optimization.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and solving it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in z80 assembly vector tables (or look-up tables) often give smaller and faster code than blocks of comparisons and jumps. Other times using a chunk of data for a task is better than a more usual programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
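&lt;br /&gt;
As a quick illustration of a vector table, here is a sketch that dispatches on a command number in a; jumpTable and the handler labels are hypothetical, and it assumes a is a valid index below 128:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	add a,a			; 2 bytes per table entry&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld d,0&lt;br /&gt;
	ld hl,jumpTable&lt;br /&gt;
	add hl,de		; hl points to the entry&lt;br /&gt;
	ld a,(hl)&lt;br /&gt;
	inc hl&lt;br /&gt;
	ld h,(hl)&lt;br /&gt;
	ld l,a			; hl = handler address (stored low byte first)&lt;br /&gt;
	jp (hl)&lt;br /&gt;
jumpTable:&lt;br /&gt;
	.dw handler0&lt;br /&gt;
	.dw handler1&lt;br /&gt;
	.dw handler2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;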
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of the code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first. In other words, check the rare special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use in desperate situations because of all the work needed to write it).&lt;br /&gt;
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and sometimes even smaller than higher level languages, further optimization may not be needed.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. An instruction set reference that explains the opcodes is useful here.&lt;br /&gt;
Although any instruction can be modified this way, it is most common with instructions that load constants into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value to be used later (usually seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often combined with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
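 ld (jpmodify),a   ; a = number of rrca to skip (0..8)&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $+2            ; the displacement byte is a placeholder: 0 falls through to the first rrca&lt;br /&gt;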
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another use of SMC is to modify load instructions that use (ix+0), changing the 0 to other values to read and write the nth element of a list really quickly without using any extra registers.&lt;br /&gt;
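&lt;br /&gt;
A sketch of that last idea; the label elemOffset is made up, the code must run from RAM, and a is assumed to hold the offset as a signed byte:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (elemOffset),a   ; patch the displacement byte&lt;br /&gt;
;...code...&lt;br /&gt;
elemOffset = $+2     ; ld a,(ix+d) is encoded as $DD,$7E,d&lt;br /&gt;
 ld a,(ix+0)         ; the 0 is just a placeholder&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;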
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to optimize your algorithm and register allocation first if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from that part of the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and have exceptions to their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need for a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
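    ld de,-767 ;add the negated constant instead&lt;br /&gt;
    add hl,de  ;note: carry comes out inverted compared to sbc, and the zero flag is not updated&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;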
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that the same idea works for many other constants (and leaves de free)&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100     ;a = 0 when falling through, but a is also altered when returning&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
There is almost never a need to call and then immediately return:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call. Example: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be below 128 and must be a valid index into the table) &lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
'''Fallthrough looping'''&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although this specific case would be even better (same speed but 3 bytes smaller) as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar2     ; Run routine twice, then fall through into bar2 for two more runs&lt;br /&gt;
bar2:&lt;br /&gt;
  call bar      ; Run routine once, then fall through into bar for one more run&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; The routine itself&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Others ====&lt;br /&gt;
&lt;br /&gt;
'''Toggling values in loops'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor 2     ; the trick is that xor makes a register alternate between two values: a flips between 2 and 0 each time this runs (code1 and code2 must preserve a)&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Table alignment'''&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in l and the high byte of the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic trade-off of optimization in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it doesn't. So jp is faster when the jump is taken and jr is faster when it is not.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; -&amp;gt; save 1 byte (and preserve a) at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
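When you call a routine, the address of the instruction after the call (pc + 3) is pushed on the stack. The routine can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;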
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (no ld hl,string), at the cost of 5 bytes of overhead (the Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
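&lt;br /&gt;
A minimal sketch of that extension follows. It assumes that the TI-83 Plus include file defines penCol and penRow as adjacent bytes (penCol first) and that _vputs leaves hl just past the terminating zero, which the Disp routine above already relies on. The DispAt name, the coordinates and the text are made up for illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                 ;inline coordinates: penCol, penRow (arbitrary values)&lt;br /&gt;
  .db &amp;quot;Some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                    ;hl points to the inline data&lt;br /&gt;
  ld e,(hl)                 ;e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                 ;d = row&lt;br /&gt;
  inc hl                    ;hl points to the start of the string&lt;br /&gt;
  ld (penCol),de            ;e goes to penCol, d to penRow (assumes they are adjacent)&lt;br /&gt;
  bcall(_vputs)             ;assumed to leave hl after the 0 terminator&lt;br /&gt;
  jp (hl)                   ;resume execution right after the string&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;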
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations, like reads/writes to ports, '''''and there is nothing useful to do'''''. Because nops are not very size friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take fewer bytes than nops for the same time:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable), counting the ld b&lt;br /&gt;
	ld b,255	; number of iterations (0 means 256)&lt;br /&gt;
	djnz $		; 13 T-states per jump taken, 8 on the final pass&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of games or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
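&lt;br /&gt;
As a sketch of the halt-based delay from the last note, the following waits for a number of interrupts. The count of 4 is an arbitrary assumption, and the real duration depends on which interrupt sources are enabled and on the timer speed in effect:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ei            ;halt never wakes up if interrupts are disabled&lt;br /&gt;
 ld b,4        ;number of interrupts to wait for (assumed value)&lt;br /&gt;
waitframe:&lt;br /&gt;
 halt          ;low power mode until the next interrupt&lt;br /&gt;
 djnz waitframe&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;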
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop by repeating its body several times instead of looping. This is done frequently in math routines such as multiplication.&lt;br /&gt;
This means you are spending memory to gain speed, so do it only where speed matters more than size.&lt;br /&gt;
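&lt;br /&gt;
As an illustration, here is a sketch of an 8-bit by 8-bit multiplication (hl = h * e) in looped and unrolled form. The routine and its register use are assumptions chosen for this example, not something taken from the rest of this page. Unrolling grows the code from 11 to 35 bytes, but removes the ld b,8 and the eight djnz, which saves 106 T-states and leaves b free:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;looped&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d        ;hl = multiplier*256, de = multiplicand&lt;br /&gt;
 ld b,8&lt;br /&gt;
mulloop:&lt;br /&gt;
 add hl,hl     ;shift the next multiplier bit into carry&lt;br /&gt;
 jr nc,$+3     ;skip the 1-byte add if that bit was 0&lt;br /&gt;
 add hl,de&lt;br /&gt;
 djnz mulloop&lt;br /&gt;
&lt;br /&gt;
;unrolled: the same three instructions repeated 8 times&lt;br /&gt;
 ld d,0&lt;br /&gt;
 ld l,d&lt;br /&gt;
 add hl,hl     ;bit 7&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 6&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 5&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 4&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 3&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 2&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 1&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 add hl,hl     ;bit 0&lt;br /&gt;
 jr nc,$+3&lt;br /&gt;
 add hl,de&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;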
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : 16 + 10/n T-states per byte copied -&amp;gt; 17.25 T-states per byte copied when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
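This unrolling of ldi also works with ldd, and with outi if the loop is closed with jp nz instead (outi counts in b and sets the zero flag when b reaches 0).&lt;br /&gt;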
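&lt;br /&gt;
A sketch of the outi variant, unrolled 4 times. The names src, PORT and size are placeholders, and size is assumed to be a multiple of 4 so that b reaches zero exactly on the last outi of a group. Some ports need a delay between writes, in which case this tight sequence does not apply as is:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld c,PORT         ;hypothetical port number&lt;br /&gt;
 ld b,size         ;multiple of 4 (0 means 256)&lt;br /&gt;
loopouti:&lt;br /&gt;
 outi              ;out (c),(hl) then inc hl and dec b&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 outi&lt;br /&gt;
 jp nz,loopouti    ;outi sets the zero flag when b reaches 0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;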
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16-bit counter:&lt;br /&gt;
* the naive one, which results in smaller code but has a higher loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes the count in de and a couple more bytes, but has a much lower overhead (13 T-states per iteration, plus 9 more once every 256 iterations)&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros to compute the two 8-bit loop counters from a 16-bit count if you want to do the conversion at compile time:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
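&lt;br /&gt;
As a worked example of those macros, assume a loop count of 1000 (an arbitrary value chosen for illustration): 1000 - 1 = 999 = $03E7, so the inner counter is $E7 + 1 = $E8 and the outer counter is $03 + 1 = $04. Note that when the low byte of the count is zero the inner macro evaluates to 256, which has to be loaded as 0 for djnz to treat it as 256.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, $E8      ; inner counter: 232 passes before the first dec d&lt;br /&gt;
  ld  d, $04      ; outer counter: 4 runs of the djnz loop&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
; the body runs 232 + 3 * 256 = 1000 times&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;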
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines compare b to 0, same size and speed but the second preserves accumulator&lt;br /&gt;
; remarks: - inc/dec doesn't affect carry flag&lt;br /&gt;
;          - inc/dec doesn't affect any flags on 16-bit registers, so do not extrapolate to 16-bit registers.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair (destroys a)&lt;br /&gt;
;normal way:&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, reset Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f:&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as outputs of routines to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so if you need a special combination of flags it might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
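&lt;br /&gt;
For instance, a routine could report its status through two flags at once. The following sketch sets the other flag first and the carry last. Which meaning each flag carries is up to the caller and is not defined anywhere on this page:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; return with Zero set and Carry set&lt;br /&gt;
 cp a          ; Zero set, Carry reset&lt;br /&gt;
 scf           ; Carry set, Zero left alone&lt;br /&gt;
 ret&lt;br /&gt;
&lt;br /&gt;
; return with Zero reset and Carry set&lt;br /&gt;
 or 1          ; Zero reset, Carry reset (alters a)&lt;br /&gt;
 scf           ; Carry set, Zero left alone&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;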
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;get the zero flag in carry &lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (destroys a and leaves carry inverted).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then check these:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you don't forget a useful instruction or which flags it affects. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to count size: (see [[Assemblers|Assemblers]])&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a nice Z80 IDE that counts code ([[IDEs|IDE's]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]]) (see wabbitemu)&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at a 256-byte aligned address in memory such as $8000 (theoretical example). If you only use 256 bytes ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl.&lt;br /&gt;
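&lt;br /&gt;
A small sketch of this, assuming a 256-byte table at the theoretical address $8000 from the text. It adds up all 256 bytes into a as a simple checksum. The checksum is only there to give the loop something to do:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,$8000   ;table start, low byte is 0&lt;br /&gt;
 xor a         ;a = running sum&lt;br /&gt;
 ld b,a        ;b = 0, so djnz loops 256 times&lt;br /&gt;
sumloop:&lt;br /&gt;
 add a,(hl)&lt;br /&gt;
 inc l         ;4 T-states instead of 6 for inc hl; h never changes&lt;br /&gt;
 djnz sumloop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;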
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot;, hacks and obscure optimization's tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them confuse disassemblers and even coders trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the Else code is a single 2-byte instruction:&lt;br /&gt;
You use the first byte of a 3 byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, it now executes a jp nz,XXXX instruction where the XXXX is the two bytes of the next instruction. You already know what the flags will be here, so you can make the jump never taken. You can use this to skip the next two bytes of execution! Who needs to branch over it?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The if branch now takes 31 T-states instead of 33, because the never-taken jp costs 10 T-states where jr endif cost 12. A small saving, but it could be useful in tight loops, and it saves 1 byte!&lt;br /&gt;
The only reasons not to use this would be code readability and bug safety. Watch those flags!&lt;br /&gt;
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1 (a jr whose displacement byte is $FF). This causes a conditional jump to the displacement byte itself, and $FF is the rst $38 opcode. &lt;br /&gt;
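&lt;br /&gt;
A sketch of the encoding, using the zero flag as the condition (any jr condition works the same way). The trick is written as raw bytes here to make the overlap visible. An assembler where $ is the address of the current instruction should produce the same bytes from jr z,$+1:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of (3 bytes)&lt;br /&gt;
 jr nz,skip&lt;br /&gt;
 rst $38&lt;br /&gt;
skip:&lt;br /&gt;
;try this (2 bytes)&lt;br /&gt;
 .db $28,$FF   ;jr z,$+1 : when taken, execution lands on the $FF byte, which is rst $38&lt;br /&gt;
;rst $38 pushes the address of the next instruction, so execution continues here on return&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;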
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
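Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into the matching ASCII hexadecimal digit ('0'-'9', 'A'-'F').&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10		;carry set if a &amp;lt; 10&lt;br /&gt;
	ccf		;carry set if a &amp;gt;= 10&lt;br /&gt;
	adc a, 30h	;'0'-'9' for 0-9, $3B-$40 for 10-15&lt;br /&gt;
	daa		;adds 6 to the latter, giving 'A'-'F'&lt;br /&gt;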
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Optimization</id>
		<title>Z80 Optimization</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Optimization"/>
				<updated>2015-08-31T16:46:24Z</updated>
		
		<summary type="html">&lt;p&gt;Einar: Fixed &amp;quot;cp 9&amp;quot; example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Sometimes you need some extra speed in ASM, or need to make your game smaller so that it fits on the calculator. Examples: programs that are heavy on graphics/data, and graphics code for mapping, grayscale and 3D graphics.&lt;br /&gt;
&lt;br /&gt;
If you are just looking to cut some bytes, go straight to the small tricks in this topic.&lt;br /&gt;
&lt;br /&gt;
== Registers and Memory ==&lt;br /&gt;
Good algorithms on the Z80 generally use registers appropriately.&lt;br /&gt;
It is also good practice to keep a convention and plan how you are going to use the registers.&lt;br /&gt;
&lt;br /&gt;
General use of registers:&lt;br /&gt;
* a - 8-bit accumulator&lt;br /&gt;
* b - counter&lt;br /&gt;
* c,d,e,h,l - auxiliaries to the accumulator and copies of b or a&lt;br /&gt;
&lt;br /&gt;
* hl - 16-bit accumulator/pointer to a memory address&lt;br /&gt;
* de - pointer to a destination memory address&lt;br /&gt;
* bc - 16-bit counter&lt;br /&gt;
* ix - index register/pointer to table in memory/save copy of hl/pointer to memory when hl and de are being used&lt;br /&gt;
* iy - index register/pointer to table in memory (use when there is no other option or you need optimal execution; disable interrupts and restore the original value on exit, because TI-OS uses it)&lt;br /&gt;
&lt;br /&gt;
=== 8-bit vs. 16-bit Operations ===&lt;br /&gt;
&lt;br /&gt;
The Z80 processor operates faster on 8-bit values.&lt;br /&gt;
Code dealing with 16-bit registers tends to be bigger and slower, because the equivalent 16-bit instruction is slower, or does not exist and has to be replaced with several instructions. Sometimes the equivalent 16-bit instruction is also 1 byte bigger.&lt;br /&gt;
If you use the ix or iy registers, operations are even slower and each instruction is always 1 byte bigger. So try to convert your code to use hl and de instead of ix and iy.&lt;br /&gt;
&lt;br /&gt;
In a practical example, imagine:&lt;br /&gt;
- you pass a value to a routine through the accumulator&lt;br /&gt;
- the only valid values of the accumulator range from 0 to 63, and the routine needs to multiply it by, say, 12, so the result has to be stored in a 16-bit register pair&lt;br /&gt;
- but you can multiply a by 4 before overflowing (63*4 = 252, which is smaller than 255) and take advantage of this to optimize&lt;br /&gt;
&lt;br /&gt;
Now the code:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; The most usual way is to move A (the accumulator) into HL right at the start&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld l,a&lt;br /&gt;
	add a,a&lt;br /&gt;
	ld d,h&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl	; hl=a*12&lt;br /&gt;
; 9 bytes, 56 clocks&lt;br /&gt;
&lt;br /&gt;
; But given a is between 0 and 63 you can multiply by 4 without overflowing the 8-bit limit (255)&lt;br /&gt;
	add a,a&lt;br /&gt;
	add a,a		; a*4&lt;br /&gt;
	ld l,a&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ld h,0&lt;br /&gt;
	ld d,h		; hl=a*4 and de=a*4&lt;br /&gt;
	add hl,hl	; hl=a*8&lt;br /&gt;
	add hl,de	; hl=a*12&lt;br /&gt;
; 9 bytes, 49 clocks&lt;br /&gt;
&lt;br /&gt;
; hey, minus 7 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example you only shaved off a few clock cycles, but sometimes you can save some bytes, too.&lt;br /&gt;
You can do this with registers other than the accumulator.&lt;br /&gt;
&lt;br /&gt;
For example, if the value is passed in l and l is always lower than 64, you can do &amp;quot; sla l \ sla l \ ld h,0	&amp;quot; to multiply l by four and use hl for 16-bit operations. In this case you are trading size for speed: each sla instruction is 2 bytes, while add hl,hl is only 1 byte.&lt;br /&gt;
&lt;br /&gt;
Mind that these optimizations can introduce bugs and make the code somewhat hard to follow, so comment them.&lt;br /&gt;
I recommend applying this kind of optimization only when you really need speed and the code is bug free.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One common trick for multiplication by 256 is to simply load the low byte register into the high byte register. This works because in binary a multiplication by 256 is a shift left by 8 bits, filling with zeros. Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; multiply a by 256 and store in hl&lt;br /&gt;
	ld h,a&lt;br /&gt;
	ld l,0&lt;br /&gt;
; multiply hl by 256 and store in ade (pseudo 24-bit pair register)&lt;br /&gt;
	ld a,h&lt;br /&gt;
	ld d,l&lt;br /&gt;
	ld e,0&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are out of registers, try using ixh/ixl/iyh/iyl  and even the i register for loop counters instead of maintaining a counter in memory or pushing/popping an already used register to the stack inside a loop. Using ixh/ixl/iyh/iyl will break compatibility with the TI-84+SE emulated by the Nspire. You can only use i register for other purposes if you disable interrupts first (di).&lt;br /&gt;
&lt;br /&gt;
=== Shadow registers ===&lt;br /&gt;
&lt;br /&gt;
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'&lt;br /&gt;
&lt;br /&gt;
These registers behave like their &amp;quot;standard&amp;quot; counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ex af, af'  ; swaps af and af' as the mnemonic indicates&lt;br /&gt;
&lt;br /&gt;
 exx         ; swaps bc, de, hl and bc', de', hl'&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers are somewhat common for arithmetic operations on big integers (16-bit to 32-bit) or BCD operations without relying on RAM storage or pushing and popping to the stack. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
MUL32:&lt;br /&gt;
        DI&lt;br /&gt;
        AND     A               ; RESET CARRY FLAG&lt;br /&gt;
        SBC     HL,HL           ; LOWER RESULT = 0&lt;br /&gt;
        EXX&lt;br /&gt;
        SBC     HL,HL           ; HIGHER RESULT = 0&lt;br /&gt;
        LD      A,B             ; MPR IS AC'BC&lt;br /&gt;
        LD      B,32            ; INITIALIZE LOOP COUNTER&lt;br /&gt;
MUL32LOOP:&lt;br /&gt;
        SRA     A               ; RIGHT SHIFT MPR&lt;br /&gt;
        RR      C&lt;br /&gt;
        EXX&lt;br /&gt;
        RR      B&lt;br /&gt;
        RR      C               ; LOWEST BIT INTO CARRY&lt;br /&gt;
        JR      NC,MUL32NOADD&lt;br /&gt;
        ADD     HL,DE           ; RESULT += MPD&lt;br /&gt;
        EXX&lt;br /&gt;
        ADC     HL,DE&lt;br /&gt;
        EXX&lt;br /&gt;
MUL32NOADD:&lt;br /&gt;
        SLA     E               ; LEFT SHIFT MPD&lt;br /&gt;
        RL      D&lt;br /&gt;
        EXX&lt;br /&gt;
        RL      E&lt;br /&gt;
        RL      D&lt;br /&gt;
        DJNZ    MUL32LOOP&lt;br /&gt;
        EXX&lt;br /&gt;
       &lt;br /&gt;
; RESULT IN H'L'HL&lt;br /&gt;
        RET&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shadow registers can be a great help, but they come with two drawbacks:&lt;br /&gt;
&lt;br /&gt;
* they cannot coexist with the &amp;quot;standard&amp;quot; registers: you cannot use ld to copy from a standard register to a shadow register or vice versa. Instead you must use awkward constructs such as:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; loads hl' with the contents of hl&lt;br /&gt;
 push hl&lt;br /&gt;
 exx&lt;br /&gt;
 pop hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines, and an interrupt handler that uses them would overwrite your values. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately, this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld a, i  ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point&lt;br /&gt;
  push af  ; save flags&lt;br /&gt;
  di       ; disable interrupts&lt;br /&gt;
  &lt;br /&gt;
  ; do something with shadow registers here&lt;br /&gt;
&lt;br /&gt;
  pop af   ; get back flags&lt;br /&gt;
  ret po   ; po = P/V reset so in this case it means interrupts were disabled before the routine was called&lt;br /&gt;
  ei       ; re-enable interrupts&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that this produces code that is ugly and very hard to follow, so comment it thoroughly to help with understanding and debugging later.&lt;br /&gt;
&lt;br /&gt;
=== SP register ===&lt;br /&gt;
&lt;br /&gt;
This register is only repurposed in desperate situations, generally inside an interrupt loop that demands as much speed as possible while all the normal registers are already in use (a notable example is James Montelongo's 4-level grayscale interlace in graylib2.inc).&lt;br /&gt;
You need to know these valid but not widely known instructions:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld sp,6&lt;br /&gt;
 add hl,sp&lt;br /&gt;
 sbc hl,sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 dec sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now an example of such a situation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;init hl,de,bc,a&lt;br /&gt;
 ld sp,6&lt;br /&gt;
loop:&lt;br /&gt;
;code&lt;br /&gt;
 add hl,sp  ;get next row of a table for example&lt;br /&gt;
;code using bc,de,ix,a&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 jp nz,loop&lt;br /&gt;
;code&lt;br /&gt;
 ld sp,(saveSP)&lt;br /&gt;
 ret    ;finish interrupt&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
When you use sp this way, you cannot push/pop registers, no calls are allowed, and interrupts must stay disabled, because an interrupt would push its return address wherever sp currently points.&lt;br /&gt;
Again, keep in mind that this is only a last resort. Don't forget to save and restore sp as the example shows.&lt;br /&gt;
&lt;br /&gt;
=== Stack ===&lt;br /&gt;
&lt;br /&gt;
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.&lt;br /&gt;
&lt;br /&gt;
==== Allocation ====&lt;br /&gt;
&lt;br /&gt;
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.&lt;br /&gt;
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes&lt;br /&gt;
 ld hl, -7&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Access ====&lt;br /&gt;
&lt;br /&gt;
The most common way of accessing data allocated on the stack is to use an index register, since all allocated &amp;quot;variables&amp;quot; can then be accessed without inc/dec, but this is obviously not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 ld c, (ix + n)   ; n is an immediate value in -128..127&lt;br /&gt;
 &lt;br /&gt;
 ; 4 bytes, 17 T-states, destroys a&lt;br /&gt;
 ld a, (somelocation)&lt;br /&gt;
 ld c, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; 3 bytes, 19 T-states&lt;br /&gt;
 cp (ix + n)&lt;br /&gt;
&lt;br /&gt;
 sub (ix + n)&lt;br /&gt;
 sbc a, (ix + n)&lt;br /&gt;
 add a, (ix + n)&lt;br /&gt;
 adc a, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 and (ix + n)&lt;br /&gt;
 or (ix + n)&lt;br /&gt;
 xor (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 3 bytes, 23 T-states&lt;br /&gt;
 inc (ix + n)&lt;br /&gt;
 dec (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 23 T-states&lt;br /&gt;
 rl (ix + n)&lt;br /&gt;
 rr (ix + n)&lt;br /&gt;
 rlc (ix + n)&lt;br /&gt;
 rrc (ix + n)&lt;br /&gt;
 sla (ix + n)&lt;br /&gt;
 sra (ix + n)&lt;br /&gt;
 sll (ix + n)      ; undocumented instruction&lt;br /&gt;
 srl (ix + n)&lt;br /&gt;
 set k, (ix + n)   ; k is an immediate value in 0..7&lt;br /&gt;
 res k, (ix + n)&lt;br /&gt;
&lt;br /&gt;
 ; 4 bytes, 20 T-states&lt;br /&gt;
 bit k, (ix + n)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).&lt;br /&gt;
&lt;br /&gt;
==== Deallocation ====&lt;br /&gt;
&lt;br /&gt;
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...&lt;br /&gt;
 inc sp&lt;br /&gt;
 inc sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop&lt;br /&gt;
 ld hl, 16&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt; &lt;br /&gt;
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, since each pop costs 1 byte and drops 2).&lt;br /&gt;
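&lt;br /&gt;
Putting allocation, access and deallocation together, a minimal sketch of a routine with a 4-byte stack frame could look like the following. The routine name frameDemo and the values stored are made up for illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
frameDemo:&lt;br /&gt;
 push ix          ; preserve the caller's ix&lt;br /&gt;
 ld ix, -4&lt;br /&gt;
 add ix, sp       ; ix = sp - 4 = base of the new frame&lt;br /&gt;
 ld sp, ix        ; allocate 4 bytes : (ix + 0) to (ix + 3)&lt;br /&gt;
&lt;br /&gt;
 ld (ix + 0), 12  ; first local variable&lt;br /&gt;
 ld (ix + 1), 30  ; second local variable&lt;br /&gt;
 ld a, (ix + 0)&lt;br /&gt;
 add a, (ix + 1)  ; a = 12 + 30&lt;br /&gt;
&lt;br /&gt;
 ld hl, 4&lt;br /&gt;
 add hl, sp&lt;br /&gt;
 ld sp, hl        ; drop the 4-byte frame (destroys hl)&lt;br /&gt;
 pop ix           ; restore the caller's ix&lt;br /&gt;
 ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;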
&lt;br /&gt;
== General Algorithms ==&lt;br /&gt;
&lt;br /&gt;
Register and memory use is very important for writing concise and fast Z80 code. General optimization comes next.&lt;br /&gt;
&lt;br /&gt;
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.&lt;br /&gt;
&lt;br /&gt;
Do not forget that in Z80 assembly, vector tables (or lookup tables) often give smaller and faster code than blocks of comparisons and jumps. At other times, using a chunk of data for a task is better than a more conventional programming method (notably in graphics screen effects).&lt;br /&gt;
See [[Z80 Good Programming Practices]] for examples.&lt;br /&gt;
&lt;br /&gt;
Browse a complete instruction set reference for instructions that could optimize some part of your code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A list of things to keep in mind:&lt;br /&gt;
* Rework conditionals to be more efficient.&lt;br /&gt;
* Make sure the most common checks come first; in other words, check the rarer special cases last.&lt;br /&gt;
* Move special-case checks out of the main loop if they aren't needed there.&lt;br /&gt;
* Rearrange program flow.&lt;br /&gt;
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.&lt;br /&gt;
* When your code seems unlikely to be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.&lt;br /&gt;
* Rewriting code from scratch can bring new ideas (use this in desperate situations only, because of all the work involved).&lt;br /&gt;
* Remember that it is almost always better to leave optimization until the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed at all.&lt;br /&gt;
* Document wacky optimizations so you can understand the code later (Z80 optimization leads to code that is very hard to understand).&lt;br /&gt;
&lt;br /&gt;
== Self Modifying Code ==&lt;br /&gt;
&lt;br /&gt;
If your code is in RAM, it can write to itself to change its own instructions. Having an instruction set reference that lists the opcodes is useful here.&lt;br /&gt;
Although self-modifying code can be applied to any instruction, it is most commonly used to patch the constants loaded into registers.&lt;br /&gt;
&lt;br /&gt;
Generally it is used to save a value for later use (often seen with masks). Examples:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (savemask),a&lt;br /&gt;
;...code...&lt;br /&gt;
savemask = $+1&lt;br /&gt;
 ld a,$00   ; $00 is just a placeholder&lt;br /&gt;
&lt;br /&gt;
 ld (something),hl&lt;br /&gt;
;... code&lt;br /&gt;
something = $+1&lt;br /&gt;
 ld de,$0000&lt;br /&gt;
&lt;br /&gt;
 ld (saveSP),sp&lt;br /&gt;
;... code ...&lt;br /&gt;
saveSP = $+1&lt;br /&gt;
 ld sp,$0000  ; restore sp&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SMC (Self Modifying Code) is often used with unrolling and relative jumps. Example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (jpmodify),a&lt;br /&gt;
;...&lt;br /&gt;
jpmodify = $+1&lt;br /&gt;
 jr $00&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 rrca&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another SMC technique is to modify load instructions that use (ix+0), changing the 0 to other values in order to read and write the nth element of a list very quickly without using any extra registers.&lt;br /&gt;
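&lt;br /&gt;
A minimal sketch of this idea, assuming ix already points to the list and a holds an index in 0..127 (the displacement is a signed byte); the label getElement is made up for illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld (getElement+2),a   ; patch the displacement, which is the third byte of ld a,(ix+d)&lt;br /&gt;
;...code...&lt;br /&gt;
getElement:&lt;br /&gt;
 ld a,(ix+0)           ; the 0 is just a placeholder, replaced at run time&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;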
&lt;br /&gt;
== Small Tricks ==&lt;br /&gt;
&lt;br /&gt;
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest and smallest code.&lt;br /&gt;
&lt;br /&gt;
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. It is easy to forget how they work after some time away from the code.&lt;br /&gt;
&lt;br /&gt;
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; the comments warn about them. Some tricks apply to other cases as well, but again you have to be careful.&lt;br /&gt;
&lt;br /&gt;
Some tricks are nothing more than the correct use of the instructions available on the Z80. Keeping an instruction set summary at hand helps you see what you can do while coding.&lt;br /&gt;
&lt;br /&gt;
=== Optimize size and speed ===&lt;br /&gt;
&lt;br /&gt;
==== Loading stuff ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld a,0&lt;br /&gt;
;Try this:&lt;br /&gt;
 xor a    ;disadvantages: changes flags&lt;br /&gt;
;or&lt;br /&gt;
 sub a    ;disadvantages: changes flags&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld b,$20&lt;br /&gt;
	ld c,$30&lt;br /&gt;
;try this&lt;br /&gt;
	ld bc,$2030&lt;br /&gt;
;or this&lt;br /&gt;
	ld bc,(b_num * 256) + c_num		;where b_num goes to b register and c_num to c register&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
  ld a,$42&lt;br /&gt;
  ld (hl),a&lt;br /&gt;
;try this&lt;br /&gt;
  ld (hl),$42&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (data1),a&lt;br /&gt;
	ld (data2),a&lt;br /&gt;
	ld (data3),a&lt;br /&gt;
	ld (data4),a&lt;br /&gt;
	ld (data5),a	;if data1 to data5 are one after the other&lt;br /&gt;
;try this&lt;br /&gt;
	ld hl,data1&lt;br /&gt;
	ld de,data1+1&lt;br /&gt;
	xor a&lt;br /&gt;
	ld (hl),a&lt;br /&gt;
	ld bc,4&lt;br /&gt;
	ldir&lt;br /&gt;
; -&amp;gt; save 3 bytes for every ld (dataX), after passing the initial overhead&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	ld a,(var)&lt;br /&gt;
	inc a&lt;br /&gt;
	ld (var),a&lt;br /&gt;
;try this	;Note: if hl is not tied up, use indirection:&lt;br /&gt;
	ld hl,var&lt;br /&gt;
	inc (hl)&lt;br /&gt;
	ld a,(hl) ;if you don't need (hl) in a, delete this line&lt;br /&gt;
; -&amp;gt; save 2 bytes and 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of :&lt;br /&gt;
 ld a, (hl)&lt;br /&gt;
 ld (de), a&lt;br /&gt;
 inc hl&lt;br /&gt;
 inc de&lt;br /&gt;
; Use :&lt;br /&gt;
 ldi&lt;br /&gt;
 inc bc&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop BC&lt;br /&gt;
    ld D,B&lt;br /&gt;
    ld E,C&lt;br /&gt;
;Use instead:&lt;br /&gt;
    push BC&lt;br /&gt;
;    ...&lt;br /&gt;
    pop DE      ;we only want DE to hold the pushed BC (no need to keep a copy in BC)&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Math and Logic tricks ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 cp 0&lt;br /&gt;
;Use&lt;br /&gt;
 or a&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  cp 1&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  dec a   ;changes a!&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  xor %11111111&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  cpl&lt;br /&gt;
; -&amp;gt; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,767&lt;br /&gt;
    or a       ;reset carry so sbc works as a sub&lt;br /&gt;
    sbc hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    ld de,-767 ;load the negated constant instead&lt;br /&gt;
    add hl,de&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states (note that the resulting flags differ from sbc)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
    ld de,-767&lt;br /&gt;
    add hl,de&lt;br /&gt;
;try this&lt;br /&gt;
    dec h  ; -256&lt;br /&gt;
    dec h  ; -512&lt;br /&gt;
    dec h  ; -768&lt;br /&gt;
    inc hl  ; -767&lt;br /&gt;
;Note that this works in many other cases, and that de is left untouched&lt;br /&gt;
; -&amp;gt; save 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
	srl a&lt;br /&gt;
;try this&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	and %00011111&lt;br /&gt;
; -&amp;gt; save 1 byte and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
	neg&lt;br /&gt;
	add a,N   ;you want to calculate N-A&lt;br /&gt;
;Do it this way:&lt;br /&gt;
	cpl&lt;br /&gt;
	add a,N+1    ;neg is practically equivalent to cpl \ inc a&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,B&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    xor A&lt;br /&gt;
    sub B&lt;br /&gt;
; -&amp;gt; save 1 byte and 4 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    ld A,D&lt;br /&gt;
    sub $D3&lt;br /&gt;
    neg&lt;br /&gt;
;Instead use:&lt;br /&gt;
    ld A,$D3&lt;br /&gt;
    sub D&lt;br /&gt;
; -&amp;gt; save 2 bytes and 8 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  sla l&lt;br /&gt;
  rl h         ; I've actually seen this!&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  add hl,hl&lt;br /&gt;
; -&amp;gt; save 3 bytes and 5 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Conditionals ====&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and 1         ;and sets zero flag, no need for cp&lt;br /&gt;
  jr nz,foo&lt;br /&gt;
; -&amp;gt; save 2 bytes and 7 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  and 1&lt;br /&gt;
  cp 1         ;a not needed after this&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  jr c,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 0,a&lt;br /&gt;
  call z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rra&lt;br /&gt;
  call nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 7,a&lt;br /&gt;
  jr z,foo&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  rla&lt;br /&gt;
  jr nc,foo&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  bit 2,a&lt;br /&gt;
  ret nz&lt;br /&gt;
  xor a&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
  and %100&lt;br /&gt;
  ret nz&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of:&lt;br /&gt;
  cp 9        ;if a&amp;lt;=9 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
  jp z,label&lt;br /&gt;
&lt;br /&gt;
; Use this:&lt;br /&gt;
  cp 9+1      ;if a&amp;lt;10 then goto label&lt;br /&gt;
  jp c,label&lt;br /&gt;
&lt;br /&gt;
; -&amp;gt; save 3 bytes and 10 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Code Flow ====&lt;br /&gt;
&lt;br /&gt;
You should almost never call a routine and then immediately return...&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 call xxxx&lt;br /&gt;
 ret&lt;br /&gt;
;try this&lt;br /&gt;
 jp xxxx&lt;br /&gt;
;only do this if the called routine does not use the return address pushed by the call. Example of a routine that does: some kind of inline vputs.&lt;br /&gt;
; -&amp;gt; save 1 byte and 17 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Never use:&lt;br /&gt;
    dec B&lt;br /&gt;
    jr NZ,loop    ;I have seen this...&lt;br /&gt;
;Use:&lt;br /&gt;
    djnz loop&lt;br /&gt;
; save 1 byte and 3 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; Instead of&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 cp 0&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 cp 1&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 cp 2&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 cp 3&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 cp 4&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 cp 5&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; This is a little better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 or a&lt;br /&gt;
 jp z,A_is_0&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_1&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_2&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_3&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_4&lt;br /&gt;
 dec a&lt;br /&gt;
 jp z,A_is_5&lt;br /&gt;
&lt;br /&gt;
; Even better&lt;br /&gt;
 ld a,(Number)&lt;br /&gt;
 add a,a   ; a*2 (Number must be in 0..127)&lt;br /&gt;
 ld h,0 &lt;br /&gt;
 ld l,a &lt;br /&gt;
 ld de,VectorTable&lt;br /&gt;
 add hl,de&lt;br /&gt;
 ld a,(hl)&lt;br /&gt;
 inc hl&lt;br /&gt;
 ld h,(hl)&lt;br /&gt;
 ld l,a&lt;br /&gt;
 jp (hl)&lt;br /&gt;
&lt;br /&gt;
VectorTable:&lt;br /&gt;
 .dw A_is_0&lt;br /&gt;
 .dw A_is_1&lt;br /&gt;
 .dw A_is_2&lt;br /&gt;
 .dw A_is_3&lt;br /&gt;
 .dw A_is_4&lt;br /&gt;
 .dw A_is_5&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Also see [[Z80 Good Programming Practices]]&lt;br /&gt;
&lt;br /&gt;
'''Fallthrough looping'''&lt;br /&gt;
&lt;br /&gt;
If you need to repeat a routine several times but can't spare registers for a loop counter or unroll the routine, try structuring the routine so it can call itself several times and fall through at the end. For example:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
foo:&lt;br /&gt;
  ld hl, data&lt;br /&gt;
  call bar      ; Run routine once&lt;br /&gt;
  call bar      ; .. twice&lt;br /&gt;
  call bar      ; .. three times&lt;br /&gt;
bar:&lt;br /&gt;
  ld a, (hl)    ; .. fourth and final time&lt;br /&gt;
  inc l&lt;br /&gt;
  and $0F&lt;br /&gt;
  out (c), a&lt;br /&gt;
  ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Others ====&lt;br /&gt;
&lt;br /&gt;
'''Toggling values in loops'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
loop:&lt;br /&gt;
 ld a,2&lt;br /&gt;
;code1&lt;br /&gt;
 ld a,0&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
&lt;br /&gt;
;try this&lt;br /&gt;
 ld a,2&lt;br /&gt;
loop:&lt;br /&gt;
;code1&lt;br /&gt;
 xor 2     ; the trick: xor makes a alternate between two values (here 2 and 0) each time it runs, so code1 and code2 must preserve a&lt;br /&gt;
;code2&lt;br /&gt;
 djnz loop&lt;br /&gt;
; -&amp;gt; save size and time depending on its use&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Table alignment'''&lt;br /&gt;
&lt;br /&gt;
If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient: &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld h, (sineTable &amp;gt;&amp;gt; 8) &amp;amp; $FF    ; Get MSB of table&lt;br /&gt;
 ld a, (frame_count)             ; Get index&lt;br /&gt;
 ld l, a&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl, sineTable                ; Get address of table&lt;br /&gt;
 xor a&lt;br /&gt;
 ld d, a                         ; Set index high byte to zero&lt;br /&gt;
 ld a, (frame_count)&lt;br /&gt;
 ld e, a                         ; Set index low byte&lt;br /&gt;
 add hl, de                      ; Add offset to base&lt;br /&gt;
 ld a, (hl)                      ; Look up value&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Size vs. Speed ===&lt;br /&gt;
&lt;br /&gt;
This is the classic optimization trade-off in computer programming, and the Z80 is no exception.&lt;br /&gt;
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give users a smaller program that doesn't take up most of their RAM.&lt;br /&gt;
&lt;br /&gt;
==== For the sake of size ====&lt;br /&gt;
&lt;br /&gt;
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to that absolute jump. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;lots of code (more than 128 bytes worth of code)&lt;br /&gt;
somelabel2:&lt;br /&gt;
 jp somelabel&lt;br /&gt;
;less than 128 bytes&lt;br /&gt;
 jr somelabel2   ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump is taken (jr then takes 12), while jr is faster when the jump is not taken (7 T-states).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 dec bc&lt;br /&gt;
 ld a,b&lt;br /&gt;
 or c&lt;br /&gt;
 ret z&lt;br /&gt;
;try this&lt;br /&gt;
 cpi              ;increments HL&lt;br /&gt;
 ret po&lt;br /&gt;
; save 1 byte at the cost of 2 T-states&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Passing inline data'''&lt;br /&gt;
&lt;br /&gt;
When you execute a call, the address of the instruction after the call (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data; a very nifty use is with strings. To return, advance the pointer past the data and jp (hl).&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of:&lt;br /&gt;
 ld hl,string&lt;br /&gt;
 bcall(_vputs)&lt;br /&gt;
 ret&lt;br /&gt;
;Try this:&lt;br /&gt;
  call Disp&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.&lt;br /&gt;
;It also heavily disturbs disassembly.&lt;br /&gt;
Disp:&lt;br /&gt;
  pop hl&lt;br /&gt;
  bcall(_vputs)&lt;br /&gt;
  jp (hl)&lt;br /&gt;
; -&amp;gt; save 3 bytes for each use (the ld hl,string), at the cost of 5 bytes of overhead (Disp routine)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This routine can be expanded to pass the coordinates where the text should appear.&lt;br /&gt;
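&lt;br /&gt;
One possible sketch of such an extension follows. It assumes that penCol and penRow are adjacent bytes in RAM (penCol first), as in the usual TI-83 Plus include file, and the name DispAt is made up for illustration:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  call DispAt&lt;br /&gt;
  .db 10,20                 ; column 10, row 20 (assumed order: penCol, then penRow)&lt;br /&gt;
  .db &amp;quot;This is some text&amp;quot;,0&lt;br /&gt;
  ret&lt;br /&gt;
&lt;br /&gt;
DispAt:&lt;br /&gt;
  pop hl                    ; hl = address of the inline data&lt;br /&gt;
  ld e,(hl)                 ; e = column&lt;br /&gt;
  inc hl&lt;br /&gt;
  ld d,(hl)                 ; d = row&lt;br /&gt;
  inc hl                    ; hl now points to the string&lt;br /&gt;
  ld (penCol),de            ; e goes to penCol, d to penRow (assumes they are adjacent)&lt;br /&gt;
  bcall(_vputs)             ; as in Disp, expected to leave hl just past the terminating 0&lt;br /&gt;
  jp (hl)                   ; resume execution after the inline data&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;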
&lt;br /&gt;
'''Wasting time to delay'''&lt;br /&gt;
&lt;br /&gt;
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size-friendly, think of other slower but smaller instructions. Example:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Instead of&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
 in a,(1)&lt;br /&gt;
;Try this:&lt;br /&gt;
 ld a,KEY_GROUP&lt;br /&gt;
 out (1),a&lt;br /&gt;
 ld a,(de)    ;a doesn't need to be preserved because it will hold what the port has.&lt;br /&gt;
 in a,(1)&lt;br /&gt;
; -&amp;gt; save 1 byte and 1 T-state (well 1 T-state less is almost the same time)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When you need to delay and cannot afford to alter registers or flags there are still ways to delay that waste less size than nop's :&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 2 bytes, 8 T-states&lt;br /&gt;
 nop&lt;br /&gt;
 nop&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 inc hl&lt;br /&gt;
 dec hl&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 12 T-states&lt;br /&gt;
 jr $+2&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 21 T-states&lt;br /&gt;
 push af&lt;br /&gt;
 pop af&lt;br /&gt;
&lt;br /&gt;
; 2 bytes, 38 T-states&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 ex (sp), hl&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you need a small adjustable delay:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;4 bytes, b*13+2 T-states (variable; 7 for ld b, 13 per taken djnz, 8 for the last one)&lt;br /&gt;
	ld b,255	; initial delay&lt;br /&gt;
	djnz $		; do it&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* There are many other instructions that you can use&lt;br /&gt;
* Beware that not all instructions preserve registers or flags&lt;br /&gt;
* For delays between frames of a game or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).&lt;br /&gt;
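&lt;br /&gt;
A minimal sketch of a halt-based delay, assuming the default interrupt setup is active; the label waitFrames and the count of 10 are arbitrary:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld b,10        ; number of interrupts to wait for&lt;br /&gt;
waitFrames:&lt;br /&gt;
 ei             ; make sure interrupts are enabled, otherwise halt would not wake up&lt;br /&gt;
 halt           ; sleep until the next interrupt&lt;br /&gt;
 djnz waitFrames&lt;br /&gt;
;b=0 on exit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;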
&lt;br /&gt;
==== Unrolling code ====&lt;br /&gt;
&lt;br /&gt;
'''General Unrolling'''&lt;br /&gt;
You can unroll a loop several times instead of looping; this is frequently done in multiplication routines.&lt;br /&gt;
This trades memory for speed, whereas most of the time you will prefer size over speed.&lt;br /&gt;
&lt;br /&gt;
'''Unroll commands'''&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; &amp;quot;Classic&amp;quot; way : ~21 T-states per byte copied&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size&lt;br /&gt;
 ldir&lt;br /&gt;
&lt;br /&gt;
; Unrolled n times : 16 + 10 / n T-states per byte copied -&amp;gt; ~17 T-states per byte copied when unrolling 8 times&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size  ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration&lt;br /&gt;
loopldi:    ;you can use this entry for a call&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 ldi&lt;br /&gt;
 jp pe, loopldi    ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size&lt;br /&gt;
; ret if this is a subroutine and use the unrolled ldi's with a call.&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
This unroll of ldi also works with outi and ldd.&lt;br /&gt;
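&lt;br /&gt;
For the case mentioned in the comments above, where size is a constant that is not a multiple of 8, one possible sketch of the entry trick is to skip the first few ldi instructions on the first pass. It relies on each ldi being 2 bytes long, and assumes your assembler evaluates constant expressions with &amp;amp; as bitwise AND:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 ld hl,src&lt;br /&gt;
 ld de,dest&lt;br /&gt;
 ld bc,size                                   ; size must be non-zero&lt;br /&gt;
 jp loopldi + (((0 - size) &amp;amp; 7) * 2)      ; first pass copies size mod 8 bytes (8 if that is 0)&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
After that partial first pass bc is a multiple of 8, so the jp pe at the end of the unrolled block keeps working unchanged.&lt;br /&gt;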
&lt;br /&gt;
==== Looping with 16 bit counter ====&lt;br /&gt;
There are two ways to make loops with a 16bit counter :&lt;br /&gt;
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld bc, ...&lt;br /&gt;
loop:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
 &lt;br /&gt;
  dec bc&lt;br /&gt;
  ld  a, b&lt;br /&gt;
  or  c&lt;br /&gt;
  jp  nz,loop&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states); here the count is passed in de and b serves as the inner counter&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  dec  de&lt;br /&gt;
  ld  b, e&lt;br /&gt;
  inc  b&lt;br /&gt;
  inc  d&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
  &lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
The rationale behind the second method is to reduce the overhead of the &amp;quot;inner&amp;quot; loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. &lt;br /&gt;
&lt;br /&gt;
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  #define inner_counter8(counter16) (((counter16) - 1) &amp;amp; 0xff) + 1&lt;br /&gt;
  #define outer_counter8(counter16) (((counter16) - 1) &amp;gt;&amp;gt; 8) + 1&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
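&lt;br /&gt;
As a worked example of these macros, assuming a constant count of 1000 iterations: 1000 - 1 = 999 = $03E7, so the inner counter is $E7 + 1 = 232 and the outer counter is $03 + 1 = 4, which gives 232 + 3 * 256 = 1000 passes through the loop body:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
  ld  b, inner_counter8(1000)   ; b = 232&lt;br /&gt;
  ld  d, outer_counter8(1000)   ; d = 4&lt;br /&gt;
loop2:&lt;br /&gt;
  ; loop body here&lt;br /&gt;
&lt;br /&gt;
  djnz loop2&lt;br /&gt;
  dec  d&lt;br /&gt;
  jp  nz,loop2&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;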
&lt;br /&gt;
=== Preserve Registers ===&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: both routines test whether b is 0; same size and speed, but the second preserves the accumulator&lt;br /&gt;
; remarks: - 8-bit inc/dec do not affect the carry flag&lt;br /&gt;
;          - 16-bit inc/dec do not affect any flags, so this trick does not extend to register pairs.&lt;br /&gt;
	ld a,b&lt;br /&gt;
	or b&lt;br /&gt;
	jr z,label&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	inc b&lt;br /&gt;
	dec b&lt;br /&gt;
	jr z,label&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; description: add a to hl without using another register pair&lt;br /&gt;
;normal way (destroys de):&lt;br /&gt;
	ld d,$00&lt;br /&gt;
	ld e,a&lt;br /&gt;
	add hl,de&lt;br /&gt;
;4 bytes and 22 clock cycles&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
	add a,l&lt;br /&gt;
	ld l,a&lt;br /&gt;
	jr nc, $+3&lt;br /&gt;
	inc h&lt;br /&gt;
;5 bytes, 19/20 clock cycles&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting flags ==&lt;br /&gt;
On some occasions you might want to selectively set or reset a flag.&lt;br /&gt;
&lt;br /&gt;
Here are the most common uses:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; set Carry flag&lt;br /&gt;
 scf&lt;br /&gt;
&lt;br /&gt;
; reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
; alternate reset Carry flag (alters Sign and Zero flags as defined)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
; set Zero flag (resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 cp a&lt;br /&gt;
&lt;br /&gt;
; reset Zero flag (alters a, resets Carry flag, alters Sign flag as defined)&lt;br /&gt;
 or 1&lt;br /&gt;
&lt;br /&gt;
; set Sign flag (negative) (alters a, resets Zero and Carry flags)&lt;br /&gt;
 or $80&lt;br /&gt;
&lt;br /&gt;
; reset Sign flag (positive) (sets a to zero, sets Zero flag, resets Carry flag)&lt;br /&gt;
 xor a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other possible uses (much rarer):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Set parity/overflow (even):&lt;br /&gt;
 xor a&lt;br /&gt;
&lt;br /&gt;
;Reset parity/overflow (odd):&lt;br /&gt;
 sub a&lt;br /&gt;
&lt;br /&gt;
;Set half carry (hardly ever useful but still...)&lt;br /&gt;
 and a&lt;br /&gt;
&lt;br /&gt;
;Reset half carry (hardly ever useful but still...)&lt;br /&gt;
 or a&lt;br /&gt;
&lt;br /&gt;
;Set bit 5 of f (undocumented flag, alters a):&lt;br /&gt;
 or %00100000&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As you can see, these are extremely simple, small and fast ways to alter flags,&lt;br /&gt;
which makes them interesting as routine outputs to indicate error/success or&lt;br /&gt;
other status bits that do not require a full register.&lt;br /&gt;
&lt;br /&gt;
If you use them, remember that these flag (re)setting tricks frequently&lt;br /&gt;
overlap, so a particular combination of flags might require slightly&lt;br /&gt;
more elaborate tricks. As a rule of thumb, always alter the carry last in&lt;br /&gt;
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched.&lt;br /&gt;
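&lt;br /&gt;
A minimal sketch of a routine that reports its result in the carry flag (isDigit and notDigit are hypothetical names used only for illustration):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; isDigit (hypothetical): carry set if a holds an ASCII digit '0'..'9'&lt;br /&gt;
; a is preserved, only flags are altered&lt;br /&gt;
isDigit:&lt;br /&gt;
	cp '0'          ; carry set if a &amp;lt; '0'&lt;br /&gt;
	ccf             ; carry reset if a &amp;lt; '0'&lt;br /&gt;
	ret nc          ; not a digit: return with carry reset&lt;br /&gt;
	cp '9'+1        ; carry set if a &amp;lt;= '9', reset otherwise&lt;br /&gt;
	ret&lt;br /&gt;
; caller side&lt;br /&gt;
	call isDigit&lt;br /&gt;
	jr nc,notDigit  ; notDigit is a hypothetical label in the caller&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;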
&lt;br /&gt;
More advanced ways of manipulating flags follow:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
;Copy the zero flag into the carry flag.&lt;br /&gt;
	scf&lt;br /&gt;
	jr z,$+3&lt;br /&gt;
	ccf&lt;br /&gt;
&lt;br /&gt;
;Put carry flag into zero flag (alters a).&lt;br /&gt;
	ccf&lt;br /&gt;
	sbc a, a&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools of the job ==&lt;br /&gt;
&lt;br /&gt;
Want to test your optimizations or try new ones? Then the following will help:&lt;br /&gt;
* Keep a Z80 instruction set reference at hand so you do not forget a useful instruction or the flags it affects (see [[Z80_Instruction_Set|Z80_Instruction_Set]])&lt;br /&gt;
* Use an assembler that has the &amp;quot;.echo&amp;quot; directive and use it in the source to report sizes (see [[Assemblers|Assemblers]]):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;SomeCodeorData:&lt;br /&gt;
;code or data goes here&lt;br /&gt;
End:&lt;br /&gt;
 .echo &amp;quot;size of the code/data:&amp;quot;&lt;br /&gt;
 .echo End-SomeCodeorData&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Get a Z80 IDE that counts code size ([[IDEs|IDEs]])&lt;br /&gt;
* Make use of the counting capabilities of an emulator such as WabbitEmu ([[:Category:Emulators|Emulators]])&lt;br /&gt;
&lt;br /&gt;
== Very specific optimizations (hardly practical) ==&lt;br /&gt;
&lt;br /&gt;
=== Table alignment ===&lt;br /&gt;
Place the table at an aligned address such as $8000 (theoretical example). If the table is at most 256 bytes long ($8000 to $80FF), you can step to the next byte with inc l instead of inc hl, which saves 2 T-states per step.&lt;br /&gt;
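&lt;br /&gt;
A hedged sketch of the idea, assuming a hypothetical 256-byte table placed at $8000. Since inc l (unlike inc hl) updates the Zero flag, the wrap of l back to $00 can also end the loop:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
table .equ $8000        ; hypothetical table, low byte of its address is $00&lt;br /&gt;
	ld hl,table&lt;br /&gt;
	xor a               ; running sum (example payload)&lt;br /&gt;
tableLoop:&lt;br /&gt;
	add a,(hl)          ; loop body: here, an 8-bit checksum&lt;br /&gt;
	inc l               ; 4 T-states instead of 6 for inc hl&lt;br /&gt;
	jr nz,tableLoop     ; Z is set once l wraps from $FF to $00&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;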
&lt;br /&gt;
== Crazy, &amp;quot;magick&amp;quot; hacks and obscure optimization tricks ==&lt;br /&gt;
&lt;br /&gt;
These are not normally recommended, because some of them confuse disassemblers and even programmers trying to understand the code.&lt;br /&gt;
&lt;br /&gt;
=== Better else ===&lt;br /&gt;
So you normally have an if-else-endif block like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
jr nz,else    ;the IF&lt;br /&gt;
;some code&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
;some code&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But here's a crazy trick for when the else code is a single 2-byte instruction:&lt;br /&gt;
use the first byte of a 3-byte instruction with no side effects instead of the &amp;quot;jr endif&amp;quot; line!&lt;br /&gt;
So if you had code like this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
jr endif&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You could replace it with this:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
cp 7&lt;br /&gt;
jr nz,else&lt;br /&gt;
ld a,3&lt;br /&gt;
.db $C2  ;jp nz,xxxx&lt;br /&gt;
else:&lt;br /&gt;
ld a,4&lt;br /&gt;
endif:&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of branching over the ld a,4 instruction, the if path now executes a jp nz,XXXX instruction whose operand XXXX is the two bytes of the next instruction. You already know what the flags will be at this point (Z is set, since jr nz,else was not taken and ld a,3 does not affect flags), so the jump is never taken. The net effect is to skip the next two bytes of execution without a branch.&lt;br /&gt;
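&lt;br /&gt;
The byte layout below is a worked illustration of how the replacement code is decoded (opcode bytes are taken from the standard Z80 encoding):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; bytes      as written        as executed on the if path&lt;br /&gt;
; FE 07      cp 7              cp 7&lt;br /&gt;
; 20 03      jr nz,else        jr nz,else   (not taken, Z is set)&lt;br /&gt;
; 3E 03      ld a,3            ld a,3&lt;br /&gt;
; C2         .db $C2           jp nz,$043E  (C2 3E 04, never taken)&lt;br /&gt;
; 3E 04      ld a,4            (swallowed as the jp operand)&lt;br /&gt;
;            endif:            execution resumes here&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;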
&lt;br /&gt;
&lt;br /&gt;
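On the if path, the never-taken jp costs 10 T-states instead of 12 for jr endif, and the code is 1 byte shorter. A small saving, but it could be useful in tight loops.&lt;br /&gt;
The same trick can skip a 1-byte instruction by using the first byte of a 2-byte instruction; the main reasons not to use it are code readability and bug safety. Watch those flags!&lt;br /&gt;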
&lt;br /&gt;
=== Conditional rst ===&lt;br /&gt;
&lt;br /&gt;
For a smaller conditional rst $38, use jr cc, -1 (that is, a displacement of -1; cc can be nz, z, nc or c). This causes a conditional jump onto the displacement byte itself ($FF), which is the rst $38 opcode.&lt;br /&gt;
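&lt;br /&gt;
A short sketch comparing a conventional form with the trick, here with the carry condition (the choice of condition is only an example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; conventional: 3 bytes&lt;br /&gt;
	call c,$0038&lt;br /&gt;
; &amp;gt;&lt;br /&gt;
; trick: 2 bytes, assembles to $38 $FF&lt;br /&gt;
	jr c,$+1        ; $+1 is the displacement byte, so the displacement is -1&lt;br /&gt;
; the rst pushes the address right after the $FF byte,&lt;br /&gt;
; so execution continues with the next instruction on return&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;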
&lt;br /&gt;
=== DAA trick ===&lt;br /&gt;
&lt;br /&gt;
Normally the DAA instruction is used for BCD math, but it can also convert a value from 0 to 15 in a into its ASCII hexadecimal digit ('0' to '9', 'A' to 'F'):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a, 30h&lt;br /&gt;
	daa&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
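&lt;br /&gt;
As a hedged usage sketch, the sequence can be wrapped in a routine that converts a whole byte into two hexadecimal characters (byteToHex and nibbleToHex are hypothetical names, and the register assignment is an arbitrary choice). For example, a = $3F gives d = '3' and e = 'F'.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; byteToHex (hypothetical): a = byte to convert&lt;br /&gt;
; returns d = ASCII high digit, e = ASCII low digit; destroys a, c, flags&lt;br /&gt;
byteToHex:&lt;br /&gt;
	ld c,a&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca&lt;br /&gt;
	rrca                ; high nibble moved to bits 0-3&lt;br /&gt;
	call nibbleToHex&lt;br /&gt;
	ld d,e              ; keep the high digit&lt;br /&gt;
	ld a,c              ; then fall through for the low nibble&lt;br /&gt;
nibbleToHex:&lt;br /&gt;
	and $0F             ; keep only the low nibble (0 to 15)&lt;br /&gt;
	cp 10&lt;br /&gt;
	ccf&lt;br /&gt;
	adc a,$30&lt;br /&gt;
	daa&lt;br /&gt;
	ld e,a&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;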
&lt;br /&gt;
== Related topics ==&lt;br /&gt;
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&amp;amp;t=675 MaxCoderz TI-ASM optimization]&lt;br /&gt;
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]&lt;br /&gt;
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]&lt;br /&gt;
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]&lt;br /&gt;
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]&lt;br /&gt;
* [http://www.smspower.org/Development/Z80ProgrammingTechniques SMS Power! dev wiki z80 Techniques]&lt;br /&gt;
&lt;br /&gt;
== Acknowledgements ==&lt;br /&gt;
* fullmetalcoder&lt;br /&gt;
* Galandros&lt;br /&gt;
* Dwedit for sharing in MaxCoderz the &amp;quot;Better else&amp;quot;&lt;br /&gt;
* MaxCoderz participants in assembly optimizing topic (Jim e,CoBB,...)&lt;br /&gt;
* SMS Power wiki&lt;br /&gt;
* Einar Saukas&lt;/div&gt;</summary>
		<author><name>Einar</name></author>	</entry>

	</feed>