https://wikiti.brandonw.net/api.php?action=feedcontributions&user=Fullmetalcoder&feedformat=atomWikiTI - User contributions [en]2024-03-29T02:12:14ZUser contributionsMediaWiki 1.23.5https://wikiti.brandonw.net/index.php?title=Category_talk:83Plus:QuirksCategory talk:83Plus:Quirks2010-07-22T23:16:20Z<p>Fullmetalcoder: Created page with '"The calculator will crash if PC is greater than or equal to C000, provided an even-numbered RAM page is swapped in the upper bank. This is the default (page 80h). This is where …'</p>
<hr />
<div>"The calculator will crash if PC is greater than or equal to C000, provided an even-numbered RAM page is swapped in the upper bank. This is the default (page 80h). This is where the 8kb limit comes from "<br />
<br />
This is not a "quirk" but a relatively well documented hardware constraint known as memory (execution) protection.</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=83Plus:OS:TIOS_Alternatives83Plus:OS:TIOS Alternatives2010-01-21T15:23:14Z<p>Fullmetalcoder: /* List of alternative operating systems */</p>
<hr />
<div>[[Category:83Plus:OS_Information|TIOS Alternatives]]<br />
<br />
== Introduction ==<br />
<br />
The TIOS is the official, standard operating system for the<br />
TI-83+ series of calculators (including the TI-83+ and TI-84+ and the respective Silver<br />
Editions of each). The vast majority of programs for such calculators run on top of the<br />
TIOS, or on top of some subsidiary program that runs from the TIOS, and thereby implicitly<br />
depend on its functionality. However, some people have researched writing alternative<br />
operating system code and sending it to the calculator to replace the TIOS.<br />
Several such alternative operating systems are available on the Web, though most of them are<br />
not yet at a production stage of development.<br />
<br />
== List of alternative operating systems ==<br />
<br />
*[http://michaelv.org/programs/calcs/ceptic.php CEPTIC], by Michael Vincent: a Control and Execute Program for TI Calculators. The current version of CEPTIC only runs on the TI-83+ SE, but can be modified to run on the TI-83+. Assembly source is available, but actually using the OS in its present state is considered non-practical, and the project has been discontinued for various reasons.<br />
*[http://pongos.sourceforge.net/ PongOS], by FloppusMaximus: a simple, proof-of-concept system whose namesake feature is an embedded Pong game. Some other system utilities, mostly inspired by Dan Englender's Calcsys, are also available in PongOS, including a hex editor, memory mover (with flash capability), flash sector eraser, and port monitor. Link support is not provided.<br />
*[[Vera]], by several members of different programming groups: dubbed the "true calc lover's OS". Vera is intended to consist of a very basic kernel which can be easily extended to include desired features. The original Vera project has been abandoned, but it has been picked up again in a different form, and seems to be progressing nicely.<br />
*[http://www.ticalc.org/archives/files/fileinfo/349/34973.html CSX], by Sean McLaughlin: a command-line-based operating system with a screen layout similar to that of the TI-89 calculators. CSX provides a simple filesystem, send and receive of files over a link cable, hex editing of memory, and running of Z80 machine code programs.<br />
*[http://forum.reaktix.com/viewtopic.php?pid=11 Nostalgy], originally by [http://katpro.xiondigital.net/ XDG Kat-Productions], now developed by [http://reaktix.com/ Reaktix Software]: an unofficial project started by XDG Kat-Productions, abandoned when the two main developers became involved in other projects, and later resumed by [[User:Saibot84|Saibot84]]. A pre-alpha working demo is available. Development is still underway, albeit extremely slowly. It currently features a task-switching environment inspired by [http://www.radicalsoft.org/ Radical Software's] TSE, although linking and a file system are not yet implemented.<br />
*[http://lifos.sourceforge.net/ LIFOS], by Peter Marheine: a similar project to Vera, designed to offer minimal functionality (linking, memory management, and machine code execution) in its basic incarnation but meant to be easily extended into a near-seamless infrastructure of various functions. Currently (5-28-07) in early alpha stages. The name comes from the memory allocation system (LIFO OS, or LIFOs).<br />
*[http://www.ticalc.org/archives/files/fileinfo/398/39863.html BAOS], by Erik van 't Wout: Basic Assembly Operating System. (excerpt from the ReadMe:) Being developed "to be a real Operating System for TI-83+ based calculators. It should turn your calculator into a real computer, which can be used for mathematical purposes, but not as main target. Additional functionality should be easy to implement trough the use of libraries."<br />
*[http://www.brandonw.net/calcstuff/OS2 OS2], by Brandon Wilson: OS2 is "the TI-OS done right", a from-the-ground-up re-implementation of the TI-OS, designed to do everything the TI-OS can do and run everything it can, but with fewer restrictions and changes not normally possible due to the TI-OS' structure, such as being able to run BASIC programs directly from the archive. It also supports dual-booting with the TI-OS so a user can continue to use the original TI-OS while more and more is added to OS2. Currently only dual-booting and a basic system monitor are supported.<br />
*[http://code.google.com/p/8xpos/ XOS], by Luc Bruant (aka fullmetalcoder): XOS primarily targets the SE calculators (those with 128 KB of RAM). It aims to give application developers a lot of power, to offer larger storage capacity whenever possible, and to provide a minimal emulation layer that ensures some backward compatibility with TIOS programs and Apps, so as to ease the transition for users.</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Emulators:TilEmEmulators:TilEm2009-11-23T11:44:49Z<p>Fullmetalcoder: /* Brief description */</p>
<hr />
<div>[[Category:Emulators]]<br />
{{stub}}<br />
<br />
== Brief description ==<br />
<br />
TilEm is an emulator of z80 TI calculators. The latest version of the core library is z80-only but significantly more complete and accurate.<br />
<br />
Homepage: http://lpg.ticalc.org/prj_tilem/<br />
<br />
Sourceforge project page : http://sf.net/projects/tilem<br />
<br />
== Features ==<br />
<br />
=== Latest stable version (0.973) ===<br />
* Support operating systems other than Microsoft Windows<br />
* Support all z80 TI calculators except the TI-81, and all known ROM/OS versions<br />
* Virtual linking via all physical and virtual cables except direct USB<br />
* Automatic certificate patching for dumped 73/83+/84+ ROMs<br />
* Emulation of certain "security" features on the 83+<br />
* Flexible keyboard handling, with both key macros and manual make/break support<br />
<br />
=== Upcoming version (SVN trunk) ===<br />
* Support all z80 TI calculators, and all known ROM/OS versions<br />
* Virtual linking via all physical and virtual cables except direct USB<br />
* Automatic certificate patching for dumped 73/83+/84+ ROMs<br />
* Emulation of certain "security" features on the 83+<br />
* Fully cross-platform Qt-based GUI<br />
* Possibility to load/save calculator state (including ROM) at run-time<br />
* Possibility to emulate multiple calculators side by side in a single instance of the GUI<br />
* Possibility to send files to the calculator through drag and drop</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Emulators:TilEmEmulators:TilEm2009-11-23T11:44:27Z<p>Fullmetalcoder: </p>
<hr />
<div>[[Category:Emulators]]<br />
{{stub}}<br />
<br />
== Brief description ==<br />
<br />
TilEm is an emulator of z80 TI calculators. The latest version of the core library is z80-only but significantly more complete and accurate.<br />
<br />
Homepage: http://lpg.ticalc.org/prj_tilem/<br />
Sourceforge project page : http://sf.net/projects/tilem<br />
<br />
== Features ==<br />
<br />
=== Latest stable version (0.973) ===<br />
* Support operating systems other than Microsoft Windows<br />
* Support all z80 TI calculators except the TI-81, and all known ROM/OS versions<br />
* Virtual linking via all physical and virtual cables except direct USB<br />
* Automatic certificate patching for dumped 73/83+/84+ ROMs<br />
* Emulation of certain "security" features on the 83+<br />
* Flexible keyboard handling, with both key macros and manual make/break support<br />
<br />
=== Upcoming version (SVN trunk) ===<br />
* Support all z80 TI calculators, and all known ROM/OS versions<br />
* Virtual linking via all physical and virtual cables except direct USB<br />
* Automatic certificate patching for dumped 73/83+/84+ ROMs<br />
* Emulation of certain "security" features on the 83+<br />
* Fully cross-platform Qt-based GUI<br />
* Possibility to load/save calculator state (including ROM) at run-time<br />
* Possibility to emulate multiple calculators side by side in a single instance of the GUI<br />
* Possibility to send files to the calculator through drag and drop</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-09T11:36:45Z<p>Fullmetalcoder: </p>
<hr />
<div>== Introduction ==<br />
Sometimes you need extra speed from your ASM code, or you need to make your game smaller so that it fits on the calculator. Typical examples are programs that carry a lot of graphics or data, and graphics code such as tile mapping, grayscale and 3D rendering.<br />
<br />
== Registers and Memory ==<br />
Good z80 algorithms generally make appropriate use of the registers.<br />
It is also good practice to settle on a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/save copy of hl/pointer to memory when hl and de are being used<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space (destroys hl) : 5 bytes, 27 T-states, instead of 4 bytes, 44 T-states with 4 push, which would also have forced the allocation of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on the stack is through an index register, since every allocated "variable" can then be reached without inc/dec, but this is not a strict requirement. Beware though: using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
 cp (ix + n)<br />
<br />
 sub (ix + n)<br />
 sbc a, (ix + n)<br />
 add a, (ix + n)<br />
 adc a, (ix + n)<br />
<br />
 and (ix + n)<br />
 or (ix + n)<br />
 xor (ix + n)<br />
<br />
; 3 bytes, 23 T-states (read-modify-write)<br />
 inc (ix + n)<br />
 dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
 rl (ix + n)<br />
 rr (ix + n)<br />
 rlc (ix + n)<br />
 rrc (ix + n)<br />
 sla (ix + n)<br />
 sra (ix + n)<br />
 sll (ix + n) ; undocumented instruction<br />
 srl (ix + n)<br />
 set k, (ix + n) ; k is an immediate value in 0..7<br />
 res k, (ix + n)<br />
<br />
; 4 bytes, 20 T-states<br />
 bit k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 8 to 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register :<br />
<nowiki><br />
; drops the top stack entry : costs 1 byte and 2 T-states more than a pop but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16-bit register (hl, an index register, or de, which you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you save, and past a certain point you start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).<br />
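<br />
Putting allocation, access and deallocation together, a routine with a small stack frame might look like the following sketch. The routine name and the stored values are made up for illustration, and it assumes that the caller expects ix to be preserved and does not care about hl :<br />
 <nowiki><br />
scratch_demo: ; hypothetical name<br />
 push ix ; preserve the caller's ix<br />
 ld ix, -4<br />
 add ix, sp ; ix = address of the lowest byte of the new frame<br />
 ld sp, ix ; allocate 4 bytes<br />
<br />
 ld (ix + 0), 10 ; first local variable<br />
 ld (ix + 1), 3 ; second local variable<br />
 ld a, (ix + 0)<br />
 sub (ix + 1) ; a = 10 - 3 = 7<br />
 ld (ix + 2), a ; third local variable<br />
<br />
 ld hl, 4<br />
 add hl, sp<br />
 ld sp, hl ; release the 4 bytes (destroys hl)<br />
 pop ix ; restore the caller's ix<br />
 ret<br />
</nowiki><br />
Because the frame sits at or above sp for its whole lifetime, anything pushed later (including by an interrupt) lands below it and does not overwrite the local variables.<br />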
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard register to a shadow register or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Register and memory use is very important in writing concise and fast z80 code. General optimization comes next.<br />
<br />
First, try to optimize the most frequently executed code: subroutines that are called often and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Rework conditionals to be more efficient.<br />
* Make sure the most common checks come first. In other words, check for the special and rare cases last.<br />
* Move special-case checks out of the main loop if they aren't needed there.<br />
* Rearrange program flow<br />
* When you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.<br />
* When your code looks like it won't be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.<br />
* Rewriting code from scratch can bring new ideas (keep this for desperate situations, given all the work it takes)<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early brings headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed at all.<br />
* Document wacky optimizations so that you can understand the code later<br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to optimize your algorithm and register allocation first, before applying any of the following, if you really want the fastest speed and the smallest code.<br />
<br />
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. After a while without reading parts of the code, you can easily forget how they work.<br />
<br />
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.<br />
<br />
Some tricks are nothing more than the correct use of the instructions available on the z80. Keeping an instruction set summary at hand helps you see what you can do while coding.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num ;where b_num goes to b register and c_num to c register<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,$42<br />
ld (hl),a<br />
;try this<br />
ld (hl),$42<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
; -> save 3 bytes for every ld (dataX), after passing the initial overhead (this one is a size optimization only: ldir is slower than the individual stores)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this (if hl is not tied up, use indirection):<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl) ;if you don't need (hl) in a, delete this line<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math and Logic tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
cp 1<br />
; ><br />
dec a ;changes a!<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
xor %11111111<br />
; ><br />
cpl<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry so sbc works as a sub<br />
sbc hl,de<br />
;try this<br />
ld de,-767 ;negation of de<br />
add hl,de<br />
; -> save 2 bytes and 8 T-states !<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
;Note that this works in many other cases, and that it leaves de untouched<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;Do it this way:<br />
cpl<br />
add a,N+1 ;neg is practically equivalent to cpl \ inc a<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
sla l ; I've actually seen this!<br />
rl h<br />
; ><br />
add hl,hl<br />
; -> save 3 bytes and 5 T-states<br />
</nowiki><br />
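<br />
Because add hl,hl doubles hl in a single byte, multiplications by small constants can be built from a few doublings and one addition. A minimal sketch of hl * 10, assuming de is free and the result fits in 16 bits :<br />
 <nowiki><br />
; hl = hl * 10 : 6 bytes, 52 T-states, destroys de<br />
 add hl,hl ; hl * 2<br />
 ld d,h<br />
 ld e,l ; de = hl * 2<br />
 add hl,hl ; hl * 4<br />
 add hl,hl ; hl * 8<br />
 add hl,de ; hl * 8 + hl * 2 = hl * 10<br />
</nowiki><br />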
<br />
==== Conditionals ====<br />
<br />
<nowiki><br />
and 1<br />
cp 1<br />
jr z,foo<br />
; ><br />
and 1 ;and sets zero flag, no need for cp<br />
jr nz,foo<br />
; -> save 2 bytes and 7 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
and 1<br />
cp 1 ;a not needed after this<br />
jr z,foo<br />
; ><br />
rra<br />
jr c,foo<br />
</nowiki><br />
<br />
<nowiki><br />
bit 0,a<br />
call z,foo<br />
; ><br />
rra<br />
call nc,foo<br />
</nowiki><br />
<br />
<nowiki><br />
bit 7,a<br />
jr z,foo<br />
; ><br />
rla<br />
jr nc,foo<br />
</nowiki><br />
<br />
<nowiki><br />
bit 2,a<br />
ret nz<br />
xor a<br />
; ><br />
and %100<br />
ret nz<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the called routine does not use the return address pushed by the call (example: some kind of inline vputs).<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Never use:<br />
dec B<br />
jr NZ,loop ;I have seen this...<br />
;Use:<br />
djnz loop<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
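;try this<br />
 ld a,2<br />
loop:<br />
 ;code1<br />
 xor 2 ; the trick is that xor with (2 xor 0) makes a alternate between the two values 2 and 0<br />
 ;code2<br />
 djnz loop<br />
; -> save size and time depending on its use (code1 and code2 must preserve a, and a now toggles once per pass instead of being reloaded twice)<br />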
</nowiki><br />
<br />
<br />
<br />
=== Size vs. Speed ===<br />
<br />
This is the classical problem of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (outside -128 to 127 bytes) and there is a jp to it nearby, do a relative jump to that absolute jump. Example:<br />
<br />
<nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
jp somelabel<br />
;less than 128 bytes<br />
 jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
</nowiki><br />
<br />
* Relative jumps are 2 bytes and absolute jumps are 3. In terms of speed, jp (10 T-states whether taken or not) is faster when the jump is taken, since jr then needs 12 T-states, and jr is faster when the jump is not taken (7 T-states).<br />
<br />
<br />
<nowiki><br />
;Instead of<br />
dec bc<br />
ld a,b<br />
or c<br />
ret z<br />
;try this<br />
 cpi ;decrements bc and increments hl (also compares a with (hl), so Z, S and H change)<br />
 ret po<br />
; -> save 1 byte at the cost of 2 T-states<br />
</nowiki><br />
<br />
'''Passing inline data'''<br />
<br />
When you call a routine, pc + 3 (the address right after the call) is pushed onto the stack. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, advance the pointer past the data and jp (hl).<br />
<br />
<nowiki><br />
;Instead of:<br />
ld hl,string<br />
bcall(_vputs)<br />
ret<br />
;Try this:<br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
; -> save 3 bytes for each use (call Disp instead of ld hl,string + bcall), after 5 bytes of overhead (Disp routine)<br />
</nowiki><br />
<br />
This routine can be expanded to pass the coordinates where the text should appear.<br />
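<br />
A minimal sketch of such an expanded routine follows. The name DispAt is made up for illustration, and the code assumes that penCol and penRow are adjacent bytes in RAM (penCol first), as they are in the usual ti83plus.inc, so that both can be written with one 16-bit store :<br />
 <nowiki><br />
 call DispAt<br />
 .db 10, 20 ; column, row (inline data)<br />
 .db "This is some text",0<br />
 ret<br />
<br />
DispAt: ; hypothetical name<br />
 pop hl ; hl points to the inline data<br />
 ld e,(hl) ; e = column<br />
 inc hl<br />
 ld d,(hl) ; d = row<br />
 inc hl ; hl now points to the string<br />
 ld (penCol),de ; assumption: penRow = penCol + 1<br />
 bcall(_vputs) ; leaves hl just past the terminating 0<br />
 jp (hl) ; resume execution after the string<br />
</nowiki><br />
The two coordinate bytes cost less than loading penCol and penRow explicitly before each call, at the price of a slightly longer shared routine.<br />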
<br />
'''Wasting time to delay'''<br />
<br />
There are those funny times when you need a short delay between operations, such as reads/writes to ports, '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other instructions that are slower but smaller. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
 ld a,(de) ;a doesn't need to be preserved because the port read overwrites it anyway.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that take less space than nop's for the same duration :<br />
<nowiki><br />
; 2 bytes, 8 T-states<br />
nop<br />
nop<br />
<br />
; 2 bytes, 12 T-states<br />
inc hl<br />
dec hl<br />
<br />
; 2 bytes, 12 T-states<br />
jr $+2<br />
<br />
; 2 bytes, 21 T-states<br />
push af<br />
pop af<br />
<br />
; 2 bytes, 38 T-states<br />
ex (sp), hl<br />
ex (sp), hl<br />
</nowiki><br />
<br />
If you need a small adjustable delay:<br />
<nowiki><br />
;4 bytes, 13*b+2 T-states including the ld b (variable)<br />
ld b,255 ; initial delay<br />
djnz $ ; do it<br />
;b=0 on exit<br />
</nowiki><br />
<br />
Notes:<br />
* There are many other instructions that you can use<br />
* Beware that not all instructions preserve registers or flags<br />
* For delays between frames of a game, or other longer delays, you can use the 'halt' instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).<br />
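<br />
A minimal sketch of such a halt-based delay, assuming interrupts may be enabled at this point and that the interrupt handler re-enables them before returning. The label name and the count of 4 are arbitrary, and the actual duration depends on the interrupt rate configured on the calculator :<br />
 <nowiki><br />
 ei ; halt would never wake up with interrupts disabled<br />
 ld b,4 ; number of interrupts to wait for (arbitrary)<br />
frame_wait: ; hypothetical label<br />
 halt ; sleep until the next interrupt<br />
 djnz frame_wait<br />
</nowiki><br />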
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can unroll a loop several times instead of looping; this is frequently done in multiplication routines.<br />
Unrolling spends memory to gain speed, whereas most of the time you will prefer size over speed.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : 16 + 10 / n T-states per byte copied, where n is the number of unrolled ldi -> ~17.25 T-states per byte copied when unrolling 8 times<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
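The same unrolling works with ldd. It also works with outi, but outi counts in b and reports the end of the loop through the Zero flag, so the loop must end with jp nz instead of jp pe.<br />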
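<br />
When the size is known at assembly time but is not a multiple of the unroll count, one way to handle the first iteration is to enter the loop part-way through. A sketch for a 20-byte copy with 8 unrolled ldi (20 = 2 * 8 + 4, so the first pass must only run 4 of them); src and dest are placeholder labels :<br />
 <nowiki><br />
 ld hl,src<br />
 ld de,dest<br />
 ld bc,20<br />
 jp loopldi + (2 * 4) ; skip the first 4 ldi, each ldi being 2 bytes long<br />
loopldi:<br />
 ldi<br />
 ldi<br />
 ldi<br />
 ldi<br />
 ldi ; first pass starts here<br />
 ldi<br />
 ldi<br />
 ldi<br />
 jp pe, loopldi ; P/V stays set while bc is not zero<br />
</nowiki><br />
In general the number of ldi to skip is (n - size mod n) mod n, where n is the unroll count.<br />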
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16-bit counter :<br />
* the naive one, which results in smaller code but a higher loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)<br />
<nowiki><br />
 dec de ; on entry, de holds the 16-bit iteration count<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b starts at zero, djnz treats it as 256. <br />
<br />
You can therefore use the following macros to set proper values of the 8-bit loop counters from a 16-bit counter, in case you want to do the conversion at assembly time :<br />
<br />
<nowiki><br />
#define inner_counter8(counter16) ((((counter16) - 1) & 0xff) + 1)<br />
#define outer_counter8(counter16) ((((counter16) - 1) >> 8) + 1)<br />
</nowiki><br />
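<br />
As a worked example, for a count of 600 the macros give inner_counter8(600) = (599 & 0xff) + 1 = 88 and outer_counter8(600) = (599 >> 8) + 1 = 3, and indeed 88 + 2 * 256 = 600 iterations. A sketch of the loop with constants computed at assembly time, assuming your assembler accepts function-like #define macros as written above (otherwise use the literal values) :<br />
 <nowiki><br />
 ld b, inner_counter8(600) ; 88 iterations on the first pass<br />
 ld d, outer_counter8(600) ; 3 passes in total<br />
loop2:<br />
 ; loop body here<br />
<br />
 djnz loop2 ; after the first pass b is 0, which djnz treats as 256<br />
 dec d<br />
 jp nz,loop2<br />
</nowiki><br />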
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (odd):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use these, remember that the flag (re)setting tricks frequently<br />
overlap, so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Zero, Sign and P/V flags untouched.<br />
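<br />
As an illustration of a flag used as a status output, here is a sketch of a routine that reports failure through the carry flag. The routine name and its contract are made up for this example :<br />
 <nowiki><br />
; in : a = ASCII character<br />
; out : carry reset and a = 0..9 if the character is a digit, carry set otherwise (a destroyed)<br />
parse_digit: ; hypothetical name<br />
 sub $30 ; subtract the code of '0' : characters below '0' produce a borrow<br />
 ret c ; carry set : not a digit<br />
 cp 10 ; carry set iff a < 10<br />
 ccf ; invert it : carry set iff a >= 10<br />
 ret<br />
</nowiki><br />
The caller can then branch directly on the result, for instance with call parse_digit followed by jr c to its error handling code, without spending a register on the status.<br />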
<br />
More advanced ways of manipulating flags follow:<br />
 <nowiki><br />
;Copy the zero flag into the carry flag<br />
 scf<br />
 jr z,$+3 ; skip the 1-byte ccf when Zero is set<br />
 ccf<br />
<br />
;Copy the carry flag into the zero flag (destroys a)<br />
 ccf<br />
 sbc a, a<br />
</nowiki><br />
<br />
== Tools of the job ==<br />
<br />
Want to test your optimizations or try new ones? Then check the following:<br />
* Keep a z80 instruction set reference at hand so that you do not forget a useful instruction. (see [[Z80_Instruction_Set|Z80_Instruction_Set]])<br />
* Get an assembler that can echo values and use this in the source to count bytes: (see [[Assemblers|Assemblers]])<br />
<nowiki><br />
SomeCodeorData:<br />
;code or data goes here<br />
<br />
End:<br />
<br />
.echo "size of the code/data:"<br />
.echo End-SomeCodeorData<br />
</nowiki><br />
* Get a nice z80 IDE that counts code size ([[IDEs|IDE's]])<br />
* Make use of the counting capabilities of an emulator ([[:Category:Emulators|Emulators]])<br />
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
* [http://shiar.nl/calc/z80/optimize Shiar z80 optimization page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros<br />
* <!-- do not forget to include MaxCoderz users names that participated in the TI-ASM optimizing topic and 'we' included here --></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Optimized:addAtoHLZ80 Routines:Optimized:addAtoHL2009-11-08T10:55:23Z<p>Fullmetalcoder: </p>
<hr />
<div>[[Category:Z80 Routines:Optimized|AddAtoHL]]<br />
[[Category:Z80 Routines|AddAtoHL]]<br />
<br />
This is an optimized addAtoHL. It is a little faster than the usual approach and does not need another 16-bit register.<br />
<br />
It can also be adapted to add A to any other 16-bit register pair. The only downside is one extra byte.<br />
<br />
Use it as a subroutine (don't forget the ret) or as a macro.<br />
<br />
Normal way:<br />
<nowiki><br />
ld d,$00<br />
ld e,a<br />
add hl,de<br />
;4 bytes and 22 clock cycles<br />
</nowiki><br />
<br />
<nowiki><br />
addAtoHL:<br />
add a,l<br />
ld l,a<br />
adc a,h ;^ these two lines<br />
sub l ;v increase h if there is carry<br />
ld h,a<br />
;5 bytes and 20 clock cycles<br />
;no other 16-bit register is destroyed, but a is (it ends up equal to the new h)<br />
</nowiki><br />
<br />
Thanks to CoBB.<br />
<br />
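As a sketch of the claim above that the routine can be adapted to other register pairs, here is the same sequence rewritten for de (the label addAtoDE is only an illustrative name, not an existing routine):<br />
 <nowiki><br />
addAtoDE:<br />
add a,e ;a = a + e, carry set on overflow<br />
ld e,a ;e = low byte of the result<br />
adc a,d ;a = e + d + carry<br />
sub e ;a = d + carry<br />
ld d,a ;d = high byte of the result<br />
;5 bytes and 20 clock cycles, a is destroyed<br />
</nowiki><br />
<br />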
Another way, which uses branching and saves 1 T-state on one path (when there is a carry into h) :<br />
<nowiki><br />
addAtoHL:<br />
add a,l<br />
ld l,a<br />
jr nc, $+3<br />
inc h<br />
;5 bytes, 19 clock cycles when carrying into h, 20 otherwise<br />
</nowiki></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-06T10:04:01Z<p>Fullmetalcoder: /* For the sake of size */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples are programs that use a lot of graphics or data, and graphics code for tile mapping, grayscale and 3D.<br />
<br />
== Registers and Memory ==<br />
Good Z80 algorithms generally use registers appropriately.<br />
It is also good practice to keep to a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/saved copy of hl/pointer to memory when hl and de are in use<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
and (ix + n)<br />
or (ix + n)<br />
xor (ix + n)<br />
<br />
; 3 bytes, 23 T-states<br />
inc (ix + n)<br />
dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
sll (ix + n) ; undocumented instruction, not accepted by every assembler<br />
srl (ix + n)<br />
set k, (ix + n) ; k is an immediate value in 0..7<br />
res k, (ix + n)<br />
<br />
; 4 bytes, 20 T-states<br />
bit k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 8 to 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register :<br />
 <nowiki><br />
; drops the top stack entry : wastes 1 byte and 2 T-states compared with pop, but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you save, and beyond 10 bytes (more than five pop) you start saving space as well.<br />
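<br />
Putting allocation, access and deallocation together, a routine with 4 bytes of local storage might look like the following sketch. The label withLocals and the values stored are purely illustrative, and the code assumes that hl and the flags may be destroyed:<br />
 <nowiki><br />
withLocals:<br />
push ix ; preserve the caller's ix<br />
ld ix, -4<br />
add ix, sp ; ix = sp - 4<br />
ld sp, ix ; reserve 4 bytes : locals are (ix + 0) to (ix + 3)<br />
<br />
ld (ix + 0), 10 ; initialize one local<br />
inc (ix + 0) ; work on it in place, no register needed<br />
ld a, (ix + 0) ; a = 11<br />
<br />
ld hl, 4<br />
add hl, sp<br />
ld sp, hl ; release the 4 bytes<br />
pop ix ; restore the caller's ix<br />
ret<br />
</nowiki><br />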
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl), and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (compared with a plain di at the start and ei / ret at the end, it adds 5 bytes and 27 or 35 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick : it copies IFF2 into P/V, so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Register and memory use is very important in writing concise and fast Z80 code. Then comes general optimization.<br />
<br />
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Make sure the most common checks come first. In other words, check for the rarer, more special cases last.<br />
* Move special-case checks out of the main loop if they are not needed there.<br />
* When possible, accept a bigger setup overhead if it lets you move code out of the main loop.<br />
* When your code looks like it will not be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.<br />
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work involved).<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and often smaller than code produced from higher-level languages, further optimization may not be needed at all.<br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
; -> always 13 bytes : smaller as soon as there are more than four variables to clear (but slower)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this if hl is not tied up (if all you need is the flags, the final ld a,(hl) can be dropped too):<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry<br />
sbc hl,de<br />
;try this<br />
ld de,-767<br />
add hl,de<br />
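; -> save 2 bytes and 8 T-states ! (but the flags differ : add hl,de leaves Zero and Sign untouched and its carry is the inverse of the one sbc would give)<br />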
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;doing it this way:<br />
cpl<br />
add a,N+1 ;This is because neg is practically equivalent to cpl \ inc a.<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
xor 2 ; xor with (value1 xor value2) makes a register alternate between the two values, here 2 and 0<br />
;code2<br />
djnz loop<br />
; -> save size and time depending on its use (code1 and code2 must preserve a)<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the return address pushed by the call is not used by the called routine. Example: some kind of inline vputs.<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
Size versus speed is the classic problem of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.<br />
Speed can also be needed...<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (-128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to the absolute one. Example:<br />
<br />
<nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
jp somelabel<br />
;less than 128 bytes<br />
jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
</nowiki><br />
<br />
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it doesn't, so jp is faster when the jump is taken and jr is faster when it is not.<br />
<br />
<br />
'''Passing inline data'''<br />
When you call, pc + 3 (the address just after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, step past the data and jp (hl).<br />
<br />
<nowiki><br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
</nowiki><br />
<br />
This routine can be extended to also pass the coordinates where the text should appear.<br />
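<br />
A minimal sketch of that extension, assuming the usual ti83plus.inc names penCol and penRow (with penRow stored in the byte right after penCol) and assuming, like the routine above, that _vputs leaves hl just past the terminating zero. The label DispAt is hypothetical:<br />
 <nowiki><br />
call DispAt<br />
.db 10,20 ;column, row<br />
.db "This is some text",0<br />
ret<br />
<br />
DispAt:<br />
pop hl ;hl -> inline data<br />
ld e,(hl) ;e = column<br />
inc hl<br />
ld d,(hl) ;d = row<br />
inc hl ;hl -> start of the string<br />
ld (penCol),de ;sets penCol and penRow in one store<br />
bcall(_vputs)<br />
jp (hl) ;return to the code after the string<br />
</nowiki><br />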
<br />
'''Wasting time to delay'''<br />
There are those funny times when you need some delay between operations like reads/writes to ports '''''and there is nothing useful to do'''''. Because nops are not very size-friendly, think of other instructions that take longer per byte. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
ld a,(de) ;a doesn't need to be preserved because it will hold what the port has.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
When you need to delay and cannot afford to alter registers or flags, there are still ways to delay that waste less space than nops :<br />
<nowiki><br />
; 2 bytes, 8 T-states<br />
nop<br />
nop<br />
<br />
; 2 bytes, 12 T-states<br />
inc hl<br />
dec hl<br />
<br />
; 2 bytes, 12 T-states<br />
jr $+2<br />
<br />
; 2 bytes, 21 T-states<br />
push af<br />
pop af<br />
<br />
; 2 bytes, 38 T-states<br />
ex (sp), hl<br />
ex (sp), hl<br />
</nowiki><br />
<br />
Notes:<br />
- there are many other instructions that you can use<br />
- beware that not all instructions preserve registers or flags<br />
- for delays between frames of games, you can use the halt instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune the effect of this delay mechanism you can alter the interrupt mask and interrupt timer speed beforehand (and possibly restore their values afterwards).<br />
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can unroll a loop several times instead of looping; this is frequently done in multiplication routines.<br />
This means you are spending memory to gain speed. Most of the time, though, you will prefer size to speed.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -> ~17 with n = 8<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
This unrolling of ldi also works with ldd, and with outi (which counts in b and signals the end through the Zero flag, so loop with jp nz instead).<br />
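<br />
When the size is a constant known at assembly time, one way to handle a size that is not a multiple of 8 is to enter the unrolled loop part-way through, so that the first pass only copies the remainder. This is a sketch that assumes the assembler accepts & as bitwise AND in expressions and that size is at least 1:<br />
 <nowiki><br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
jp loopldi + 2 * ((8 - (size & 7)) & 7) ; each ldi is 2 bytes : skip the ldi's not needed on the first pass<br />
</nowiki><br />
After that first partial pass bc is a multiple of 8, so the jp pe at the end of the loop behaves exactly as before.<br />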
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)<br />
<nowiki><br />
dec de ; on entry de holds the 16-bit loop count (0 means 65536)<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
#define inner_counter8(counter16) (((counter16) - 1) & 0xff) + 1<br />
#define outer_counter8(counter16) (((counter16) - 1) >> 8) + 1<br />
</nowiki><br />
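<br />
For instance, with a constant count of 600 iterations the macros above could be used as follows (a sketch : the label loop3 and the count are illustrative, and it assumes the assembler expands parameterized #define macros inside operands) :<br />
 <nowiki><br />
ld b, inner_counter8(600) ; ((600 - 1) & 0xff) + 1 = 88<br />
ld d, outer_counter8(600) ; ((600 - 1) >> 8) + 1 = 3<br />
loop3:<br />
; loop body here (must preserve b and d)<br />
<br />
djnz loop3<br />
dec d<br />
jp nz,loop3<br />
; 88 + 256 + 256 = 600 iterations<br />
</nowiki><br />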
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (odd):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Zero, Sign and Parity/Overflow flags untouched.<br />
<br />
More advanced ways of manipulating flags follow:<br />
<nowiki><br />
;get the zero flag in carry <br />
scf<br />
jr z,$+3<br />
ccf<br />
<br />
;Put carry flag into zero flag (destroys a : a = 0 if carry was set, $FF otherwise).<br />
ccf<br />
sbc a, a<br />
</nowiki><br />
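<br />
As an illustration of returning a status in a flag, here is a sketch of a search routine that reports failure through the carry flag. The routine name findByte, the label notFound and the calling convention are made up for this example:<br />
 <nowiki><br />
; in : a = byte to look for, hl = start of buffer, b = length (0 means 256)<br />
; out : carry reset and hl -> match if found, carry set if not found<br />
findByte:<br />
cp (hl)<br />
ret z ; found : cp left Zero set and Carry reset<br />
inc hl ; 16-bit inc and djnz do not touch the flags<br />
djnz findByte<br />
scf ; not found : set Carry (Zero is still reset)<br />
ret<br />
<br />
; caller<br />
call findByte<br />
jr c, notFound<br />
</nowiki><br />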
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-06T09:58:25Z<p>Fullmetalcoder: /* For the sake of size */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples are programs that use a lot of graphics or data, and graphics code for tile mapping, grayscale and 3D.<br />
<br />
== Registers and Memory ==<br />
Good Z80 algorithms generally use registers appropriately.<br />
It is also good practice to keep to a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/saved copy of hl/pointer to memory when hl and de are in use<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
and (ix + n)<br />
or (ix + n)<br />
xor (ix + n)<br />
<br />
; 3 bytes, 23 T-states<br />
inc (ix + n)<br />
dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
sll (ix + n) ; undocumented instruction, not accepted by every assembler<br />
srl (ix + n)<br />
set k, (ix + n) ; k is an immediate value in 0..7<br />
res k, (ix + n)<br />
<br />
; 4 bytes, 20 T-states<br />
bit k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 8 to 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16-bit register :<br />
 <nowiki><br />
; drops the top stack entry : wastes 1 byte and 2 T-states compared with pop, but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you save, and beyond 10 bytes (more than five pop) you start saving space as well.<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl), and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (compared with a plain di at the start and ei / ret at the end, it adds 5 bytes and 27 or 35 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick : it copies IFF2 into P/V, so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Register and memory use is very important in writing concise and fast Z80 code. Then comes general optimization.<br />
<br />
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Make sure the most common checks come first. In other words, check for the rarer, more special cases last.<br />
* Move special-case checks out of the main loop if they are not needed there.<br />
* When possible, accept a bigger setup overhead if it lets you move code out of the main loop.<br />
* When your code looks like it will not be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.<br />
* Rewriting code from scratch can bring new ideas (keep this for desperate situations because of all the work involved).<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and often smaller than code produced from higher-level languages, further optimization may not be needed at all.<br />
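<br />
As a sketch of moving a special-case check out of the main loop, compare the two hypothetical fill routines below. Both write $FF to the b bytes at hl if c is non-zero and $00 otherwise; the names fillSlow and fillFast are invented for this example:<br />
 <nowiki><br />
; the test on c is repeated on every pass<br />
fillSlow:<br />
ld a,c<br />
or a<br />
jr z, fsStore ; c = 0 : a is already $00<br />
ld a,$FF<br />
fsStore:<br />
ld (hl),a<br />
inc hl<br />
djnz fillSlow<br />
ret<br />
<br />
; the test on c is done once, only the store stays in the loop<br />
fillFast:<br />
ld a,c<br />
or a<br />
jr z, ffLoop ; c = 0 : a is already $00<br />
ld a,$FF<br />
ffLoop:<br />
ld (hl),a<br />
inc hl<br />
djnz ffLoop<br />
ret<br />
</nowiki><br />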
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
; -> always 13 bytes : smaller as soon as there are more than four variables to clear (but slower)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this if hl is not tied up (if all you need is the flags, the final ld a,(hl) can be dropped too):<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry<br />
sbc hl,de<br />
;try this<br />
ld de,-767<br />
add hl,de<br />
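; -> save 2 bytes and 8 T-states ! (but the flags differ : add hl,de leaves Zero and Sign untouched and its carry is the inverse of the one sbc would give)<br />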
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;doing it this way:<br />
cpl<br />
add a,N+1 ;This is because neg is practically equivalent to cpl \ inc a.<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
xor 2 ; xor with (value1 xor value2) makes a register alternate between the two values, here 2 and 0<br />
;code2<br />
djnz loop<br />
; -> save size and time depending on its use (code1 and code2 must preserve a)<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the return address pushed by the call is not used by the called routine. Example: some kind of inline vputs.<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
Size versus speed is the classic problem of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that doesn't use up most of the RAM.<br />
Speed can also be needed...<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target of a relative jump is out of reach (-128 to 127 bytes) and there is a jp to the same target nearby, do a relative jump to the absolute one. Example:<br />
<br />
<nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
jp somelabel<br />
;less than 128 bytes<br />
jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
</nowiki><br />
<br />
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it doesn't, so jp is faster when the jump is taken and jr is faster when it is not.<br />
<br />
<br />
'''Passing inline data'''<br />
When you call, pc + 3 (the address just after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, step past the data and jp (hl).<br />
<br />
<nowiki><br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
</nowiki><br />
<br />
This routine can be expanded to pass the coordinates where the text should appear.<br />
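 <br />
As a sketch of that expansion (assuming the penCol and penRow equates of ti83plus.inc, with penRow stored right after penCol; DispAt is a hypothetical name and the routine destroys de), the two coordinate bytes can be placed inline before the string:<br />
 <nowiki><br />
 call DispAt<br />
 .db 10,20 ; column, row<br />
 .db "This is some text",0<br />
 ret<br />
 <br />
DispAt:<br />
 pop hl ; hl points to the inline data<br />
 ld e,(hl) ; e = column<br />
 inc hl<br />
 ld d,(hl) ; d = row<br />
 inc hl ; hl points to the start of the string<br />
 ld (penCol),de ; stores e in penCol and d in penRow<br />
 bcall(_vputs) ; as in the routine above, hl ends up just past the terminating 0<br />
 jp (hl) ; resume execution after the inline data<br />
 </nowiki><br />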
<br />
'''Wasting time to delay'''<br />
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other slower but smaller instructions. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
ld a,(de) ;a doesn't need to be preserved because it will hold what the port has.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
NOP's take 4 T-states per byte while other instructions like ld a,(de) take 7 T-states per byte.<br />
Other delay instructions to keep in mind:<br />
<nowiki><br />
; 2 bytes, 12 T-states<br />
inc hl<br />
dec hl<br />
<br />
; 2 bytes, 21 T-states<br />
push af<br />
pop af<br />
<br />
; 2 bytes, 38 T-states<br />
ex (sp), hl<br />
ex (sp), hl<br />
</nowiki><br />
<br />
Notes:<br />
- there are many other instructions that you can use<br />
- beware that not all instructions preserve registers or flags<br />
- for delays between frames of games, you can use the halt instruction if interrupts are enabled. It makes the calculator enter low power mode until an interrupt is triggered. To fine-tune this delay mechanism you can alter the interrupt mask and the interrupt timer speed beforehand (and possibly restore their values afterwards).<br />
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can unroll a loop several times instead of looping; this is frequently used in multiplication routines.<br />
This trades memory for speed, so keep it for the places where speed really matters: most of the time you will prefer size to speed.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -> ~17 T-states per byte for n = 8<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
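This unrolling of ldi also works with ldd and outi (outi counts with b only and sets the Zero flag when b reaches zero, so loop with jp nz instead of jp pe).<br />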
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
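* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states, since djnz takes 13 T-states when it loops and the outer test only runs once every 256 iterations)<br />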
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
#define inner_counter8(counter16) (((counter16) - 1) & 0xff) + 1<br />
#define outer_counter8(counter16) (((counter16) - 1) >> 8) + 1<br />
</nowiki><br />
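 <br />
As a worked example (a sketch; the count of 1000 iterations and the label name are arbitrary assumptions), the macros give an inner counter of ((1000 - 1) & 0xff) + 1 = 232 and an outer counter of ((1000 - 1) >> 8) + 1 = 4, so the body runs 232 + 3 * 256 = 1000 times:<br />
 <nowiki><br />
 ld b, 232 ; inner counter, used for the first pass only<br />
 ld d, 4 ; outer counter (number of passes)<br />
loop1000:<br />
 ; loop body here (must preserve b and d)<br />
 <br />
 djnz loop1000 ; 232 iterations on the first pass, then 256 per pass because b restarts at 0<br />
 dec d<br />
 jp nz,loop1000<br />
 </nowiki><br />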
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set or reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (sub reports overflow in P/V, and a - a never overflows):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
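 <br />
A minimal sketch of such a status output, using a hypothetical routine SafeDiv that reports a zero divisor through the carry flag (the names and the register assignment are assumptions made for illustration):<br />
 <nowiki><br />
SafeDiv: ; hypothetical routine, e = divisor<br />
 ld a,e<br />
 or a ; Zero flag set if the divisor is 0<br />
 scf ; scf leaves the Zero flag untouched<br />
 ret z ; error: return with carry set<br />
 <br />
 ; ... do the actual division here ...<br />
 <br />
 or a ; success: return with carry reset<br />
 ret<br />
 <br />
;caller<br />
 call SafeDiv<br />
 jr c,div_error ; div_error is a hypothetical error handler<br />
 </nowiki><br />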
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap, so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched (they only change carry, H and N).<br />
<br />
More advanced ways of manipulating flags follow:<br />
<nowiki><br />
;get the zero flag in carry <br />
scf<br />
jr z,$+3<br />
ccf<br />
<br />
;Put carry flag into zero flag (destroys a and leaves the carry inverted).<br />
ccf<br />
sbc a, a<br />
</nowiki><br />
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-06T09:54:22Z<p>Fullmetalcoder: /* Formatting fixes */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or need to make your game smaller so that it fits on the calculator. Examples: programs that use a lot of graphics or data, and graphics code for mapping, grayscale and 3D graphics.<br />
<br />
== Registers and Memory ==<br />
Generally, good algorithms on the z80 use registers appropriately.<br />
It is also good practice to keep a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/save copy of hl/pointer to memory when hl and de are being used<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
 and (ix + n)<br />
 or (ix + n)<br />
 xor (ix + n)<br />
 <br />
; 3 bytes, 23 T-states<br />
 inc (ix + n)<br />
 dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
 sll (ix + n) ; undocumented instruction<br />
 srl (ix + n)<br />
 bit k, (ix + n) ; k is an immediate value in 0..7 ; bit only takes 20 T-states<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (beyond 10 bytes dropped, that is more than 5 pop).<br />
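 <br />
Putting allocation, access and deallocation together, here is a sketch of a routine with 4 bytes of local storage addressed through ix (the routine name and the values stored are illustrative assumptions):<br />
 <nowiki><br />
WithLocals:<br />
 push ix ; preserve the caller's ix<br />
 ld ix,-4<br />
 add ix,sp<br />
 ld sp,ix ; 4 bytes allocated, ix points to the first local<br />
 <br />
 ld (ix + 0),5 ; initialize a local<br />
 inc (ix + 0) ; modify it in place, no register needed<br />
 ld a,(ix + 0) ; a = 6<br />
 <br />
 ld hl,4<br />
 add hl,sp<br />
 ld sp,hl ; drop the 4 bytes (destroys hl)<br />
 pop ix<br />
 ret<br />
 </nowiki><br />
Because sp is moved below the locals before they are used, an interrupt firing in the middle of the routine pushes its data under them and leaves them intact.<br />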
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it is not. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately, this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):<br />
<nowiki><br />
 ld a, i ; this is the core of the trick: it copies IFF2 into P/V, so P/V is set if and only if interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Register and memory use is very important in writing concise and fast z80 code. Then comes general optimization.<br />
 <br />
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Make sure the most common checks come first. In other words, check for the more special and rare cases last.<br />
* Move special-case checks out of the main loop if they are not needed there.<br />
* When possible, if you can afford a bigger setup overhead in exchange for moving code out of the main loop, do it.<br />
* When it seems that your code will not be efficient enough even after optimization, try another approach or algorithm. Search for other algorithms on Wikipedia, for instance.<br />
* Rewriting code from scratch can bring new ideas (use this in desperate situations, because of all the work needed to rewrite it).<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast, and sometimes even smaller than higher-level languages, further optimization may not be needed.<br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
; -> 13 bytes whatever the number of variables, against 3 bytes per ld (dataX),a : saves 3 bytes here, but it is slower<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this ;if hl is not tied up and all you do is check flags, use indirection:<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry<br />
sbc hl,de<br />
;try this<br />
ld de,-767<br />
add hl,de<br />
; -> save 2 bytes and 8 T-states (but add hl,de only updates the carry, not the Zero or Sign flags)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;doing it this way:<br />
cpl<br />
add a,N+1 ;This is because neg is practically equivalent to cpl \ inc a.<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
 xor 2 ; xor toggles bit 1, so a alternates between 2 and 0 each time this runs (xor $01 would give 2 and 3)<br />
;code2<br />
djnz loop<br />
; -> save size and time depending on its use<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the called routine does not use the return address pushed by call (an inline-data vputs routine, for example, does use it).<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
The trade-off between size and speed is the classical problem of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters, because ASM is generally fast enough and it is nice to give the user a smaller program that does not use up most of the RAM.<br />
Sometimes, however, speed is what is needed...<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (-128 to 127 bytes) and there is a jp to that target nearby, do a relative jump to the absolute one. Example:<br />
<br />
<nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
jp somelabel<br />
;less than 128 bytes<br />
 jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
</nowiki><br />
<br />
* Relative jumps are 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, while jr takes 12 T-states when the jump occurs and 7 when it does not: jp is faster when the jump is taken and jr is faster when it is not.<br />
<br />
<br />
'''Passing inline data'''<br />
When you call, the pc + 3 (the address after the call) is pushed. You can pop it and use it as a pointer to data. A very nifty use is with strings. To return, move the pointer past the data and jp (hl).<br />
<br />
<nowiki><br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
</nowiki><br />
<br />
This routine can be expanded to pass the coordinates where the text should appear.<br />
<br />
'''Wasting time to delay'''<br />
There are those funny times when you need some delay between operations such as reads/writes to ports '''''and there is nothing useful to do'''''. Because nop's are not very size friendly, think of other slower but smaller instructions. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
ld a,(de) ;a doesn't need to be preserved because it will hold what the port has.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
NOP's take 4 T-states per byte while other instructions like ld a,(de) take 7 T-states per byte. Other delay instructions to keep in mind:<br />
<nowiki><br />
push af<br />
pop af<br />
; 2 bytes, 21 T-states (~10 T-states per byte)<br />
inc hl<br />
dec hl<br />
; 2 bytes, 12 T-states (6 T-states per byte)<br />
</nowiki><br />
<br />
Notes:<br />
- there are many other instructions that you can use<br />
- beware that not all instructions preserve registers or flags<br />
- for delays between frames of games, you can use the halt instruction if interrupts are enabled. This trick has to be used with care...<br />
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can unroll a loop several times instead of looping; this is frequently used in multiplication routines.<br />
This trades memory for speed, so keep it for the places where speed really matters: most of the time you will prefer size to speed.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : (16 * n + 10) / n T-states per byte copied when unrolling n times -> ~17 T-states per byte for n = 8<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
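This unrolling of ldi also works with ldd and outi (outi counts with b only and sets the Zero flag when b reaches zero, so loop with jp nz instead of jp pe).<br />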
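 <br />
A sketch of the same idea applied to the outi case mentioned above. outi counts with b only and sets the Zero flag when b reaches zero; the port number and the names are hypothetical, the byte count is assumed to be a multiple of 4, and any delay the hardware needs between writes is ignored:<br />
 <nowiki><br />
 ld hl,src<br />
 ld b,size ; number of bytes, a multiple of 4 (0 means 256)<br />
 ld c,PORT ; hypothetical port number<br />
loopouti:<br />
 outi<br />
 outi<br />
 outi<br />
 outi<br />
 jp nz,loopouti ; the last outi sets the Zero flag once b reaches 0<br />
 </nowiki><br />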
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
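* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states, since djnz takes 13 T-states when it loops and the outer test only runs once every 256 iterations)<br />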
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
#define inner_counter8(counter16) (((counter16) - 1) & 0xff) + 1<br />
#define outer_counter8(counter16) (((counter16) - 1) >> 8) + 1<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set or reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (sub reports overflow in P/V, and a - a never overflows):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap, so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched (they only change carry, H and N).<br />
<br />
More advanced ways of manipulating flags follow:<br />
<nowiki><br />
;get the zero flag in carry <br />
scf<br />
jr z,$+3<br />
ccf<br />
<br />
;Put carry flag into zero flag (destroys a and leaves the carry inverted).<br />
ccf<br />
sbc a, a<br />
</nowiki><br />
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-06T09:52:01Z<p>Fullmetalcoder: /* Others */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or need to make your game smaller so that it fits on the calculator. Examples: programs that use a lot of graphics or data, and graphics code for mapping, grayscale and 3D graphics.<br />
<br />
== Registers and Memory ==<br />
Generally, good algorithms on the z80 use registers appropriately.<br />
It is also good practice to keep a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/save copy of hl/pointer to memory when hl and de are being used<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register):<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
 and (ix + n)<br />
 or (ix + n)<br />
 xor (ix + n)<br />
 <br />
; 3 bytes, 23 T-states<br />
 inc (ix + n)<br />
 dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
 sll (ix + n) ; undocumented instruction<br />
 srl (ix + n)<br />
 bit k, (ix + n) ; k is an immediate value in 0..7 ; bit only takes 20 T-states<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register:<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (beyond 10 bytes dropped, that is more than 5 pop).<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers: af', bc', de' and hl'.<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions:<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it is not. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately, this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine):<br />
<nowiki><br />
 ld a, i ; this is the core of the trick: it copies IFF2 into P/V, so P/V is set if and only if interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Register and memory use is very important in writing concise and fast z80 code. Then comes general optimization.<br />
 <br />
First, try to optimize the most frequently executed code: subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Make sure the most common checks come first; in other words, test for the rare, special cases last.<br />
* Move special-case checks out of the main loop if they are not needed there.<br />
* Whenever you can afford a bigger setup overhead in exchange for getting code out of the main loop, do it.<br />
* When your code does not look like it will be efficient enough even after optimization, try another approach or algorithm. Wikipedia, for instance, is a good place to look for alternative algorithms.<br />
* Rewriting code from scratch can bring new ideas (keep this for desperate situations, given the amount of work it takes).<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and often smaller than code written in higher level languages, further optimization may not be needed at all.<br />
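<br />
As a minimal sketch of the second and third points (the labels buffer and mask are hypothetical, chosen only for illustration), here is a load that never changes inside the loop being moved out of it, at the price of a slightly bigger setup and of tying up c :<br />
 <nowiki><br />
; Slower : the mask is reloaded from memory on every pass (13 T-states each time)<br />
    ld hl,buffer<br />
    ld b,64<br />
slow:<br />
    ld a,(mask)<br />
    and (hl)<br />
    ld (hl),a<br />
    inc hl<br />
    djnz slow<br />
<br />
; Faster : the mask is fetched once and kept in c for the whole loop<br />
    ld hl,buffer<br />
    ld a,(mask)<br />
    ld c,a<br />
    ld b,64<br />
fast:<br />
    ld a,c          ; 4 T-states instead of 13<br />
    and (hl)<br />
    ld (hl),a<br />
    inc hl<br />
    djnz fast<br />
 </nowiki><br />
The second version is one byte longer but saves 9 T-states on each of the 64 passes.<br />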
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
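; -> 13 bytes whatever the number of locations, while the first way grows by 3 bytes for every ld (dataX),a : saves 3 bytes here, but it is slower<br />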
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this ;if hl is not tied up and all you do is check flags, use indirection:<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry<br />
sbc hl,de<br />
;try this<br />
ld de,-767<br />
add hl,de<br />
; -> save 2 bytes and 8 T-states (beware : carry comes out inverted compared to sbc and the Zero/Sign flags are not updated)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;doing it this way:<br />
cpl<br />
add a,N+1 ;This is because neg is practically equivalent to cpl \ inc a.<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
    xor 2 ; the trick : xor with (2 xor 0) makes a alternate between the two values each time it is executed (code1 and code2 must preserve a)<br />
;code2<br />
djnz loop<br />
; -> save size and time depending on its use<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the called routine does not make use of its return address (e.g. not with some kind of inline vputs, which reads its data from there).<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
Size versus speed is the classic optimization trade-off in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters : ASM is generally fast enough, and it is nice to give the user a smaller program that doesn't use up most of the RAM.<br />
Sometimes, however, speed is what you need...<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (-128 to 127 bytes) but there is an absolute jump (jp) to that same target nearby, do a relative jump to the absolute one. Example:<br />
<br />
 <nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
 jp somelabel<br />
;less than 128 bytes<br />
 jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
 </nowiki><br />
<br />
* Relative jumps take 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump occurs (a taken jr takes 12 T-states), while a conditional jr is faster when it doesn't (7 T-states).<br />
<br />
<br />
'''Passing inline data'''<br />
When you execute a call, the address of the instruction following it (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, move the pointer past the data and jp (hl).<br />
<br />
<nowiki><br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
</nowiki><br />
<br />
This routine can be expanded to pass the coordinates where the text should appear.<br />
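<br />
A possible sketch of that expansion follows. It assumes the usual ti83plus.inc names penCol and penRow for the pen coordinates, and it relies, like the routine above, on _vputs leaving hl just past the terminating 0; the label DispAt and the coordinates are made up for the example :<br />
 <nowiki><br />
    call DispAt<br />
    .db 10, 20              ; hypothetical coordinates : pen column, then pen row<br />
    .db "This is some text",0<br />
    ret<br />
<br />
DispAt:<br />
    pop hl                  ; hl -> inline data<br />
    ld a,(hl)<br />
    ld (penCol),a<br />
    inc hl<br />
    ld a,(hl)<br />
    ld (penRow),a<br />
    inc hl                  ; hl -> string<br />
    bcall(_vputs)           ; assumed to leave hl just past the terminating 0<br />
    jp (hl)                 ; resume execution right after the inline data<br />
 </nowiki><br />
Each call site then costs 2 extra data bytes instead of the two loads and stores that setting the coordinates by hand would need.<br />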
<br />
'''Wasting time to delay'''<br />
Sometimes you need a short delay between operations such as reads/writes to ports '''''and there is nothing useful to do in the meantime'''''. Because nop's are not very size friendly, think of instructions that burn more T-states per byte instead. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
ld a,(de) ;a doesn't need to be preserved because it will hold what the port has.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
NOP's take 4 T-states per byte, while other instructions like ld a,(de) take 7 T-states per byte. Other delay instructions to keep in mind:<br />
 <nowiki><br />
 push af<br />
 pop af<br />
; ~10 T-states per byte (21 T-states for 2 bytes)<br />
 inc hl<br />
 dec hl<br />
; 6 T-states per byte<br />
 </nowiki><br />
<br />
Notes:<br />
* there are many other instructions that you can use<br />
* beware that not all instructions preserve registers or flags<br />
* for delays between frames of a game, you can use the halt instruction, provided interrupts are enabled. This trick has to be used with care...<br />
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.<br />
This means you are spending memory to gain speed, so keep it for the places where speed matters : most of the time size is preferable to speed.<br />
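<br />
As an illustration, here is a sketch of a classic 8 bit by 8 bit shift-and-add multiplication, first as a loop and then unrolled. The label names are arbitrary and the routine is only meant to show the transformation :<br />
 <nowiki><br />
; hl = h * e, looped version : 12 bytes, but a djnz is paid on each of the 8 passes<br />
mul_h_e:<br />
    ld d,0<br />
    ld l,d<br />
    ld b,8<br />
mul_loop:<br />
    add hl,hl       ; shift the next bit of the multiplier into carry<br />
    jr nc,mul_skip<br />
    add hl,de       ; add the multiplicand when that bit was set<br />
mul_skip:<br />
    djnz mul_loop<br />
    ret<br />
<br />
; same routine unrolled : 36 bytes, no loop overhead and b is left untouched<br />
mul_h_e_unrolled:<br />
    ld d,0<br />
    ld l,d<br />
    add hl,hl<br />
    jr nc,$+3       ; skips the 1 byte add hl,de<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    add hl,hl<br />
    jr nc,$+3<br />
    add hl,de<br />
    ret<br />
 </nowiki><br />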
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : 16 + 10 / n T-states per byte copied when unrolling n times -> ~17 T-states per byte when unrolling 8 times<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
The same unrolling also works with ldd and, using b as the counter and a jp nz at the end of the loop, with outi.<br />
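<br />
The "small trick" mentioned in the comments for sizes that are not a multiple of 8 can be sketched as follows. This is one possible approach, not the only one : it patches the displacement of a jr at run time, so it only works for code running from RAM, and it assumes bc is not zero on entry. The labels are arbitrary :<br />
 <nowiki><br />
; copies bc bytes (bc > 0) from hl to de with ldi unrolled 8 times<br />
copy_unrolled:<br />
    xor a<br />
    sub c                   ; a = -size (low byte)<br />
    and 7                   ; a = number of ldi to skip on the first pass (0..7)<br />
    add a,a                 ; each ldi is 2 bytes long ; Z is set when there is nothing to skip<br />
    ld (copy_entry+1),a     ; patch the displacement of the jr below (flags are preserved)<br />
copy_entry:<br />
    jr nz,copy_loop         ; displacement is 0 as assembled, patched at run time<br />
copy_loop:<br />
    ldi<br />
    ldi<br />
    ldi<br />
    ldi<br />
    ldi<br />
    ldi<br />
    ldi<br />
    ldi<br />
    jp pe,copy_loop<br />
    ret<br />
 </nowiki><br />
With a size of 11, for instance, 5 ldi are skipped so the first pass copies 3 bytes, which leaves bc at 8 for one full pass.<br />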
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)<br />
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
 #define inner_counter8(counter16) ((((counter16) - 1) & 0xff) + 1)<br />
 #define outer_counter8(counter16) ((((counter16) - 1) >> 8) + 1)<br />
</nowiki><br />
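<br />
A short usage sketch, filling 600 bytes with $FF (the label buffer and the count are hypothetical) :<br />
 <nowiki><br />
    ld hl,buffer<br />
    ld a,$FF<br />
    ld b,inner_counter8(600)    ; (599 & 0xff) + 1 = 88 passes before the first dec d<br />
    ld d,outer_counter8(600)    ; (599 >> 8) + 1 = 3 runs of the inner loop : 88 + 256 + 256 = 600<br />
fill:<br />
    ld (hl),a<br />
    inc hl<br />
    djnz fill<br />
    dec d<br />
    jp nz,fill<br />
 </nowiki><br />
Note that for counts whose low byte is zero the inner macro evaluates to 256, which some assemblers may refuse as an 8bit operand; masking the result with & 0xff gives the 0 that djnz treats as 256.<br />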
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry and Sign flags)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (odd):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use these, remember that the flag (re)setting tricks frequently<br />
overlap, so a special combination of flags might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Zero, Sign and P/V flags untouched.<br />
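<br />
As a small sketch of a routine that reports its result through a flag rather than a register (the routine and label names are made up for the example) :<br />
 <nowiki><br />
; is a an ASCII digit ? returns carry reset if it is, carry set if it is not ; a is preserved<br />
is_digit:<br />
    cp '0'<br />
    ret c           ; a < '0' : carry is already set<br />
    cp '9'+1        ; carry set here means a <= '9', i.e. a digit...<br />
    ccf             ; ...so invert it to get the "carry set = not a digit" convention<br />
    ret<br />
<br />
; caller side<br />
    ld a,(hl)<br />
    call is_digit<br />
    jr c,not_a_digit    ; hypothetical label handling the error case<br />
 </nowiki><br />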
<br />
More advanced ways of manipulating flags follow:<br />
<nowiki><br />
;get the zero flag in carry <br />
scf<br />
jr z,$+3<br />
ccf<br />
<br />
;Put carry flag into zero flag (destroys a, leaves carry inverted).<br />
ccf<br />
sbc a, a<br />
</nowiki><br />
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-06T09:10:24Z<p>Fullmetalcoder: /* Setting flags */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples are programs that are heavy on graphics or data, and graphics code for tile mapping, grayscale and 3D.<br />
<br />
== Registers and Memory ==<br />
Good Z80 algorithms generally use registers in an appropriate way.<br />
It is also good practice to keep to a convention and plan how you are going to use the registers.<br />
<br />
General use of registers:<br />
* a - 8-bit accumulator<br />
* b - counter<br />
<br />
* hl - 16-bit accumulator/pointer to a memory address<br />
* de - pointer to a destination memory address<br />
* bc - 16-bit counter<br />
* ix - index register/backup copy of hl/pointer to memory when hl and de are already in use<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally trash registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
 cp (ix + n)<br />
<br />
 sub (ix + n)<br />
 sbc a, (ix + n)<br />
 add a, (ix + n)<br />
 adc a, (ix + n)<br />
<br />
 and (ix + n)<br />
 or (ix + n)<br />
 xor (ix + n)<br />
<br />
; 3 bytes, 23 T-states<br />
 inc (ix + n)<br />
 dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
 rl (ix + n)<br />
 rr (ix + n)<br />
 rlc (ix + n)<br />
 rrc (ix + n)<br />
 sla (ix + n)<br />
 sra (ix + n)<br />
 sll (ix + n) ; undocumented instruction<br />
 srl (ix + n)<br />
 set k, (ix + n) ; k is an immediate value in 0..7<br />
 res k, (ix + n)<br />
<br />
; 4 bytes, 20 T-states<br />
 bit k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register : depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents of these instructions are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack while preserving all registers, remember that sp can be incremented/decremented like any 16bit register :<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).<br />
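<br />
Putting allocation, access and deallocation together, a routine with a few stack-allocated locals might look like the following sketch. The routine name and the layout of the locals are hypothetical, and the body is assumed to leave the stack balanced :<br />
 <nowiki><br />
with_locals:<br />
    push ix             ; preserve the caller's ix<br />
    ld ix,-3<br />
    add ix,sp<br />
    ld sp,ix            ; 3 bytes of locals at (ix + 0) .. (ix + 2)<br />
<br />
    ld (ix + 0),5       ; initialise a local counter<br />
    ld (ix + 1),a       ; stash a without touching any other register<br />
    ; ... body, free to use a, bc, de and hl ...<br />
    dec (ix + 0)        ; update a local in place<br />
    ld a,(ix + 1)       ; get the stashed value back<br />
<br />
    ld hl,3<br />
    add hl,sp<br />
    ld sp,hl            ; release the 3 bytes (destroys hl)<br />
    pop ix<br />
    ret<br />
 </nowiki><br />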
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== General Algorithms ==<br />
<br />
Good use of registers and memory is very important for writing concise and fast Z80 code. General optimization comes next.<br />
<br />
First, optimize the code that runs most often : frequently called subroutines and large loops. Finding the bottleneck and fixing it is enough for many programs.<br />
<br />
A list of things to keep in mind:<br />
* Make sure the most common checks come first; in other words, test for the rare, special cases last.<br />
* Move special-case checks out of the main loop if they are not needed there.<br />
* Whenever you can afford a bigger setup overhead in exchange for getting code out of the main loop, do it.<br />
* When your code does not look like it will be efficient enough even after optimization, try another approach or algorithm. Wikipedia, for instance, is a good place to look for alternative algorithms.<br />
* Rewriting code from scratch can bring new ideas (keep this for desperate situations, given the amount of work it takes).<br />
* Remember that it is almost always better to leave optimization to the end. Optimizing too early can bring headaches with crashes and debugging. And because ASM is very fast and often smaller than code written in higher level languages, further optimization may not be needed at all.<br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peep-hole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
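; -> 13 bytes whatever the number of locations, while the first way grows by 3 bytes for every ld (dataX),a : saves 3 bytes here, but it is slower<br />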
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this ;if hl is not tied up and all you do is check flags, use indirection:<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
; -> save 2 bytes and 2 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,767<br />
or a ;reset carry<br />
sbc hl,de<br />
;try this<br />
ld de,-767<br />
add hl,de<br />
; -> save 2 bytes and 8 T-states (beware : carry comes out inverted compared to sbc and the Zero/Sign flags are not updated)<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
; -> save 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
; -> save 1 byte and 5 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
neg<br />
add a,N ;you want to calculate N-A<br />
;doing it this way:<br />
cpl<br />
add a,N+1 ;This is because neg is practically equivalent to cpl \ inc a.<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
djnz loop<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
    xor 2 ; the trick : xor with (2 xor 0) makes a alternate between the two values each time it is executed (code1 and code2 must preserve a)<br />
;code2<br />
djnz loop<br />
; -> save size and time depending on its use<br />
</nowiki><br />
<br />
==== Others ====<br />
<br />
Calling and returning...<br />
<nowiki><br />
;Instead of<br />
call xxxx<br />
ret<br />
;try this<br />
jp xxxx<br />
;only do this if the called routine does not make use of its return address (e.g. not with some kind of inline vputs, which reads its data from there).<br />
; -> save 1 byte and 17 T-states<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
Size versus speed is the classic optimization trade-off in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters : ASM is generally fast enough, and it is nice to give the user a smaller program that doesn't use up most of the RAM.<br />
Sometimes, however, speed is what you need...<br />
<br />
==== For the sake of size ====<br />
<br />
* Use relative jumps (jr label) whenever possible. When the target is out of reach of a relative jump (-128 to 127 bytes) but there is an absolute jump (jp) to that same target nearby, do a relative jump to the absolute one. Example:<br />
<br />
 <nowiki><br />
;lots of code (more than 128 bytes worth of code)<br />
somelabel2:<br />
 jp somelabel<br />
;less than 128 bytes<br />
 jr somelabel2 ;instead of an absolute jump directly to somelabel, jump to a jump to somelabel.<br />
 </nowiki><br />
<br />
* Relative jumps take 2 bytes and absolute jumps 3. In terms of speed, jp always takes 10 T-states, so it is faster when the jump occurs (a taken jr takes 12 T-states), while a conditional jr is faster when it doesn't (7 T-states).<br />
<br />
<br />
'''Passing inline data'''<br />
When you execute a call, the address of the instruction following it (pc + 3) is pushed on the stack. You can pop it and use it as a pointer to data placed right after the call. A very nifty use is with strings. To return, move the pointer past the data and jp (hl).<br />
<br />
<nowiki><br />
call Disp<br />
.db "This is some text",0<br />
ret<br />
;Not a speed optimization, but it eliminates 2-byte pointers, since it just uses the call's return address.<br />
;It also heavily disturbs disassembly.<br />
Disp:<br />
pop hl<br />
bcall(_vputs)<br />
jp (hl)<br />
</nowiki><br />
<br />
This routine can be expanded to pass the coordinates where the text should appear.<br />
<br />
'''Wasting time to delay'''<br />
Sometimes you need a short delay between operations such as reads/writes to ports '''''and there is nothing useful to do in the meantime'''''. Because nop's are not very size friendly, think of instructions that burn more T-states per byte instead. Example:<br />
<br />
<nowiki><br />
;Instead of<br />
ld a,KEY_GROUP<br />
out (1),a<br />
nop<br />
nop<br />
in a,(1)<br />
;Try this:<br />
ld a,KEY_GROUP<br />
out (1),a<br />
ld a,(de) ;a doesn't need to be preserved because it will hold what the port has.<br />
in a,(1)<br />
; -> save 1 byte and 1 T-state (well 1 T-state less is almost the same time)<br />
</nowiki><br />
<br />
NOP's take 4 T-states per byte, while other instructions like ld a,(de) take 7 T-states per byte. Other delay instructions to keep in mind:<br />
 <nowiki><br />
 push af<br />
 pop af<br />
; ~10 T-states per byte (21 T-states for 2 bytes)<br />
 inc hl<br />
 dec hl<br />
; 6 T-states per byte<br />
 </nowiki><br />
<br />
Notes:<br />
* there are many other instructions that you can use<br />
* beware that not all instructions preserve registers or flags<br />
* for delays between frames of a game, you can use the halt instruction, provided interrupts are enabled. This trick has to be used with care...<br />
<br />
==== Unrolling code ====<br />
<br />
'''General Unrolling'''<br />
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.<br />
This means you are spending memory to gain speed, so keep it for the places where speed matters : most of the time size is preferable to speed.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
; Unrolled : 16 + 10 / n T-states per byte copied when unrolling n times -> ~17 T-states per byte when unrolling 8 times<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi: ;you can use this entry for a call<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
; ret if this is a subroutine and use the unrolled ldi's with a call.<br />
</nowiki><br />
The same unrolling also works with ldd and, using b as the counter and a jp nz at the end of the loop, with outi.<br />
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states)<br />
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
 #define inner_counter8(counter16) ((((counter16) - 1) & 0xff) + 1)<br />
 #define outer_counter8(counter16) ((((counter16) - 1) >> 8) + 1)<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry and Sign flags)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
;Set parity/overflow (even):<br />
xor a<br />
<br />
;Reset parity/overflow (odd):<br />
sub a<br />
<br />
;Set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
;Reset half carry (hardly ever useful but still...)<br />
or a<br />
<br />
;Set bit 5 of f:<br />
or %00100000<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use these, remember that the flag (re)setting tricks frequently<br />
overlap, so a special combination of flags might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Zero, Sign and P/V flags untouched.<br />
<br />
More advanced ways of manipulating flags follow:<br />
<nowiki><br />
;get the zero flag in carry <br />
scf<br />
jr z,$+3<br />
ccf<br />
<br />
;Put carry flag into zero flag (destroys a, leaves carry inverted).<br />
ccf<br />
sbc a, a<br />
</nowiki><br />
<br />
== Related topics ==<br />
* [http://www.junemann.nl/maxcoderz/viewtopic.php?f=5&t=675 MaxCoderz TI-ASM optimization]<br />
* ticalc archives: [http://www.ticalc.org/archives/files/fileinfo/108/10821.html 1] [http://www.ticalc.org/archives/files/fileinfo/285/28502.html 2]<br />
* [http://www.ballyalley.com/ml/z80_docs/z80_docs.html Bally Alley Z80 Machine Language Documentation]<br />
* [http://map.grauw.nl/articles/fast_loops.php Fast loops in MSX Assembly Page]<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Square_rootZ80 Routines:Math:Square root2009-11-05T17:14:53Z<p>Fullmetalcoder: </p>
<hr />
<div>[[Category:Z80 Routines:Math|Square root]]<br />
[[Category:Z80 Routines|Square root]]<br />
<br />
==Size Optimization==<br />
This version is size optimized : it subtracts successive odd numbers from HL (which amounts to comparing HL against every perfect square in turn) until a square larger than HL is found. It is obviously slower, but it gets the job done in only 12 bytes.<br />
<nowiki>;-------------------------------<br />
;Square Root<br />
;Inputs:<br />
;HL = number to be square rooted<br />
;Outputs:<br />
;A = square root<br />
<br />
sqrt:<br />
ld a,$ff<br />
ld de,1<br />
sqrtloop:<br />
inc a<br />
dec e<br />
dec de<br />
add hl,de<br />
jr c,sqrtloop<br />
ret </nowiki><br />
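<br />
The routine relies on the fact that the sum of the first n odd numbers is n squared, so the number of odd numbers that can be subtracted from HL without borrowing is its integer square root. A short usage sketch, traced by hand :<br />
 <nowiki><br />
    ld hl,10<br />
    call sqrt       ; subtracts 1, 3 and 5 (leaving 9, 6 and 1), then the attempt to subtract 7 borrows and ends the loop<br />
                    ; -> a = 3<br />
 </nowiki><br />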
<br />
<br />
==Speed Optimization==<br />
This version uses the high school method of finding a square root and so it is much faster, running in about 850 T-states. Unfortunately it requires 180 bytes and is quite obfuscated.<br />
<nowiki>;-------------------------------<br />
;Square Root<br />
;Inputs:<br />
;DE = number to be square rooted<br />
;Outputs:<br />
;A = square root<br />
<br />
sqrt:<br />
xor a<br />
ld h,a<br />
ld l,a<br />
ld b,a<br />
rl d<br />
rl l<br />
rl d<br />
rl l<br />
ld c,1<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl d<br />
rl l<br />
rl d<br />
rl l<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl d<br />
rl l<br />
rl d<br />
rl l<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl d<br />
rl l<br />
rl d<br />
rl l<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl e<br />
adc hl,hl<br />
rl e<br />
adc hl,hl<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl e<br />
adc hl,hl<br />
rl e<br />
adc hl,hl<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl e<br />
adc hl,hl<br />
rl e<br />
adc hl,hl<br />
ld c,a<br />
scf<br />
rl c<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
add a,a<br />
rl e<br />
adc hl,hl<br />
rl e<br />
adc hl,hl<br />
ld c,a<br />
scf<br />
rl c<br />
rl b<br />
sbc hl,bc<br />
jp c,$+3+2+1<br />
sbc hl,bc<br />
inc a<br />
add hl,bc<br />
ret</nowiki><br />
<br />
<br />
==Balanced Optimization==<br />
This version is a balance between speed and size. It also uses the high school method, runs in under 1200 T-states and takes only 41 bytes.<br />
<nowiki>;-------------------------------<br />
;Square Root<br />
;Inputs:<br />
;DE = number to be square rooted<br />
;Outputs:<br />
;A = square root<br />
<br />
Sqrt:<br />
ld hl,0<br />
ld c,l<br />
ld b,h<br />
ld a,8<br />
Sqrtloop:<br />
sla e<br />
rl d<br />
adc hl,hl<br />
sla e<br />
rl d<br />
adc hl,hl<br />
scf ;Can be optimised<br />
rl c ;with SL1 instruction<br />
rl b<br />
sbc hl,bc<br />
jr nc,Sqrtaddbit<br />
add hl,bc<br />
dec c<br />
Sqrtaddbit:<br />
inc c<br />
res 0,c<br />
dec a<br />
jr nz,Sqrtloop<br />
ld a,c<br />
rr b<br />
rra<br />
ret</nowiki><br />
<br />
== Presumably the best ==<br />
<br />
This code was found on z80 bits and has the advantage of being both faster than all three versions above and smaller than the last two: it runs in under 720 T-states (under 640 if fully unrolled) and takes a mere 29 bytes. On the other hand, it takes a somewhat unconventional input: it computes the square root of the 16-bit number formed by L (high byte) and A (low byte) and places the result in D.<br />
<nowiki><br />
sqrt_la:<br />
ld de, 0040h ; 40h appends "01" to D<br />
ld h, d<br />
<br />
ld b, 7<br />
<br />
; need to clear the carry beforehand<br />
or a<br />
<br />
_loop:<br />
sbc hl, de<br />
jr nc, $+3<br />
add hl, de<br />
ccf<br />
rl d<br />
rla<br />
adc hl, hl<br />
rla<br />
adc hl, hl<br />
<br />
djnz _loop<br />
<br />
sbc hl, de ; optimised last iteration<br />
ccf<br />
rl d<br />
<br />
ret<br />
</nowiki><br />
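<br />
As a sketch of how the unusual input can be handled, the following hypothetical wrapper (the name sqrt_hl is not part of the original code) moves a 16-bit value from HL into L and A before jumping to the routine. It assumes, as the code suggests, that L holds the high byte and A the low byte.<br />
 <nowiki><br />
; input : hl -> number to be square rooted<br />
; output : d -> square root<br />
; destroys : a, b, e, hl<br />
sqrt_hl:<br />
    ld  a, l        ; low byte goes in a<br />
    ld  l, h        ; high byte goes in l (h is cleared by sqrt_la itself)<br />
    jp  sqrt_la     ; tail call : sqrt_la returns directly to our caller<br />
</nowiki><br />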
<br />
<br />
==Other Options==<br />
A binary search of a table of squares would yield much better best-case times, with worst-case times similar to those of the high school method. However, it would also require a 512-byte table, making it significantly larger than the other routines. Of course, the table could also serve as a fast squaring method.<br />
<br />
== Credits and Contributions ==<br />
* '''James Montelongo'''<br />
* '''Milos "baze" Bazelides''' (or possibly one of the contributors of [http://baze.au.com/misc/z80bits.html z80bits])</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Category_talk:Z80_RoutinesCategory talk:Z80 Routines2009-11-05T17:00:28Z<p>Fullmetalcoder: /* Contribute */</p>
<hr />
<div>Something's wrong with the redirect here, it's not showing the subcategories! --[[User:Dwedit|Dwedit]] 10:31, 28 Mar 2005 (PST)<br />
<br />
Fixed. There's a mediawiki issue with redirects to categories apparently. So the links now go directly to the category pages. --[[User:Dan Englender|Dan Englender]] 10:46, 28 Mar 2005 (PST)<br />
<br />
== Contribute ==<br />
<br />
It is predicted by me:<br />
* NewLine<br />
* HomeUp<br />
* DispHL (done)<br />
* DispA (done)<br />
* getpixel (done)<br />
* plot 8x8 sprites<br />
* plot 16*16 sprites<br />
* tilemappers of jim_e and dwedit<br />
* flip horizontally a byte (calcmaniac84 optimized version) (done)<br />
* apps safe puts<br />
* math: multiplications, divisions (done)<br />
* a to BC signed<br />
* linking routines<br />
* decompression routines (and compression of course)<br />
* handle appsvars<br />
* archive/unarchive safely vars (programs, appsvars, etc.)<br />
* GetCSC (it was done)<br />
* Others I don't remember or I am digging from forums<br />
<br />
''' And I will add compatible routines with TI-85/86, TI-83 regular and all others I can<br />
<br />
I will normalize the categories "Routines" and "Z80 Routines" to a single one: "Z80 Routines" because there is the remote possibility of 68k/C routines.<br />
EDIT: done. Sorry if someone on Internet directed to those pages. :( I only realized after it was done. [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
<br />
EDIT2: Why not open a part to spasm macros, too? Where exactly? [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
EDITs: added more things to come [[User:Galandros|Galandros]] 22:11, 26 October 2009 (UTC)<br />
<br />
Took care of most multiplication/division routines [[User:Fullmetalcoder|Fullmetalcoder]] 16:47, 5 November 2009 (UTC)<br />
<br />
Filled in the getPixel page as well. [[User:Fullmetalcoder|Fullmetalcoder]] 16:59, 5 November 2009 (UTC)</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Category_talk:Z80_RoutinesCategory talk:Z80 Routines2009-11-05T16:59:58Z<p>Fullmetalcoder: /* Contribute */</p>
<hr />
<div>Something's wrong with the redirect here, it's not showing the subcategories! --[[User:Dwedit|Dwedit]] 10:31, 28 Mar 2005 (PST)<br />
<br />
Fixed. There's a mediawiki issue with redirects to categories apparently. So the links now go directly to the category pages. --[[User:Dan Englender|Dan Englender]] 10:46, 28 Mar 2005 (PST)<br />
<br />
== Contribute ==<br />
<br />
It is predicted by me:<br />
* NewLine<br />
* HomeUp<br />
* DispHL (done)<br />
* DispA (done)<br />
* getpixel (done)<br />
* plot 8x8 sprites<br />
* plot 16*16 sprites<br />
* tilemappers of jim_e and dwedit<br />
* flip horizontally a byte (calcmaniac84 optimized version) (done)<br />
* apps safe puts<br />
* math: multiplications, divisions (done)<br />
* a to BC signed<br />
* linking routines<br />
* decompression routines (and compression of course)<br />
* handle appsvars<br />
* archive/unarchive safely vars (programs, appsvars, etc.)<br />
* GetCSC (it was done)<br />
* Others I don't remember or I am digging from forums<br />
<br />
''' And I will add compatible routines with TI-85/86, TI-83 regular and all others I can<br />
<br />
I will normalize the categories "Routines" and "Z80 Routines" to a single one: "Z80 Routines" because there is the remote possibility of 68k/C routines.<br />
EDIT: done. Sorry if someone on Internet directed to those pages. :( I only realized after it was done. [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
<br />
EDIT2: Why not open a part to spasm macros, too? Where exactly? [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
EDITs: added more things to come [[User:Galandros|Galandros]] 22:11, 26 October 2009 (UTC)<br />
<br />
Took care of most multiplication/division routines [[User:Fullmetalcoder|Fullmetalcoder]] 16:47, 5 November 2009 (UTC)<br />
Filled in the getPixel page as well. [[User:Fullmetalcoder|Fullmetalcoder]] 16:59, 5 November 2009 (UTC)</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:getPixelZ80 Routines:Graphic:getPixel2009-11-05T16:58:41Z<p>Fullmetalcoder: </p>
<hr />
<div>[[Category:Z80 Routines:Graphic|GetPixel]]<br />
[[Category:Z80 Routines|GetPixel]]<br />
<br />
The '''getPixel''' routine is a utility that simplifies pixel manipulation in the graph buffer.<br />
<br />
== Code ==<br />
<nowiki><br />
; brief : utility for pixel manipulation<br />
; input : a -> x coord, l -> y coord<br />
; output : hl -> address in graph buffer, a -> pixel mask<br />
; destroys : b, de<br />
getPixel:<br />
ld h, 0<br />
ld d, h<br />
ld e, l<br />
<br />
add hl, hl<br />
add hl, de<br />
add hl, hl<br />
add hl, hl<br />
<br />
ld e, a<br />
srl e<br />
srl e<br />
srl e<br />
add hl, de<br />
<br />
ld de, PlotSScreen ; it might be a good idea to have buffer indirection here, i.e : ld de, (buffer_addr)<br />
add hl, de<br />
<br />
and 7<br />
ld b, a<br />
ld a, $80<br />
ret z<br />
<br />
rrca<br />
djnz $-1<br />
<br />
ret<br />
</nowiki><br />
<br />
== Example usage ==<br />
<nowiki><br />
; brief : set (darkens) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
setPixel:<br />
call getPixel<br />
or (hl)<br />
ld (hl), a<br />
ret<br />
<br />
; brief : reset (lighten) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
resetPixel:<br />
call getPixel<br />
cpl<br />
and (hl)<br />
ld (hl), a<br />
ret<br />
<br />
; brief : flip (invert) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
flipPixel:<br />
call getPixel<br />
xor (hl)<br />
ld (hl), a<br />
ret<br />
</nowiki><br />
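<br />
In the same spirit, a pixel can be tested without modifying the buffer. The following is a minimal sketch; the name testPixel is only an illustration and is not part of the original page.<br />
 <nowiki><br />
; brief : test a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : z flag set if the pixel is clear, reset if the pixel is set<br />
; destroys : a, b, de, hl<br />
testPixel:<br />
    call getPixel   ; hl -> byte in graph buffer, a -> pixel mask<br />
    and (hl)        ; a is non-zero only if the masked bit is set<br />
    ret<br />
</nowiki><br />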
<br />
== Comments ==<br />
* Don't use this to plot sprites! Use putSprite<br />
* Although this is not its most common use, the routine can also be called just to get hl pointing to the desired byte in the graph buffer</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:getPixelZ80 Routines:Graphic:getPixel2009-11-05T16:58:21Z<p>Fullmetalcoder: filled</p>
<hr />
<div>[[Category:Z80 Routines:Graphic|GetPixel]]<br />
[[Category:Z80 Routines|GetPixel]]<br />
<br />
The '''getPixel''' routine is a utility that simplifies pixel manipulation in the graph buffer.<br />
<br />
== Code ==<br />
<nowiki><br />
; brief : utility for pixel manipulation<br />
; input : a -> x coord, l -> y coord<br />
; output : hl -> address in graph buffer, a -> pixel mask<br />
; destroys : b, de<br />
getPixel:<br />
ld h, 0<br />
ld d, h<br />
ld e, l<br />
<br />
add hl, hl<br />
add hl, de<br />
add hl, hl<br />
add hl, hl<br />
<br />
ld e, a<br />
srl e<br />
srl e<br />
srl e<br />
add hl, de<br />
<br />
ld de, PlotSScreen ; it might be a good idea to have buffer indirection here, i.e : ld de, (buffer_addr)<br />
add hl, de<br />
<br />
and 7<br />
ld b, a<br />
ld a, $80<br />
ret z<br />
<br />
rrca<br />
djnz $-1<br />
<br />
ret<br />
</nowiki><br />
<br />
== Example usage ==<br />
<nowiki><br />
; brief : set (darkens) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
setPixel:<br />
call getPixel<br />
or (hl)<br />
ld (hl), a<br />
ret<br />
<br />
; brief : reset (lighten) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
resetPixel:<br />
call getPixel<br />
cpl<br />
and (hl)<br />
ld (hl), a<br />
ret<br />
<br />
; brief : flip (invert) a pixel in the graph buffer<br />
; input : a -> x coord, l -> y coord<br />
; output : none<br />
; destroys : a, b, de, hl<br />
flipPixel:<br />
call getPixel<br />
xor (hl)<br />
ld (hl), a<br />
ret<br />
</nowiki><br />
<br />
== Comments ==<br />
* Don't use this to plot sprites! Use putSprite<br />
* Although this is not its most common use, the routine can also be called just to get hl pointing to the desired byte in the graph buffer</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:DivisionZ80 Routines:Math:Division2009-11-05T16:49:57Z<p>Fullmetalcoder: added categorization</p>
<hr />
<div>[[Category:Z80 Routines:Math|Division]]<br />
[[Category:Z80 Routines|Division]]<br />
<br />
== Introduction ==<br />
<br />
All these routines use the restoring division algorithm, adapted to the Z80 architecture to maximize speed.<br />
They can easily be unrolled to gain some speed.<br />
<br />
== 8/8 division ==<br />
<br />
The following routine divides d by e and places the quotient in d and the remainder in a<br />
<br />
<nowiki><br />
div_d_e:<br />
xor a<br />
ld b, 8<br />
<br />
_loop:<br />
sla d<br />
rla<br />
cp e<br />
jr c, $+4<br />
sub e<br />
inc d<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
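<br />
A short usage sketch, with arbitrary example values (the divisor in e is assumed to be non-zero):<br />
 <nowiki><br />
    ld  d, 100<br />
    ld  e, 7<br />
    call div_d_e    ; d = 14 (quotient), a = 2 (remainder), b = 0<br />
</nowiki><br />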
<br />
== 16/8 division ==<br />
<br />
The following routine divides hl by c and places the quotient in hl and the remainder in a<br />
<br />
<nowiki><br />
div_hl_c:<br />
xor a<br />
ld b, 16<br />
<br />
_loop:<br />
add hl, hl<br />
rla<br />
cp c<br />
jr c, $+4<br />
sub c<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16/16 division ==<br />
<br />
The following routine divides ac by de and places the quotient in ac and the remainder in hl<br />
<br />
<nowiki><br />
div_ac_de:<br />
ld hl, 0<br />
ld b, 16<br />
<br />
_loop:<br />
    sll c           ; undocumented instruction (also known as sl1) : shifts c left and sets bit 0<br />
rla<br />
adc hl, hl<br />
sbc hl, de<br />
jr nc, $+4<br />
add hl, de<br />
dec c<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 24/8 division ==<br />
<br />
The following routine divides ehl by d and places the quotient in ehl and the remainder in a<br />
<br />
<nowiki><br />
div_ehl_d:<br />
xor a<br />
ld b, 24<br />
<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rla<br />
cp d<br />
jr c, $+4<br />
sub d<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 32/8 division ==<br />
<br />
The following routine divides dehl by c and places the quotient in dehl and the remainder in a<br />
<br />
<nowiki><br />
div_dehl_c:<br />
xor a<br />
ld b, 32<br />
<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rl d<br />
rla<br />
cp c<br />
jr c, $+4<br />
sub c<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:MultiplicationZ80 Routines:Math:Multiplication2009-11-05T16:49:35Z<p>Fullmetalcoder: added categorization</p>
<hr />
<div>[[Category:Z80 Routines:Math|Multiplication]]<br />
[[Category:Z80 Routines|Multiplication]]<br />
<br />
== Introduction ==<br />
<br />
All these routines use the classic shift-and-add multiplication algorithm, adapted to the Z80 architecture to maximize speed.<br />
They can easily be unrolled to gain some speed.<br />
<br />
== 8*8 multiplication ==<br />
<br />
The following routine multiplies h by e and places the result in hl<br />
<br />
<nowiki><br />
mult_h_e:<br />
ld l, 0<br />
ld d, l<br />
<br />
sla h ; optimised 1st iteration<br />
jr nc, $+3<br />
ld l, e<br />
<br />
ld b, 7<br />
_loop:<br />
add hl, hl <br />
jr nc, $+3<br />
add hl, de<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
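<br />
A short usage sketch, with arbitrary example values:<br />
 <nowiki><br />
    ld  h, 12<br />
    ld  e, 10<br />
    call mult_h_e   ; hl = 120, b and d are left at 0<br />
</nowiki><br />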
<br />
== 16*8 multiplication ==<br />
<br />
The following routine multiplies de by a and places the result in ahl (which means a is the most significant byte of the product, l the least significant and h the intermediate one...)<br />
<br />
<nowiki><br />
mult_a_de:<br />
ld c, 0<br />
ld h, c<br />
ld l, h<br />
<br />
add a, a ; optimised 1st iteration<br />
jr nc, $+4<br />
ld h,d<br />
ld l,e<br />
<br />
ld b, 7<br />
_loop:<br />
add hl, hl<br />
rla<br />
jr nc, $+4<br />
add hl, de<br />
adc a, c ; yes this is actually adc a, 0 but since c is free we set it to zero and so we can save 1 byte and up to 3 T-states per iteration<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16*16 multiplication ==<br />
<br />
The following routine multiplies bc by de and places the result in dehl.<br />
<br />
<nowiki><br />
mult_de_bc:<br />
ld h, 0<br />
ld l, h<br />
<br />
sla e ; optimised 1st iteration<br />
rl d<br />
jr nc, $+4<br />
ld h, b<br />
ld l, c<br />
<br />
ld a, 15<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rl d<br />
jr nc, $+6<br />
add hl, bc<br />
jr nc, $+3<br />
inc de<br />
<br />
dec a<br />
jr nz, _loop<br />
<br />
ret<br />
</nowiki></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Category_talk:Z80_RoutinesCategory talk:Z80 Routines2009-11-05T16:47:47Z<p>Fullmetalcoder: /* Contribute */</p>
<hr />
<div>Something's wrong with the redirect here, it's not showing the subcategories! --[[User:Dwedit|Dwedit]] 10:31, 28 Mar 2005 (PST)<br />
<br />
Fixed. There's a mediawiki issue with redirects to categories apparently. So the links now go directly to the category pages. --[[User:Dan Englender|Dan Englender]] 10:46, 28 Mar 2005 (PST)<br />
<br />
== Contribute ==<br />
<br />
It is predicted by me:<br />
* NewLine<br />
* HomeUp<br />
* DispHL (done)<br />
* DispA (done)<br />
* getpixel<br />
* plot 8x8 sprites<br />
* plot 16*16 sprites<br />
* tilemappers of jim_e and dwedit<br />
* flip horizontally a byte (calcmaniac84 optimized version) (done)<br />
* apps safe puts<br />
* math: multiplications, divisions<br />
* a to BC signed<br />
* linking routines<br />
* decompression routines (and compression of course)<br />
* handle appsvars<br />
* archive/unarchive safely vars (programs, appsvars, etc.)<br />
* GetCSC (it was done)<br />
* Others I don't remember or I am digging from forums<br />
<br />
''' And I will add compatible routines with TI-85/86, TI-83 regular and all others I can<br />
<br />
I will normalize the categories "Routines" and "Z80 Routines" to a single one: "Z80 Routines" because there is the remote possibility of 68k/C routines.<br />
EDIT: done. Sorry if someone on Internet directed to those pages. :( I only realized after it was done. [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
<br />
EDIT2: Why not open a part to spasm macros, too? Where exactly? [[User:Galandros|Galandros]] 19:57, 25 October 2009 (UTC)<br />
EDITs: added more things to come [[User:Galandros|Galandros]] 22:11, 26 October 2009 (UTC)<br />
<br />
Took care of most multiplication/division routines [[User:Fullmetalcoder|Fullmetalcoder]] 16:47, 5 November 2009 (UTC)</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:DivisionZ80 Routines:Math:Division2009-11-05T16:45:06Z<p>Fullmetalcoder: creation</p>
<hr />
<div>== Introduction ==<br />
<br />
All these routines use the restoring division algorithm, adapted to the Z80 architecture to maximize speed.<br />
They can easily be unrolled to gain some speed.<br />
<br />
== 8/8 division ==<br />
<br />
The following routine divides d by e and places the quotient in d and the remainder in a<br />
<br />
<nowiki><br />
div_d_e:<br />
xor a<br />
ld b, 8<br />
<br />
_loop:<br />
sla d<br />
rla<br />
cp e<br />
jr c, $+4<br />
sub e<br />
inc d<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16/8 division ==<br />
<br />
The following routine divides hl by c and places the quotient in hl and the remainder in a<br />
<br />
<nowiki><br />
div_hl_c:<br />
xor a<br />
ld b, 16<br />
<br />
_loop:<br />
add hl, hl<br />
rla<br />
cp c<br />
jr c, $+4<br />
sub c<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16/16 division ==<br />
<br />
The following routine divides ac by de and places the quotient in ac and the remainder in hl<br />
<br />
<nowiki><br />
div_ac_de:<br />
ld hl, 0<br />
ld b, 16<br />
<br />
_loop:<br />
    sll c           ; undocumented instruction (also known as sl1) : shifts c left and sets bit 0<br />
rla<br />
adc hl, hl<br />
sbc hl, de<br />
jr nc, $+4<br />
add hl, de<br />
dec c<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 24/8 division ==<br />
<br />
The following routine divides ehl by d and places the quotient in ehl and the remainder in a<br />
<br />
<nowiki><br />
div_ehl_d:<br />
xor a<br />
ld b, 24<br />
<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rla<br />
cp d<br />
jr c, $+4<br />
sub d<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 32/8 division ==<br />
<br />
The following routine divides dehl by c and places the quotient in dehl and the remainder in a<br />
<br />
<nowiki><br />
div_dehl_c:<br />
xor a<br />
ld b, 32<br />
<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rl d<br />
rla<br />
cp c<br />
jr c, $+4<br />
sub c<br />
inc l<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:MultiplicationZ80 Routines:Math:Multiplication2009-11-05T16:34:14Z<p>Fullmetalcoder: creation</p>
<hr />
<div>== Introduction ==<br />
<br />
All these routines use the classic shift-and-add multiplication algorithm, adapted to the Z80 architecture to maximize speed.<br />
They can easily be unrolled to gain some speed.<br />
<br />
== 8*8 multiplication ==<br />
<br />
The following routine multiplies h by e and places the result in hl<br />
<br />
<nowiki><br />
mult_h_e:<br />
ld l, 0<br />
ld d, l<br />
<br />
sla h ; optimised 1st iteration<br />
jr nc, $+3<br />
ld l, e<br />
<br />
ld b, 7<br />
_loop:<br />
add hl, hl <br />
jr nc, $+3<br />
add hl, de<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16*8 multiplication ==<br />
<br />
The following routine multiplies de by a and places the result in ahl (which means a is the most significant byte of the product, l the least significant and h the intermediate one...)<br />
<br />
<nowiki><br />
mult_a_de:<br />
ld c, 0<br />
ld h, c<br />
ld l, h<br />
<br />
add a, a ; optimised 1st iteration<br />
jr nc, $+4<br />
ld h,d<br />
ld l,e<br />
<br />
ld b, 7<br />
_loop:<br />
add hl, hl<br />
rla<br />
jr nc, $+4<br />
add hl, de<br />
adc a, c ; yes this is actually adc a, 0 but since c is free we set it to zero and so we can save 1 byte and up to 3 T-states per iteration<br />
<br />
djnz _loop<br />
<br />
ret<br />
</nowiki><br />
<br />
== 16*16 multiplication ==<br />
<br />
The following routine multiplies bc by de and places the result in dehl.<br />
<br />
<nowiki><br />
mult_de_bc:<br />
ld h, 0<br />
ld l, h<br />
<br />
sla e ; optimised 1st iteration<br />
rl d<br />
jr nc, $+4<br />
ld h, b<br />
ld l, c<br />
<br />
ld a, 15<br />
_loop:<br />
add hl, hl<br />
rl e<br />
rl d<br />
jr nc, $+6<br />
add hl, bc<br />
jr nc, $+3<br />
inc de<br />
<br />
dec a<br />
jr nz, _loop<br />
<br />
ret<br />
</nowiki></div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=User:FullmetalcoderUser:Fullmetalcoder2009-11-05T16:17:52Z<p>Fullmetalcoder: some background</p>
<hr />
<div>* Happy owner of a TI-84+ bought in fall 2004 (so it has 128kb of RAM yay!).<br />
* Started coding z80 in 2004 (main project was a 2D engine working with B&W, 4lvl or 8lvl grayscale sources have been lost...)<br />
* Got bored of assembly after a year and got back to C++ (main projects : [http://edyuk.org Edyuk] and [http://qcodeedit.edyuk.org QCodeEdit])<br />
* Felt like toying with assembly again once the OS signing keys were cracked and started writing an [http://code.google.com/p/8xpos OS] for fun.<br />
* Casual lurker on forums, including but not limited to : [http://qtcentre.org QtCentre] [http://qtfr.org QtFr] [http://unitedti.org UnitedTI]<br />
* Reachable by mail at non.deterministic.finite.organism@gmail.com</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Programming_APPS_vs._Ram_ProgramsProgramming APPS vs. Ram Programs2009-11-05T16:06:39Z<p>Fullmetalcoder: /* APPS */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
This page applies to calculators that have Flash memory and therefore support APPS, that is, the TI-83+/TI-84+ family and the TI-73.<br />
<br />
Programming APPS is very much like programming RAM programs, with some extra things to keep in mind.<br />
<br />
== APPS ==<br />
* Code starts at $4000<br />
* An APP needs an application header<br />
* No write-back or self-modifying code<br />
* Most static data passed to bcalls (strings for instance) has to be copied to RAM first<br />
* Page calls, jumps or copy data</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T17:11:40Z<p>Fullmetalcoder: /* Size vs. Speed */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples: mapping, grayscale and 3D graphics.<br />
<br />
== Registers and Memory ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
General use of registers:<br />
* a accumulator<br />
* b counter<br />
<br />
* hl 16-bit accumulator/pointer to memory<br />
* de pointer of destination in memory<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, stack may offer an interesting alternative to fixed RAM location for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push instructions, which lets you initialize the data but restricts the allocated space to multiples of 2 bytes.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with a fixed storage location (and that generally clobber registers in the process).<br />
<br />
<nowiki><br />
; 3 bytes, 19 T-states<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
 and (ix + n)<br />
 or (ix + n)<br />
 xor (ix + n)<br />
<br />
; 3 bytes, 23 T-states<br />
 inc (ix + n)<br />
 dec (ix + n)<br />
<br />
; 4 bytes, 23 T-states<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
 sll (ix + n) ; undocumented instruction<br />
 srl (ix + n)<br />
 bit k, (ix + n) ; k is an immediate value in 0..7 ; bit only takes 20 T-states<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register: depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but they do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to pop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16-bit register :<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you'll start saving space as well (beyond 10 bytes, since 5 pop instructions take as many bytes as this sequence)<br />
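<br />
Putting allocation, access and deallocation together, a routine with stack-allocated temporaries might look like the following minimal sketch. The routine name, the frame size and the stored values are only illustrations; ix is saved and restored on the assumption that the caller relies on it.<br />
 <nowiki><br />
stack_frame_demo:<br />
 push ix ; preserve the caller's ix<br />
 ld ix, -4<br />
 add ix, sp<br />
 ld sp, ix ; 4 bytes allocated, ix points to the first one<br />
<br />
 ld (ix + 0), 10 ; first temporary<br />
 ld (ix + 1), 20 ; second temporary<br />
 ld a, (ix + 0)<br />
 add a, (ix + 1) ; a = 30<br />
<br />
 ld hl, 4 ; drop the 4 bytes (destroys hl)<br />
 add hl, sp<br />
 ld sp, hl<br />
 pop ix ; restore the caller's ix<br />
 ret<br />
</nowiki><br />
Because sp is moved below the temporaries before they are written, an interrupt firing in the middle of the routine cannot overwrite them.<br />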
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled since they are originally intended for use in Interrupt Service Routines. There are situations where this is affordable and others where it isn't. Regardless, it is generally a good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately it is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
; -> save 3 bytes for every ld (dataX),a<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
;try this ;if hl is not tied up and all you do is check flags, use indirection:<br />
ld hl,var<br />
inc (hl)<br />
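 ld a,(hl) ; only needed if the new value is wanted in a<br />
; -> save 2 bytes and 2 T-states (3 bytes and 9 T-states without the final ld a,(hl))<br />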
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
xor $01<br />
;code2<br />
</nowiki><br />
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
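* the slightly trickier one, which takes a couple more bytes but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states, since djnz takes 13 T-states when it jumps and the outer test only runs once every 256 iterations)<br />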
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
 ; an inner counter of 0 stands for 256, which is how djnz treats it<br />
 #define inner_counter8(counter16) (((((counter16) - 1) & 0xff) + 1) & 0xff)<br />
 #define outer_counter8(counter16) ((((counter16) - 1) >> 8) + 1)<br />
</nowiki><br />
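<br />
As a worked example of the split (a sketch; the count of 600 and the label name are only illustrative), 600 iterations give an inner counter of ((600 - 1) & $FF) + 1 = 88 and an outer counter of ((600 - 1) >> 8) + 1 = 3, so the first pass runs 88 times and the two remaining passes 256 times each :<br />
 <nowiki><br />
  ld b, 88 ; inner counter for the first pass<br />
  ld d, 3 ; number of passes<br />
loop600:<br />
  ; loop body here<br />
<br />
  djnz loop600 ; b is 0 when this falls through, so later passes run 256 times<br />
  dec d<br />
  jp nz,loop600<br />
</nowiki><br />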
<br />
=== Size vs. Speed ===<br />
This is the classical trade-off of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters : ASM is generally fast enough, and it is nice to give the user a smaller program that leaves more RAM free.<br />
Sometimes, however, speed is what you need.<br />
<br />
'''General Unrolling'''<br />
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
; "Classic" way : ~21 T-states per byte copied<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
<br />
 ; Unrolled : 16 + 10 / n T-states per byte copied, where n is the number of ldi in the loop -> ~17 T-states when unrolling 8 times<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ; if the size is not a multiple of the number of unrolled ldi then a small trick must be used to jump appropriately inside the loop for the first iteration<br />
loopldi:<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
jp pe, loopldi ; jp used as it is faster and in the case of a loop unrolling we assume speed matters more than size<br />
</nowiki><br />
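When the size is not a multiple of the number of unrolled ldi, one possible form of the entry trick is to jump past the first few ldi so that the first pass only copies the remainder. The sketch below assumes the 8-ldi loop above and an illustrative size of 19 : each ldi is 2 bytes long and 19 = 2 * 8 + 3, so 8 - 3 = 5 ldi are skipped by entering at loopldi + 10.<br />
 <nowiki><br />
  ld hl,src<br />
  ld de,dest<br />
  ld bc,19<br />
  jp loopldi + 10 ; first pass copies 3 bytes (bc = 16), then two full passes of 8<br />
</nowiki><br />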
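This unrolling also works with ldd, and with outi/outd provided the loop test uses the Zero flag instead of P/V, since those count down with b rather than bc.<br />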
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
 ; reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see, these are extremely simple, small and fast ways to alter flags,<br />
which makes them interesting as routine outputs to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use them, remember that these flag (re)setting tricks frequently<br />
overlap, so a special combination of flags might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched.<br />
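<br />
For instance, a routine that wants to return with both Zero and Carry set can build the combination in two steps, changing the carry last (a sketch; what each flag means is up to your own calling convention) :<br />
 <nowiki><br />
  cp a ; Zero set, Carry reset<br />
  scf ; Carry set, Zero left as it was<br />
  ret<br />
</nowiki><br />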
<br />
== Related topics ==<br />
* MaxCodez topic<br />
* ticalc docs<br />
* doc by some TI programmer in Balley Z80<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T17:00:34Z<p>Fullmetalcoder: /* Looping */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator. Typical examples are mapping, grayscale and 3D graphics.<br />
<br />
== Registers and Memory ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
General use of registers:<br />
* a accumulator<br />
* b counter<br />
<br />
* hl 16-bit accumulator/pointer to memory<br />
* de pointer of destination in memory<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber a register in the process).<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states (inc and dec take 23 T-states)<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
inc (ix + n)<br />
dec (ix + n)<br />
<br />
and (ix + n)<br />
or (ix + n)<br />
xor (ix + n)<br />
<br />
 ; 4 bytes, 23 T-states (bit only takes 20 T-states; sll is an undocumented instruction)<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
sll (ix + n)<br />
srl (ix + n)<br />
bit k, (ix + n) ; k is an immediate value in 0..7<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register : depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).<br />
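<br />
Putting allocation, access and deallocation together, a routine with 7 bytes of locals addressed through ix could look like the sketch below. The routine name, the label and the use of the first slot as a counter are hypothetical; hl is assumed to be free at the end, and the loop body must leave sp where it found it. On the TI-83+ ix is the natural choice here, since TI-OS keeps its flags pointer in iy.<br />
 <nowiki><br />
myRoutine:<br />
  push ix ; preserve the caller's ix<br />
  ld ix, -7<br />
  add ix, sp<br />
  ld sp, ix ; 7 bytes allocated : the locals are ix + 0 .. ix + 6<br />
  ld (ix + 0), 10 ; local loop counter<br />
localLoop:<br />
  ; loop body here, free to use every register except ix<br />
  dec (ix + 0)<br />
  jp nz,localLoop<br />
  ld hl, 7<br />
  add hl, sp<br />
  ld sp, hl ; release the locals<br />
  pop ix<br />
  ret<br />
</nowiki><br />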
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16bit register :<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines, which may overwrite them at any time. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do, though it adds 7 bytes and 35/43 T-states (interrupts previously disabled/enabled) to the routine :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
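<br />
As an illustration of the first drawback, note that a and the flags are not swapped by exx, so a can ferry a byte from one set to the other. The sketch below keeps a 16-bit running total in hl' while the main registers stay free for other work; it assumes interrupts are already disabled as described above, and the use made of hl' and de' is only an example :<br />
 <nowiki><br />
  exx<br />
  ld hl, 0 ; hl' = running total<br />
  exx<br />
  ; ... main code produces a byte in a ...<br />
  exx ; switch to the shadow set, a is unchanged<br />
  ld e, a<br />
  ld d, 0<br />
  add hl, de ; hl' = hl' + a (destroys de')<br />
  exx ; back to the main set<br />
</nowiki><br />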
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
=== Optimize size and speed ===<br />
<br />
==== Loading stuff ====<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld b,$20<br />
ld c,$30<br />
;try this<br />
ld bc,$2030<br />
;or this<br />
ld bc,(b_num * 256) + c_num<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
xor a<br />
ld (data1),a<br />
ld (data2),a<br />
ld (data3),a<br />
ld (data4),a<br />
ld (data5),a ;if data1 to data5 are one after the other<br />
;try this<br />
ld hl,data1<br />
ld de,data1+1<br />
xor a<br />
ld (hl),a<br />
ld bc,4<br />
ldir<br />
 ; -> 13 bytes whatever the number of stores : smaller from 5 stores on (16 bytes), but slower than the plain stores<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
ld a,(var)<br />
inc a<br />
ld (var),a<br />
 ;try this if hl is free (saves 2 bytes and 2 T-states; drop the final ld if all you do is check flags)<br />
ld hl,var<br />
inc (hl)<br />
ld a,(hl)<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
==== Math tricks ====<br />
<br />
<nowiki><br />
;Instead of<br />
ld de,-767<br />
add hl,de<br />
;try this<br />
dec h ; -256<br />
dec h ; -512<br />
dec h ; -768<br />
inc hl ; -767<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
srl a<br />
srl a<br />
srl a<br />
;try this<br />
rrca<br />
rrca<br />
rrca<br />
and %00011111<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of<br />
loop:<br />
ld a,2<br />
;code1<br />
ld a,0<br />
;code2<br />
<br />
;try this<br />
ld a,2<br />
loop:<br />
;code1<br />
xor $01<br />
;code2<br />
</nowiki><br />
<br />
==== Looping with 16 bit counter ====<br />
There are two ways to make loops with a 16bit counter :<br />
* the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a<br />
<nowiki><br />
ld bc, ...<br />
loop:<br />
; loop body here<br />
<br />
dec bc<br />
ld a, b<br />
or c<br />
jp nz,loop<br />
</nowiki><br />
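* the slightly trickier one, which takes the counter in de, uses b and costs a couple more bytes, but has a much lower overhead (about 13 * n + 9 * (n / 256) T-states, since djnz takes 13 T-states when it branches and the dec d / jp nz pair only runs once every 256 iterations)<br />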
<nowiki><br />
dec de<br />
ld b, e<br />
inc b<br />
inc d<br />
loop2:<br />
; loop body here<br />
<br />
djnz loop2<br />
dec d<br />
jp nz,loop2<br />
</nowiki><br />
The rationale behind the second method is to reduce the overhead of the "inner" loop as much as possible and to use the fact that when b gets down to zero it will be treated as 256 by djnz. <br />
<br />
You can therefore use the following macros for setting proper values of 8bit loop counters given a 16bit counter in case you want to do the conversion at compile time :<br />
<br />
<nowiki><br />
 ; an inner counter of 0 stands for 256, which is how djnz treats it<br />
 #define inner_counter8(counter16) (((((counter16) - 1) & 0xff) + 1) & 0xff)<br />
 #define outer_counter8(counter16) ((((counter16) - 1) >> 8) + 1)<br />
</nowiki><br />
<br />
=== Size vs. Speed ===<br />
This is the classical trade-off of optimization in computer programming, and the Z80 is no exception.<br />
In ASM, size is most often what matters : ASM is generally fast enough, and it is nice to give the user a smaller program that leaves more RAM free.<br />
Sometimes, however, speed is what you need.<br />
<br />
'''General Unrolling'''<br />
You can repeat the body of a loop several times instead of looping; this is frequently done in multiplication routines.<br />
<br />
'''Unroll commands'''<br />
<nowiki><br />
;Unroll ldir to ldi's and make use of flag po<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size<br />
ldir<br />
;to<br />
ld hl,src<br />
ld de,dest<br />
ld bc,size ;size is divisible by the number of ldi's! This is useful when copying fixed size buffers<br />
loopldi:<br />
ldi<br />
ldi<br />
ldi<br />
ldi<br />
  ret po ; P/V is reset (po) once bc reaches 0<br />
jr loopldi ;at each loop you gain some T-states<br />
</nowiki><br />
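This unrolling also works with ldd, and with outi/outd provided the loop test uses the Zero flag instead of P/V, since those count down with b rather than bc.<br />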
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
 ; reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see, these are extremely simple, small and fast ways to alter flags,<br />
which makes them interesting as routine outputs to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use them, remember that these flag (re)setting tricks frequently<br />
overlap, so a special combination of flags might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched.<br />
<br />
== Related topics ==<br />
* MaxCodez topic<br />
* ticalc docs<br />
* doc by some TI programmer in Balley Z80<br />
<br />
== Acknowledgements ==<br />
* fullmetalcoder<br />
* Galandros</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T10:36:46Z<p>Fullmetalcoder: /* General */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator.<br />
<br />
== Registers and memory ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber a register in the process).<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states (inc and dec take 23 T-states)<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
inc (ix + n)<br />
dec (ix + n)<br />
<br />
and (ix + n)<br />
or (ix + n)<br />
xor (ix + n)<br />
<br />
 ; 4 bytes, 23 T-states (bit only takes 20 T-states; sll is an undocumented instruction)<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
sll (ix + n)<br />
srl (ix + n)<br />
bit k, (ix + n) ; k is an immediate value in 0..7<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register : depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16bit register :<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines, which may overwrite them at any time. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do, though it adds 7 bytes and 35/43 T-states (interrupts previously disabled/enabled) to the routine :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
 ; reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
If you use them, remember that these flag (re)setting tricks frequently<br />
overlap, so a special combination of flags might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because scf and ccf leave the Sign, Zero and Parity/Overflow flags untouched.</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T10:35:40Z<p>Fullmetalcoder: </p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator.<br />
<br />
== General ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
<br />
=== Stack ===<br />
<br />
When you run out of registers, the stack may offer an interesting alternative to fixed RAM locations for temporary storage.<br />
<br />
==== Allocation ====<br />
<br />
You can allocate stack space with repeated push, which lets you initialize the data but restricts the allocated space to multiples of 2.<br />
An alternative is to allocate uninitialized stack space (hl may be replaced with an index register) :<br />
<nowiki><br />
; allocates 7 bytes of stack space : 5 bytes, 27 T-states instead of 4 bytes, 44 T-states with 4 push which would have forced the alloc of 8 bytes<br />
ld hl, -7<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki><br />
<br />
==== Access ====<br />
<br />
The most common way of accessing data allocated on stack is to use an index register since all allocated "variables" can be accessed without having to use inc/dec but this is obviously not a strict requirement. Beware though, using stack space is not always optimal in terms of speed, depending (among other things) on your register allocation strategy :<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states<br />
ld c, (ix + n) ; n is an immediate value in -128..127<br />
<br />
; 4 bytes, 17 T-states, destroys a<br />
ld a, (somelocation)<br />
ld c, a<br />
</nowiki><br />
<br />
If your needs go beyond simple load/store, however, this method starts to show its real power, since it vastly simplifies some operations that are complicated to do with fixed storage locations (and that generally clobber a register in the process).<br />
<br />
<nowiki><br />
 ; 3 bytes, 19 T-states (inc and dec take 23 T-states)<br />
cp (ix + n)<br />
<br />
sub (ix + n)<br />
sbc a, (ix + n)<br />
add a, (ix + n)<br />
adc a, (ix + n)<br />
<br />
inc (ix + n)<br />
dec (ix + n)<br />
<br />
and (ix + n)<br />
or (ix + n)<br />
xor (ix + n)<br />
<br />
 ; 4 bytes, 23 T-states (bit only takes 20 T-states; sll is an undocumented instruction)<br />
rl (ix + n)<br />
rr (ix + n)<br />
rlc (ix + n)<br />
rrc (ix + n)<br />
sla (ix + n)<br />
sra (ix + n)<br />
sll (ix + n)<br />
srl (ix + n)<br />
bit k, (ix + n) ; k is an immediate value in 0..7<br />
set k, (ix + n)<br />
res k, (ix + n)<br />
</nowiki><br />
<br />
Again, choose wisely between hl and an index register : depending on the structure of your data, the smallest/fastest solution may vary (the hl equivalents are generally 2 bytes smaller and 12 T-states faster, but do not allow indexing and so may require intermediate inc/dec).<br />
<br />
==== Deallocation ====<br />
<br />
If you need to drop an entry from the stack but must preserve all registers, remember that sp can be incremented/decremented like any 16bit register :<br />
<nowiki><br />
; drops the top stack entry : waste 1 byte and 2 T-states but may enable better register allocation...<br />
inc sp<br />
inc sp<br />
</nowiki><br />
<br />
If you have a large amount of stack space to drop and a spare 16 bit register (hl, index, or de that you can easily swap with hl) :<br />
<nowiki><br />
; drop 16 bytes of stack space : 5 bytes, 27 T-states instead of 8 bytes, 80 T-states for 8 pop<br />
ld hl, 16<br />
add hl, sp<br />
ld sp, hl<br />
</nowiki> <br />
The larger the space to drop, the more T-states you will save, and at some point you will start saving space as well (when dropping more than 10 bytes, i.e. more than 5 pop).<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines, which may overwrite them at any time. There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do, though it adds 7 bytes and 35/43 T-states (interrupts previously disabled/enabled) to the routine :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
 ; reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see these are extremely simple, small and fast ways to alter flags<br />
which make them interesting as output of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched (besides Carry they only affect H and N).</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T09:56:50Z<p>Fullmetalcoder: /* Small Tricks */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator.<br />
<br />
== General ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines (an interrupt handler that uses them will clobber whatever you stored there). There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
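<br />
Putting the pieces together, here is a minimal sketch of a routine that uses exx as a cheap way to park the caller's bc, de and hl while preserving the interrupt status as described above. The routine name clear_buffer and the label scratch_buffer (assumed to be a 256-byte RAM area defined elsewhere) are hypothetical.<br />
<nowiki><br />
; clear_buffer (hypothetical) : zeroes 256 bytes at scratch_buffer<br />
; preserves bc, de, hl ; destroys a, flags and the shadow bc', de', hl'<br />
clear_buffer:<br />
    ld a, i ; P/V = interrupt status<br />
    push af ; save it for the exit sequence<br />
    di<br />
    exx ; park the caller's bc, de, hl in the shadow set<br />
    ld hl, scratch_buffer<br />
    ld b, 0 ; djnz with b = 0 loops 256 times<br />
    xor a<br />
clear_buffer_loop:<br />
    ld (hl), a<br />
    inc hl<br />
    djnz clear_buffer_loop<br />
    exx ; bring the caller's bc, de, hl back<br />
    pop af ; get back the saved flags<br />
    ret po ; interrupts were disabled on entry : leave them disabled<br />
    ei<br />
    ret<br />
</nowiki><br />
<br />
Each exx costs 4 T-states and 1 byte, which compares favourably with three push/pop pairs whenever disabling interrupts is acceptable.<br />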
<br />
== Small Tricks ==<br />
Note that the following tricks act much like a peephole optimizer and are the last optimization step : remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.<br />
<br />
<nowiki><br />
;Instead of:<br />
cp 0<br />
;Use<br />
or a<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
; -> save 1 byte and 3 T-states<br />
</nowiki><br />
<br />
<nowiki><br />
; Instead of :<br />
ld a, (hl)<br />
ld (de), a<br />
inc hl<br />
inc de<br />
; Use :<br />
ldi<br />
inc bc<br />
; -> save 1 byte and 4 T-states<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
; Reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see, these are extremely simple, small and fast ways to alter flags,<br />
which makes them useful as outputs of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched (besides Carry they only affect H and N).</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T09:48:11Z<p>Fullmetalcoder: /* General */</p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator.<br />
<br />
== General ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
=== Shadow registers ===<br />
<br />
In some rare cases, when you run out of registers and can neither refactor your algorithm(s) nor rely on RAM storage, you may want to use the shadow registers : af', bc', de' and hl'<br />
<br />
These registers behave like their "standard" counterparts (af, bc, de, hl) and you can swap the two register sets using the following instructions :<br />
<nowiki><br />
ex af, af' ; swaps af and af' as the mnemonic indicates<br />
<br />
exx ; swaps bc, de, hl and bc', de', hl'<br />
</nowiki><br />
<br />
Shadow registers can be of a great help but they come with two drawbacks :<br />
<br />
* they cannot coexist with the "standard" registers : you cannot use ld to assign from a standard to a shadow or vice-versa. Instead you must use nasty constructs such as :<br />
<nowiki><br />
; loads hl' with the contents of hl<br />
push hl<br />
exx<br />
pop hl<br />
</nowiki><br />
<br />
* they require interrupts to be disabled, since they were originally intended for use in interrupt service routines (an interrupt handler that uses them will clobber whatever you stored there). There are situations where this is affordable and others where it isn't. Regardless, it is generally good policy to restore the previous interrupt status (enabled/disabled) upon return instead of leaving it up to the caller. Fortunately this is relatively easy to do (though it does add 4 bytes and 29/33 T-states to the routine) :<br />
<nowiki><br />
ld a, i ; this is the core of the trick, it sets P/V to the value of IFF so P/V is set iff interrupts were enabled at that point<br />
push af ; save flags<br />
di ; disable interrupts<br />
<br />
; do something with shadow registers here<br />
<br />
pop af ; get back flags<br />
ret po ; po = P/V reset so in this case it means interrupts were disabled before the routine was called<br />
ei ; re-enable interrupts<br />
ret<br />
</nowiki><br />
<br />
== Small Tricks ==<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
; Reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see, these are extremely simple, small and fast ways to alter flags,<br />
which makes them useful as outputs of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched (besides Carry they only affect H and N).</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Z80_OptimizationZ80 Optimization2009-11-04T09:14:24Z<p>Fullmetalcoder: </p>
<hr />
<div>{{stub}}<br />
<br />
== Introduction ==<br />
Sometimes you need some extra speed in ASM, or you need to make your game smaller so that it fits on the calculator.<br />
<br />
== General ==<br />
General algorithm improvements and correct use of registers.<br />
<br />
== Small Tricks ==<br />
<nowiki><br />
;Instead of:<br />
ld a,0<br />
;Try this:<br />
xor a ;disadvantages: changes flags<br />
;or<br />
sub a ;disadvantages: changes flags<br />
</nowiki><br />
<br />
== Setting flags ==<br />
On some occasions you might want to selectively set/reset a flag.<br />
<br />
Here are the most common uses :<br />
<nowiki><br />
; set Carry flag<br />
scf<br />
<br />
; reset Carry flag (alters Sign and Zero flags as defined)<br />
or a<br />
<br />
; alternate reset Carry flag (alters Sign and Zero flags as defined)<br />
and a<br />
<br />
; set Zero flag (resets Carry flag, alters Sign flag as defined)<br />
cp a<br />
<br />
; reset Zero flag (alters a, reset Carry flag, alters Sign flag as defined)<br />
or 1<br />
<br />
; set Sign flag (negative) (alters a, reset Zero and Carry flags)<br />
or $80<br />
<br />
; reset Sign flag (positive) (set a to zero, set Zero flag, reset Carry flag)<br />
xor a<br />
</nowiki><br />
<br />
Other possible uses (much rarer) :<br />
<nowiki><br />
; Set parity/overflow (even):<br />
xor a<br />
<br />
; Reset parity/overflow (odd):<br />
sub a<br />
<br />
; set half carry (hardly ever useful but still...)<br />
and a<br />
<br />
; reset half carry (hardly ever useful but still...)<br />
or a<br />
</nowiki><br />
<br />
As you can see, these are extremely simple, small and fast ways to alter flags,<br />
which makes them useful as outputs of routines to indicate error/success or<br />
other status bits that do not require a full register.<br />
<br />
Were you to use this, remember that these flag (re)setting tricks frequently<br />
overlap so if you need a special combination of flags it might require slightly<br />
more elaborate tricks. As a rule of thumb, always alter the carry last in<br />
such cases, because the scf and ccf instructions leave the Sign, Zero and Parity/Overflow flags untouched (besides Carry they only affect H and N).</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Talk:83Plus:OS:Variable_Storage_in_the_User_ArchiveTalk:83Plus:OS:Variable Storage in the User Archive2009-10-26T21:21:37Z<p>Fullmetalcoder: /* Application size */</p>
<hr />
<div>== Application size ==<br />
<br />
BrandonW had this to say on the page:<br />
<BrandonW> That "401Ch" thing for the app size, that's completely wrong.<br />
<BrandonW> You have to use the field search routines to find it.<br />
<BrandonW> Just because it's at 401Ch on one doesn't mean it'll be there on another.<br />
It'd be nice if someone who knows about this sort of thing could explain how to do this in more detail. For that, we need more general information on parsing headers.<br />
[[User:Dr. D&#39;nar|Dr. D&#39;nar]] 22:17, 10 October 2009 (UTC)<br />
<br />
As far as I know, every field in the app header (save for the date stamp) starts with $80 and a byte whose first nibble indicates the field type and whose second nibble indicates the field size. The only exception appears to be the two fields of size four, for which the "size nibble" is F... Some investigation into the OS implementation of field search might clarify this... [[User:fullmetalcoder]] 14:59 13 October 2009 (GMT+2)<br />
<br />
:The boot code provides lots of routines for searching through app headers ([[83Plus:BCALLs:805A]], [[83Plus:BCALLs:805D]], [[83Plus:BCALLs:8075]], [[83Plus:BCALLs:80AB]], ...) In fact, the OS uses these boot code routines whenever it needs to find data in an app header, certificate, or the OS header.<br />
:It is '''absolutely not''' safe to assume anything about the addresses or lengths of app header fields, apart from the fact that the header is required to be at most 128 bytes in total (so that it fits in a single link packet.) As I recall, WabbitSign provides some excellent examples of how '''not''' to read app headers.<br />
:[[User:FloppusMaximus|FloppusMaximus]] 02:09, 14 October 2009 (UTC)<br />
<br />
::It is safe enough to assume understanding of the "field metadata" byte however. I had figured out most of its meaning, just did not know about the extra two possibilities (word and byte sizes indicated by E and D lower nibble respectively but I finally found them in BootFree source so I can write bcall-free app search routines now yay!)[[User:fullmetalcoder]] 22:18 26 October 2009 (GMT+1)</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Talk:83Plus:OS:Variable_Storage_in_the_User_ArchiveTalk:83Plus:OS:Variable Storage in the User Archive2009-10-13T13:01:36Z<p>Fullmetalcoder: /* Application size */</p>
<hr />
<div>== Application size ==<br />
<br />
BrandonW had this to say on the page:<br />
<BrandonW> That "401Ch" thing for the app size, that's completely wrong.<br />
<BrandonW> You have to use the field search routines to find it.<br />
<BrandonW> Just because it's at 401Ch on one doesn't mean it'll be there on another.<br />
It'd be nice if someone who knows about this sort of thing could explain how to do this in more detail. For that, we need more general information on parsing headers.<br />
[[User:Dr. D&#39;nar|Dr. D&#39;nar]] 22:17, 10 October 2009 (UTC)<br />
<br />
As far as I know, every field in the app header (save for the date stamp) starts with $80 and a byte whose first nibble indicates the field type and whose second nibble indicates the field size. The only exception appears to be the two fields of size four, for which the "size nibble" is F... Some investigation into the OS implementation of field search might clarify this... [[User:fullmetalcoder]] 14:59 13 October 2009 (GMT+2)</div>Fullmetalcoderhttps://wikiti.brandonw.net/index.php?title=Talk:83Plus:OS:Variable_Storage_in_the_User_ArchiveTalk:83Plus:OS:Variable Storage in the User Archive2009-10-13T13:01:20Z<p>Fullmetalcoder: /* Application size */</p>
<hr />
<div>== Application size ==<br />
<br />
BrandonW had this to say on the page:<br />
<BrandonW> That "401Ch" thing for the app size, that's completely wrong.<br />
<BrandonW> You have to use the field search routines to find it.<br />
<BrandonW> Just because it's at 401Ch on one doesn't mean it'll be there on another.<br />
It'd be nice if someone who knows about this sort of thing could explain how to do this in more detail. For that, we need more general information on parsing headers.<br />
[[User:Dr. D&#39;nar|Dr. D&#39;nar]] 22:17, 10 October 2009 (UTC)<br />
<br />
As far as I know, every field in the app header (save for the date stamp) starts with $80 and a byte whose first nibble indicates the field type and whose second nibble indicates the field size. The only exception appears to be the two fields of size four, for which the "size nibble" is F... Some investigation into the OS implementation of field search might clarify this... [[User:fuulmetalcoder]] 14:59 13 October 2009 (GMT+2)</div>Fullmetalcoder