<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://wikiti.brandonw.net/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://wikiti.brandonw.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=John+metcalf</id>
		<title>WikiTI - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://wikiti.brandonw.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=John+metcalf"/>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Special:Contributions/John_metcalf"/>
		<updated>2026-04-05T18:49:41Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.23.5</generator>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Random</id>
		<title>Z80 Routines:Math:Random</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Random"/>
				<updated>2019-08-02T12:10:22Z</updated>
		
		<summary type="html">&lt;p&gt;John metcalf: Added 16-bit Xorshift&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Z80 Routines:Math|Random]]&lt;br /&gt;
[[Category:Z80 Routines|Random]]&lt;br /&gt;
&lt;br /&gt;
==Ion Random==&lt;br /&gt;
This is based off the tried and true [http://en.wikipedia.org/wiki/PRNG pseudorandom number generator] featured in Ion by Joe Wingbermuehle&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;-----&amp;gt; Generate a random number&lt;br /&gt;
; output a=answer 0&amp;lt;=a&amp;lt;=255&lt;br /&gt;
; all registers are preserved except: af&lt;br /&gt;
random:&lt;br /&gt;
        push    hl&lt;br /&gt;
        push    de&lt;br /&gt;
        ld      hl,(randData)&lt;br /&gt;
        ld      a,r&lt;br /&gt;
        ld      d,a&lt;br /&gt;
        ld      e,(hl)&lt;br /&gt;
        add     hl,de&lt;br /&gt;
        add     a,l&lt;br /&gt;
        xor     h&lt;br /&gt;
        ld      (randData),hl&lt;br /&gt;
        pop     de&lt;br /&gt;
        pop     hl&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
randData here must be a 2 byte seed located in ram.  While this is a fast generator, it's generally not considered very good in terms of randomness.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Linear Feedback Shift Register==&lt;br /&gt;
This particular prng is based on [http://en.wikipedia.org/wiki/LFSR Linear feedback shift register].  It uses a 64bit seed and generates 8 new bits at every call. LFSRSeed must be an 8 byte seed located in ram.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;------LFSR------&lt;br /&gt;
;James Montelongo&lt;br /&gt;
;optimized by Spencer Putt&lt;br /&gt;
;out:&lt;br /&gt;
; a = 8 bit random number&lt;br /&gt;
RandLFSR:&lt;br /&gt;
        ld hl,LFSRSeed+4&lt;br /&gt;
        ld e,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld d,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld c,(hl)&lt;br /&gt;
        inc hl&lt;br /&gt;
        ld a,(hl)&lt;br /&gt;
        ld b,a&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        ld h,a&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        rl c \ rla&lt;br /&gt;
        xor b&lt;br /&gt;
        rl e \ rl d&lt;br /&gt;
        xor h&lt;br /&gt;
        xor c&lt;br /&gt;
        xor d&lt;br /&gt;
        ld hl,LFSRSeed+6&lt;br /&gt;
        ld de,LFSRSeed+7&lt;br /&gt;
        ld bc,7&lt;br /&gt;
        lddr&lt;br /&gt;
        ld (de),a&lt;br /&gt;
        ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
While this may produces better numbers, it is slower, larger and requires a bigger seed than ionrandom.  Assuming theres is a good seed to start, it should generate ~2^56 bytes before repeating.  However if there is not a good seed(0 for example), then the numbers created will not be adequate.  Unlike Ionrandom and its use of the r register, starting with the same seed the same numbers will be generated. With Ionrandom the code running may have an impact on the number generated. This means this method requires more initialization.&lt;br /&gt;
&lt;br /&gt;
You can initialize with TI-OS's seeds, stored at seed1 and seed2, both are ti-floats but will serve the purpose.&lt;br /&gt;
&lt;br /&gt;
==Combined LFSR/LCG, 16-bit seeds==&lt;br /&gt;
This is a very fast, quality pseudo-random number generator. It combines a 16-bit Linear Feedback Shift Register and a 16-bit LCG.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
prng16:&lt;br /&gt;
;Inputs:&lt;br /&gt;
;   (seed1) contains a 16-bit seed value&lt;br /&gt;
;   (seed2) contains a NON-ZERO 16-bit seed value&lt;br /&gt;
;Outputs:&lt;br /&gt;
;   HL is the result&lt;br /&gt;
;   BC is the result of the LCG, so not that great of quality&lt;br /&gt;
;   DE is preserved&lt;br /&gt;
;Destroys:&lt;br /&gt;
;   AF&lt;br /&gt;
;cycle: 4,294,901,760 (almost 4.3 billion)&lt;br /&gt;
;160cc&lt;br /&gt;
;26 bytes&lt;br /&gt;
    ld hl,(seed1)&lt;br /&gt;
    ld b,h&lt;br /&gt;
    ld c,l&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    inc l&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ld (seed1),hl&lt;br /&gt;
    ld hl,(seed2)&lt;br /&gt;
    add hl,hl&lt;br /&gt;
    sbc a,a&lt;br /&gt;
    and %00101101&lt;br /&gt;
    xor l&lt;br /&gt;
    ld l,a&lt;br /&gt;
    ld (seed2),hl&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
On their own, LCGs and LFSRs don't produce great results and are generally very cyclical, but they are very fast to compute. The 16-bit LCG in the above example will bounce around and reach each number from 0 to 65535, but the lower bits are far more predictable than the upper bits. The LFSR mixes up the predictability of a given bit's state, but it hits every number except 0, meaning there is a slightly higher chance of any given bit in the result being a 1 instead of a 0. It turns out that by adding together the outputs of these two generators, we can lose the predictability of a bit's state, while ensuring it has a 50% chance of being 0 or 1. As well, since the periods, 65536 and 65535 are coprime, then the overall period of the generator is 65535*65536, which is over 4 billion.&lt;br /&gt;
&lt;br /&gt;
==Combined LFSR/LCG, 32-bit seeds==&lt;br /&gt;
This is similar to the one above, except that it uses 32-bit seeds (and still returns a 16-bit result). An advantage here is that they've been tested and passed randomness tests (all of thee ones offered by CAcert labs). As well, it is still very fast.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
rand32:&lt;br /&gt;
;Inputs:&lt;br /&gt;
;   (seed1_0) holds the lower 16 bits of the first seed&lt;br /&gt;
;   (seed1_1) holds the upper 16 bits of the first seed&lt;br /&gt;
;   (seed2_0) holds the lower 16 bits of the second seed&lt;br /&gt;
;   (seed2_1) holds the upper 16 bits of the second seed&lt;br /&gt;
;   **NOTE: seed2 must be non-zero&lt;br /&gt;
;Outputs:&lt;br /&gt;
;   HL is the result&lt;br /&gt;
;   BC,DE can be used as lower quality values, but are not independent of HL.&lt;br /&gt;
;Destroys:&lt;br /&gt;
;   AF&lt;br /&gt;
;Tested and passes all CAcert tests&lt;br /&gt;
;Uses a very simple 32-bit LCG and 32-bit LFSR&lt;br /&gt;
;it has a period of 18,446,744,069,414,584,320&lt;br /&gt;
;roughly 18.4 quintillion.&lt;br /&gt;
;LFSR taps: 0,2,6,7  = 11000101&lt;br /&gt;
;291cc&lt;br /&gt;
seed1_0=$+1&lt;br /&gt;
    ld hl,12345&lt;br /&gt;
seed1_1=$+1&lt;br /&gt;
    ld de,6789&lt;br /&gt;
    ld b,h&lt;br /&gt;
    ld c,l&lt;br /&gt;
    add hl,hl \ rl e \ rl d&lt;br /&gt;
    add hl,hl \ rl e \ rl d&lt;br /&gt;
    inc l&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ld (seed1_0),hl&lt;br /&gt;
    ld hl,(seed1_1)&lt;br /&gt;
    adc hl,de&lt;br /&gt;
    ld (seed1_1),hl&lt;br /&gt;
    ex de,hl&lt;br /&gt;
seed2_0=$+1&lt;br /&gt;
    ld hl,9876&lt;br /&gt;
seed2_1=$+1&lt;br /&gt;
    ld bc,54321&lt;br /&gt;
    add hl,hl \ rl c \ rl b&lt;br /&gt;
    ld (seed2_1),bc&lt;br /&gt;
    sbc a,a&lt;br /&gt;
    and %11000101&lt;br /&gt;
    xor l&lt;br /&gt;
    ld l,a&lt;br /&gt;
    ld (seed2_0),hl&lt;br /&gt;
    ex de,hl&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Xorshift==&lt;br /&gt;
&lt;br /&gt;
Xorshift is a class of pseudorandom number generators discover by George Marsaglia and detailed in his 2003 paper, ''Xorshift RNGs''. &lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; 16-bit xorshift pseudorandom number generator by John Metcalf&lt;br /&gt;
; 20 bytes, 86 cycles (excluding ret)&lt;br /&gt;
&lt;br /&gt;
; returns   hl = pseudorandom number&lt;br /&gt;
; corrupts   a&lt;br /&gt;
&lt;br /&gt;
; generates 16-bit pseudorandom numbers with a period of 65535&lt;br /&gt;
; using the xorshift method:&lt;br /&gt;
&lt;br /&gt;
; hl ^= hl &amp;amp;lt;&amp;amp;lt; 7&lt;br /&gt;
; hl ^= hl &amp;amp;gt;&amp;amp;gt; 9&lt;br /&gt;
; hl ^= hl &amp;amp;lt;&amp;amp;lt; 8&lt;br /&gt;
&lt;br /&gt;
; some alternative shift triplets which also perform well are:&lt;br /&gt;
; 6, 7, 13; 7, 9, 13; 9, 7, 13.&lt;br /&gt;
&lt;br /&gt;
  org 32768&lt;br /&gt;
&lt;br /&gt;
xrnd:&lt;br /&gt;
  ld hl,1       ; seed must not be 0&lt;br /&gt;
&lt;br /&gt;
  ld a,h&lt;br /&gt;
  rra&lt;br /&gt;
  ld a,l&lt;br /&gt;
  rra&lt;br /&gt;
  xor h&lt;br /&gt;
  ld h,a&lt;br /&gt;
  ld a,l&lt;br /&gt;
  rra&lt;br /&gt;
  ld a,h&lt;br /&gt;
  rra&lt;br /&gt;
  xor l&lt;br /&gt;
  ld l,a&lt;br /&gt;
  xor h&lt;br /&gt;
  ld h,a&lt;br /&gt;
&lt;br /&gt;
  ld (xrnd+1),hl&lt;br /&gt;
&lt;br /&gt;
  ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>John metcalf</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Multiplication</id>
		<title>Z80 Routines:Math:Multiplication</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Multiplication"/>
				<updated>2019-08-02T11:41:44Z</updated>
		
		<summary type="html">&lt;p&gt;John metcalf: /* Signed versions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Z80 Routines:Math|Multiplication]]&lt;br /&gt;
[[Category:Z80 Routines|Multiplication]]&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
All these routines use the restoring multiplication algorithm, adapted to the z80 architecture to maximize speed.&lt;br /&gt;
They can easily be unrolled to gain some speed.&lt;br /&gt;
&lt;br /&gt;
== Unsigned versions ==&lt;br /&gt;
&lt;br /&gt;
=== 8*8 multiplication ===&lt;br /&gt;
&lt;br /&gt;
The following routine multiplies h by e and places the result in hl&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
mult_h_e&lt;br /&gt;
   ld	l, 0&lt;br /&gt;
   ld	d, l&lt;br /&gt;
&lt;br /&gt;
   sla	h		; optimised 1st iteration&lt;br /&gt;
   jr	nc, $+3&lt;br /&gt;
   ld	l, e&lt;br /&gt;
   &lt;br /&gt;
   ld b, 7&lt;br /&gt;
_loop:&lt;br /&gt;
   add	hl, hl          &lt;br /&gt;
   jr	nc, $+3&lt;br /&gt;
   add	hl, de&lt;br /&gt;
   &lt;br /&gt;
   djnz	_loop&lt;br /&gt;
   &lt;br /&gt;
   ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 16*8 multiplication ===&lt;br /&gt;
&lt;br /&gt;
The following routine multiplies de by a and places the result in ahl (which means a is the most significant byte of the product, l the least significant and h the intermediate one...)&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
mult_a_de&lt;br /&gt;
   ld	c, 0&lt;br /&gt;
   ld	h, c&lt;br /&gt;
   ld	l, h&lt;br /&gt;
&lt;br /&gt;
   add	a, a		; optimised 1st iteration&lt;br /&gt;
   jr	nc, $+4&lt;br /&gt;
   ld	h,d&lt;br /&gt;
   ld	l,e&lt;br /&gt;
&lt;br /&gt;
   ld b, 7&lt;br /&gt;
_loop:&lt;br /&gt;
   add	hl, hl&lt;br /&gt;
   rla&lt;br /&gt;
   jr	nc, $+4&lt;br /&gt;
   add	hl, de&lt;br /&gt;
   adc	a, c            ; yes this is actually adc a, 0 but since c is free we set it to zero and so we can save 1 byte and up to 3 T-states per iteration&lt;br /&gt;
   &lt;br /&gt;
   djnz	_loop&lt;br /&gt;
   &lt;br /&gt;
   ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 16*16 multiplication ===&lt;br /&gt;
&lt;br /&gt;
The following routine multiplies bc by de and places the result in dehl.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
mult_de_bc&lt;br /&gt;
   ld	hl, 0&lt;br /&gt;
&lt;br /&gt;
   sla	e		; optimised 1st iteration&lt;br /&gt;
   rl	d&lt;br /&gt;
   jr	nc, $+4&lt;br /&gt;
   ld	h, b&lt;br /&gt;
   ld	l, c&lt;br /&gt;
&lt;br /&gt;
   ld	a, 15&lt;br /&gt;
_loop:&lt;br /&gt;
   add	hl, hl&lt;br /&gt;
   rl	e&lt;br /&gt;
   rl	d&lt;br /&gt;
   jr	nc, $+6&lt;br /&gt;
   add	hl, bc&lt;br /&gt;
   jr	nc, $+3&lt;br /&gt;
   inc	de&lt;br /&gt;
   &lt;br /&gt;
   dec	a&lt;br /&gt;
   jr	nz, _loop&lt;br /&gt;
   &lt;br /&gt;
   ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Signed versions ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- do the routines later, basically try to use cpu instruction that preserve bit 7 (sign) or if this is not possible, just take the sign out, do the multiplication and give the sign afterwards and accordingly. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 8*8 multiplication ===&lt;br /&gt;
&lt;br /&gt;
=== 16*8 multiplication ===&lt;br /&gt;
&lt;br /&gt;
=== 16*16 multiplication ===&lt;br /&gt;
&lt;br /&gt;
The following routine multiplies bc by de and places the signed result in dehl.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;; 16 bit signed multiply by John Metcalf&lt;br /&gt;
; 32 bytes, average approx ~1090 cycles&lt;br /&gt;
&lt;br /&gt;
; call with bc, de = 16 bit signed numbers to multiply&lt;br /&gt;
; returns   de:hl = 32 bit signed product&lt;br /&gt;
; corrupts  a&lt;br /&gt;
&lt;br /&gt;
; de:hl = bc*de&lt;br /&gt;
&lt;br /&gt;
multiply:&lt;br /&gt;
  xor a&lt;br /&gt;
  ld h,a&lt;br /&gt;
  ld l,a&lt;br /&gt;
  bit 7,d&lt;br /&gt;
  jr z,muldpos&lt;br /&gt;
  sbc hl,bc&lt;br /&gt;
muldpos:&lt;br /&gt;
  ld a,b&lt;br /&gt;
  sra a&lt;br /&gt;
  and 0C0h&lt;br /&gt;
  add a,d&lt;br /&gt;
  ld d,a&lt;br /&gt;
&lt;br /&gt;
  ld a,16&lt;br /&gt;
mulloop:&lt;br /&gt;
  add hl,hl&lt;br /&gt;
  rl e&lt;br /&gt;
  rl d&lt;br /&gt;
  jr nc,mul0bit&lt;br /&gt;
  add hl,bc&lt;br /&gt;
  jr nc,mul0bit&lt;br /&gt;
  inc de&lt;br /&gt;
mul0bit:&lt;br /&gt;
  dec a&lt;br /&gt;
  jr nz,mulloop&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>John metcalf</name></author>	</entry>

	<entry>
		<id>https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Square_root</id>
		<title>Z80 Routines:Math:Square root</title>
		<link rel="alternate" type="text/html" href="https://wikiti.brandonw.net/index.php?title=Z80_Routines:Math:Square_root"/>
				<updated>2019-08-02T11:31:03Z</updated>
		
		<summary type="html">&lt;p&gt;John metcalf: Add code for square root in max 379 t-states&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Z80 Routines:Math|Square root]]&lt;br /&gt;
[[Category:Z80 Routines|Square root]]&lt;br /&gt;
&lt;br /&gt;
==Size Optimization==&lt;br /&gt;
This version is size optimized, it compares every perfect square against HL until a square that is larger is found.  Obviously slower, but does get the job done in only 12 bytes.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;-------------------------------&lt;br /&gt;
;Square Root&lt;br /&gt;
;Inputs:&lt;br /&gt;
;HL = number to be square rooted&lt;br /&gt;
;Outputs:&lt;br /&gt;
;A  = square root&lt;br /&gt;
&lt;br /&gt;
sqrt:&lt;br /&gt;
   ld a,$ff&lt;br /&gt;
   ld de,1&lt;br /&gt;
sqrtloop:&lt;br /&gt;
   inc a&lt;br /&gt;
   dec e&lt;br /&gt;
   dec de&lt;br /&gt;
   add hl,de&lt;br /&gt;
   jr c,sqrtloop&lt;br /&gt;
   ret &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Speed Optimization==&lt;br /&gt;
This version is an unrolled version of a square root algorithm taking at most 555 t-states. The downside is that it takes 111 bytes, but if speed is a priority, this is probably the best option. Most of the unrolled iterations are specially optimised for their location in the algorithm.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;-------------------------------&lt;br /&gt;
;Square Root&lt;br /&gt;
;Inputs:&lt;br /&gt;
;HL = number to be square rooted&lt;br /&gt;
;Outputs:&lt;br /&gt;
;A  = square root&lt;br /&gt;
;111 bytes&lt;br /&gt;
;555 t-states worst case&lt;br /&gt;
SqrtHL:&lt;br /&gt;
;zero some registers&lt;br /&gt;
	xor a&lt;br /&gt;
	ld c,a&lt;br /&gt;
	ld d,a&lt;br /&gt;
&lt;br /&gt;
;move the LSB of the input into E for later use, then shift the LSB into L and load H with 0.&lt;br /&gt;
;H will be a carry register, where the bits in L are rotated in&lt;br /&gt;
	ld e,l&lt;br /&gt;
	ld l,h&lt;br /&gt;
	ld h,c&lt;br /&gt;
&lt;br /&gt;
;Iteration 1 is optimised&lt;br /&gt;
; C is treated as the accumulator&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 2&lt;br /&gt;
; rotate in 2 more bits from the MSB of the input into H&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
; shift the accumulator&lt;br /&gt;
	rl c&lt;br /&gt;
	ld a,c&lt;br /&gt;
	rla&lt;br /&gt;
; A is now double the shifted accumulator&lt;br /&gt;
	sub h&lt;br /&gt;
; doubles as a comparison of the carry register (H) to double the accumulator&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
; If the carry is &amp;gt; 2*accumulator, the bit in the accumulator needs to be 1:&lt;br /&gt;
	inc c&lt;br /&gt;
; We need to perform H-(2C+1), but A=2C-H.&lt;br /&gt;
; We could do NEG to get A=H-2C, then DEC A, but NEG = CPL \ INC A&lt;br /&gt;
; NEG \ DEC A  =  CPL \ INC A \ DEC A&lt;br /&gt;
; So just use CPL, saving 8 t-states, 1 byte&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 3&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	rl c&lt;br /&gt;
	ld a,c&lt;br /&gt;
	rla&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 4&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	rl c&lt;br /&gt;
	ld a,c&lt;br /&gt;
	rla&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;L is 0, H is the current carry&lt;br /&gt;
;E is the lower 8 bits&lt;br /&gt;
; Load the next set of bits (LSB of input) into L so that they can be rotated into H&lt;br /&gt;
	ld l,e&lt;br /&gt;
&lt;br /&gt;
;Iteration 5&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	rl c&lt;br /&gt;
	ld a,c&lt;br /&gt;
	rla&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 6&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	rl c&lt;br /&gt;
	ld a,c&lt;br /&gt;
	rla&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 7&lt;br /&gt;
; Now we need to start worrying about 8 bit overflow.&lt;br /&gt;
; In particular, the carry register, H should be ideally 9 bits for this iteration, 10 for the last.&lt;br /&gt;
; The accumulator, C, is 8 bits, but we need to compare H to 2*C, and 2*C is up to 9 bits on the last iteration.&lt;br /&gt;
;l has 4 more bits to rotate into h&lt;br /&gt;
&lt;br /&gt;
	sla c \ ld a,c \ add a,a&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	add hl,hl&lt;br /&gt;
	jr nc,$+6&lt;br /&gt;
	sub h \ jp $+6&lt;br /&gt;
	sub h&lt;br /&gt;
	jr nc,$+5&lt;br /&gt;
	inc c&lt;br /&gt;
	cpl&lt;br /&gt;
	ld h,a&lt;br /&gt;
&lt;br /&gt;
;Iteration 8&lt;br /&gt;
; A lot of fancy stuff here&lt;br /&gt;
; D is 0, from way back at the beginning&lt;br /&gt;
; now I put H-&amp;gt;E so that DE can hold the potentially 10-bit number&lt;br /&gt;
; Now C-&amp;gt;A, L-&amp;gt;H&lt;br /&gt;
; H thus has the last two bits of the input that need to be rotated into DE&lt;br /&gt;
; L has the value of the accumualtor which needs to be multiplied by 4 for a comparison to DE&lt;br /&gt;
; So 2 shifts of HL into DE results in DE holding the carry, HL holding 4*accumulated result!	&lt;br /&gt;
	ld e,h&lt;br /&gt;
	ld h,l&lt;br /&gt;
	ld l,c&lt;br /&gt;
   	ld a,l&lt;br /&gt;
	add hl,hl \ rl e \ rl d&lt;br /&gt;
	add hl,hl \ rl e \ rl d&lt;br /&gt;
	sbc hl,de&lt;br /&gt;
;the c flag now has the state of the last bit of the result, HL does not need to be restored.&lt;br /&gt;
	rla&lt;br /&gt;
	ret&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Speed Optimization 2==&lt;br /&gt;
This 92 byte version is an optimized unrolled loop taking at most 379 t-states. Each iteration is optimized for its location in the algorithm.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
; fast 16 bit isqrt by John Metcalf&lt;br /&gt;
; 92 bytes, 344-379 cycles (average 362)&lt;br /&gt;
; v2 - saved 3 cycles with a tweak suggested by Russ McNulty&lt;br /&gt;
&lt;br /&gt;
; call with hl = number to square root&lt;br /&gt;
; returns    a = square root&lt;br /&gt;
; corrupts hl, de&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  ld a,h        ; 4&lt;br /&gt;
  ld de,0B0C0h  ; 10&lt;br /&gt;
  add a,e       ; 4&lt;br /&gt;
  jr c,sq7      ; 12 / 7&lt;br /&gt;
  ld a,h        ; 4&lt;br /&gt;
  ld d,0F0h     ; 7&lt;br /&gt;
sq7:&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  add a,d       ; 4&lt;br /&gt;
  jr nc,sq6     ; 12 / 7&lt;br /&gt;
  res 5,d       ; 8&lt;br /&gt;
  db 254        ; 7&lt;br /&gt;
sq6:&lt;br /&gt;
  sub d         ; 4&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  set 2,d       ; 8&lt;br /&gt;
  add a,d       ; 4&lt;br /&gt;
  jr nc,sq5     ; 12 / 7&lt;br /&gt;
  res 3,d       ; 8&lt;br /&gt;
  db 254        ; 7&lt;br /&gt;
sq5:&lt;br /&gt;
  sub d         ; 4&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  inc d         ; 4&lt;br /&gt;
  add a,d       ; 4&lt;br /&gt;
  jr nc,sq4     ; 12 / 7&lt;br /&gt;
  res 1,d       ; 8&lt;br /&gt;
  db 254        ; 7&lt;br /&gt;
sq4:&lt;br /&gt;
  sub d         ; 4&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
  ld h,a        ; 4&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  add hl,de     ; 11&lt;br /&gt;
  jr nc,sq3     ; 12 / 7&lt;br /&gt;
  ld e,040h     ; 7&lt;br /&gt;
  db 210        ; 10&lt;br /&gt;
sq3:&lt;br /&gt;
  sbc hl,de     ; 15&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
  ld a,e        ; 4&lt;br /&gt;
  rra           ; 4&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  or 010h       ; 7&lt;br /&gt;
  ld e,a        ; 4&lt;br /&gt;
  add hl,de     ; 11&lt;br /&gt;
  jr nc,sq2     ; 12 / 7&lt;br /&gt;
  and 0DFh      ; 7&lt;br /&gt;
  db 218        ; 10&lt;br /&gt;
sq2:&lt;br /&gt;
  sbc hl,de     ; 15&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
  rra           ; 4&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  or 04h        ; 7&lt;br /&gt;
  ld e,a        ; 4&lt;br /&gt;
  add hl,de     ; 11&lt;br /&gt;
  jr nc,sq1     ; 12 / 7&lt;br /&gt;
  and 0F7h      ; 7&lt;br /&gt;
  db 218        ; 10&lt;br /&gt;
sq1:&lt;br /&gt;
  sbc hl,de     ; 15&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
  rra           ; 4&lt;br /&gt;
&lt;br /&gt;
; ----------&lt;br /&gt;
&lt;br /&gt;
  inc a         ; 4&lt;br /&gt;
  ld e,a        ; 4&lt;br /&gt;
  add hl,de     ; 11&lt;br /&gt;
  jr nc,sq0     ; 12 / 7&lt;br /&gt;
  and 0FDh      ; 7&lt;br /&gt;
sq0:&lt;br /&gt;
  sra d         ; 8&lt;br /&gt;
  rra           ; 4&lt;br /&gt;
  cpl           ; 4&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Balanced Optimization==&lt;br /&gt;
This version is a balance between speed and size. It also uses the high school method and runs under 1200 tstates. It only costs 41 bytes.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;;-------------------------------&lt;br /&gt;
;Square Root&lt;br /&gt;
;Inputs:&lt;br /&gt;
;DE = number to be square rooted&lt;br /&gt;
;Outputs:&lt;br /&gt;
;A  = square root&lt;br /&gt;
&lt;br /&gt;
Sqrt:&lt;br /&gt;
    ld hl,0&lt;br /&gt;
    ld c,l&lt;br /&gt;
    ld b,h&lt;br /&gt;
    ld a,8&lt;br /&gt;
Sqrtloop:&lt;br /&gt;
    sla e&lt;br /&gt;
    rl d&lt;br /&gt;
    adc hl,hl&lt;br /&gt;
    sla e&lt;br /&gt;
    rl d&lt;br /&gt;
    adc hl,hl&lt;br /&gt;
    scf               ;Can be optimised&lt;br /&gt;
    rl c              ;with SL1 instruction&lt;br /&gt;
    rl b&lt;br /&gt;
    sbc hl,bc&lt;br /&gt;
    jr nc,Sqrtaddbit&lt;br /&gt;
    add hl,bc&lt;br /&gt;
    dec c&lt;br /&gt;
Sqrtaddbit:&lt;br /&gt;
    inc c&lt;br /&gt;
    res 0,c&lt;br /&gt;
    dec a&lt;br /&gt;
    jr nz,Sqrtloop&lt;br /&gt;
    ld a,c&lt;br /&gt;
    rr b&lt;br /&gt;
    rra&lt;br /&gt;
    ret&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Presumably the best ==&lt;br /&gt;
&lt;br /&gt;
This code was found on z80 bits and has the advantage of being both faster than all three versions above and smaller than the last two (it runs in under 720 T-states (under 640 if fully unrolled) and takes a mere 29 bytes). On the other hand it takes a somewhat unconventionnal input... It computes the square root of the 16bit number formed by la and places the result in d.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
sqrt_la:&lt;br /&gt;
	ld	de, 0040h	; 40h appends &amp;quot;01&amp;quot; to D&lt;br /&gt;
	ld	h, d&lt;br /&gt;
	&lt;br /&gt;
	ld	b, 7&lt;br /&gt;
	&lt;br /&gt;
	; need to clear the carry beforehand&lt;br /&gt;
	or	a&lt;br /&gt;
	&lt;br /&gt;
_loop:&lt;br /&gt;
	sbc	hl, de&lt;br /&gt;
	jr	nc, $+3&lt;br /&gt;
	add	hl, de&lt;br /&gt;
	ccf&lt;br /&gt;
	rl	d&lt;br /&gt;
	rla&lt;br /&gt;
	adc	hl, hl&lt;br /&gt;
	rla&lt;br /&gt;
	adc	hl, hl&lt;br /&gt;
	&lt;br /&gt;
	djnz	_loop&lt;br /&gt;
	&lt;br /&gt;
	sbc	hl, de		; optimised last iteration&lt;br /&gt;
	ccf&lt;br /&gt;
	rl	d&lt;br /&gt;
	&lt;br /&gt;
	ret&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Other Options==&lt;br /&gt;
A binary search of a square table would yield much better best case scenarios and the worst case scenarios would be similar to the high school method. However this would also require 512 byte table making it significantly larger than the other routines.  Of course the table could also serve as a rapid squaring method.&lt;br /&gt;
&lt;br /&gt;
== Credits and Contributions ==&lt;br /&gt;
* '''James Montelongo'''&lt;br /&gt;
* '''Milos &amp;quot;baze&amp;quot; Bazelides''' (or possibly one of the contributor of [http://baze.au.com/misc/z80bits.html z80bits])&lt;br /&gt;
* '''John Metcalf''' (Speed Optimization 2 from [https://github.com/impomatic/z80snippets/blob/master/fastisqr.asm z80snippets])&lt;/div&gt;</summary>
		<author><name>John metcalf</name></author>	</entry>

	</feed>