This article explores the pop slide, a high-performance technique for rapid data copying on the Game Boy. This trick was introduced to me by Dave VanEe, a reader of my book, Game Boy Coding Adventure.
We will begin with a standard "brute force" copy, which is easy to implement but relatively slow. From there, we will refine it into an optimized version. Finally, we will implement the pop slide, the fastest way to copy data on the hardware, while navigating a few of its unique trade-offs. By the end, you will have the knowledge to select the perfect copy method for your own projects.
To demonstrate these techniques, we’ll use the simple sample shown in Figure 1, applying each of our three copy methods.
Full commented source code for the samples is available on GitHub. You will also find pre-built ROMs there, allowing you to test the performance on the BGB emulator or real hardware immediately.
These samples are written in assembly for the RGBDS toolchain. If you are new to assembly or looking for a deep dive into Game Boy hardware, my book, Game Boy Coding Adventure, provides a comprehensive, beginner-friendly guide to everything you need to know.
To prove the efficiency of the pop slide, we will measure the cycles required to copy 8 KiB of data into VRAM. We use the two macros in Listing 1 to interface with BGB’s built-in profiler.
; Reset BGB's clock counter.
macro ResetClockCounter
ld d, d
jr .skip\@
dw $6464
dw $0000
db "Clock counter reset %ZEROCLKS%", 0
.skip\@
endm
; Print BGB's clock counter value.
macro PrintClockCounter
ld d, d
jr .skip\@
dw $6464
dw $0000
db "Cycles: %-8+LASTCLKS%", 0
.skip\@
endm
ResetClockCounter resets the clock counter to zero, while PrintClockCounter displays the elapsed clocks in the BGB debug window.
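A few quick notes on interpreting these results: BGB reports the clock count in hexadecimal. BGB counts clocks rather than cycles, and one cycle as used in this article corresponds to two BGB clocks, so we divide the reported value by 2. Finally, the -8 in the format string subtracts 8 from the LASTCLKS value to account for the overhead of the debug call.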
Listing 2 shows how we measure our CopyData routine:
ResetClockCounter
CopyData graphics_data, _VRAM, 8 * 1024
PrintClockCounter
We simply sandwich the CopyData macro between ResetClockCounter and PrintClockCounter.
Each time the sample runs, the number of clocks taken by the copy is printed to BGB's debug messages window.
Our baseline method consists of a simple loop to copy data byte-by-byte, as shown in Listing 3.
macro CopyData
ld de, \1 ; 3
ld hl, \2 ; 3
ld bc, \3 ; 3
.loop\@
; copy a byte from source to destination
ld a, [de] ; 2
ld [hli], a ; 2
inc de ; 2
; check if the copy is over
dec bc ; 2
ld a, c ; 1
or a, b ; 1
jr nz, .loop\@ ; 3
endm
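Each iteration costs 13 cycles, except the last one, whose jr is not taken and costs 2 cycles instead of 3. Copying 8,192 bytes therefore takes 106,495 cycles for the loop (13 * 8,192 - 1), plus 9 cycles to load the registers beforehand, for a total of 106,504 cycles. BGB reports 34010 (hex), or 213,008 clocks. Dividing by 2 yields 106,504, which confirms our computed cycle count.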
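The arithmetic above can also be checked by the assembler itself. The following is a minimal sketch and not part of the original samples: the BRUTE_* constant names are hypothetical, and the `def ... equ` syntax assumes a recent RGBDS release. If any figure were off, the build would stop on the failing assert.

```asm
; Hypothetical assemble-time check of the brute force cycle count.
def BRUTE_SETUP     equ 3 + 3 + 3                 ; ld de / ld hl / ld bc
def BRUTE_ITERATION equ 2 + 2 + 2 + 2 + 1 + 1 + 3 ; one byte copied, jr taken
def BRUTE_BYTES     equ 8 * 1024                  ; size of the copy

; The final jr is not taken, so it costs 2 cycles instead of 3.
def BRUTE_TOTAL     equ BRUTE_SETUP + BRUTE_ITERATION * BRUTE_BYTES - 1

    assert BRUTE_TOTAL == 106504      ; cycles computed by hand
    assert BRUTE_TOTAL * 2 == $34010  ; clocks reported by BGB
```

Since these are all constants, the check emits no code and adds nothing to the ROM.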
To improve performance, we can batch our copy operations, as shown in Listing 4.
By unrolling the loop to handle 32 bytes at a time, we reduce the overhead caused by conditional jumps (jr) and simplify the loop counter by using an 8-bit register instead of a 16-bit one.
While this increases the ROM footprint by 90 bytes, the cycle savings are significant.
macro CopyData
assert \3 & $1F == 0, "The size must be a multiple of 32"
assert \3 <= 32 * 256, "The size must be less than or equal to 8192"
ld de, \1 ; 3
ld hl, \2 ; 3
ld c, (\3 / 32) & 255 ; 2
.loop\@
; copy 32 bytes
rept 32
ld a, [de] ; 2
ld [hli], a ; 2
inc de ; 2
endr
; check if the copy is over
dec c ; 1
jr nz, .loop\@ ; 3
endm
A single iteration costs 196 cycles (32 * 6 + 1 + 3) to copy 32 bytes.
There are 256 iterations (8,192 / 32). Because the final jr is not taken and costs 2 cycles instead of 3, the loop takes 50,175 cycles (196 * 256 - 1). Adding the 8 cycles needed to set the registers before the loop begins gives a grand total of 50,183 cycles.
BGB reports 1880E (hex), or 100,366 clocks.
Dividing the clocks by 2 yields 50,183 cycles, which is again exactly what was expected.
This is more than twice as fast as the brute force approach.
The pop slide is a clever trick that leverages the stack pointer to bypass standard loading instructions.
By pointing the stack pointer sp at the source data and using pop instead of ld, we can retrieve two bytes with a single 3-cycle instruction that also advances the source pointer, significantly speeding up the transfer.
The code for the pop slide is shown in Listing 5.
macro CopyData
assert \3 & $1F == 0, "The size must be a multiple of 32"
assert \3 <= 32 * 256, "The size must be less than or equal to 8192"
; save the stack pointer
ld [WRAM_STACK_POINTER], sp ; 5
ld sp, \1 ; 3
ld hl, \2 ; 3
ld c, (\3 / 32) & 255 ; 2
.loop\@
; copy 32 bytes
rept 16
pop de ; 3
ld a, e ; 1
ld [hli], a ; 2
ld a, d ; 1
ld [hli], a ; 2
endr
; check if the copy is over
dec c ; 1
jr nz, .loop\@ ; 3
; restore the stack pointer
ld hl, WRAM_STACK_POINTER ; 3
ld a, [hli] ; 2
ld h, [hl] ; 2
ld l, a ; 1
ld sp, hl ; 2
endm
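Listing 5 relies on WRAM_STACK_POINTER, a two-byte location in work RAM where sp is parked during the copy. Its definition is not shown in this article. One way to provide it, assuming a plain label in a WRAM0 section (the section name is made up for this sketch, and the actual sample may define it differently, for instance as a constant address), would be:

```asm
; Assumption: reserve two bytes of work RAM for the saved stack pointer.
section "Pop slide variables", wram0
WRAM_STACK_POINTER:
    ds 2 ; low byte first, then high byte, as written by ld [n16], sp
```

The byte order matters for the restore sequence at the end of the macro: ld a, [hli] reads the low byte and ld h, [hl] reads the high byte.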
Let's count the cycles once more.
A single iteration costs 148 cycles (16 * 9 + 1 + 3) to copy 32 bytes.
There are 256 iterations (8,192 / 32). Because the final jr is not taken, the loop takes 37,887 cycles (148 * 256 - 1). Adding the 13 cycles needed to save the stack pointer and set the registers before the loop, and the 10 cycles needed to restore the stack pointer afterwards, gives a grand total of 37,910 cycles.
BGB reports 1282C (hex), or 75,820 clocks.
Dividing the clocks by 2 yields 37,910 cycles, which is yet again exactly what was expected.
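To see where the savings come from, it helps to compare what each method pays to move two bytes. The sketch below uses hypothetical constant names and lets the assembler confirm the pop slide total. It is an illustration under those assumptions, not part of the original samples.

```asm
; Two bytes with the optimized loop:
; 2 x (ld a, [de] / ld [hli], a / inc de)
def LD_PAIR     equ 2 * (2 + 2 + 2)     ; 12 cycles
; Two bytes with the pop slide:
; pop de / ld a, e / ld [hli], a / ld a, d / ld [hli], a
def POP_PAIR    equ 3 + 1 + 2 + 1 + 2   ; 9 cycles

def POP_SETUP   equ 5 + 3 + 3 + 2       ; save sp, then load sp, hl and c
def POP_RESTORE equ 3 + 2 + 2 + 1 + 2   ; reload sp from WRAM
; 16 pairs per iteration, 256 iterations, final jr not taken (-1)
def POP_LOOP    equ (16 * POP_PAIR + 1 + 3) * 256 - 1
def POP_TOTAL   equ POP_SETUP + POP_LOOP + POP_RESTORE

    assert LD_PAIR - POP_PAIR == 3     ; 1.5 cycles saved per byte
    assert POP_TOTAL == 37910          ; cycles computed by hand
    assert POP_TOTAL * 2 == $1282C     ; clocks reported by BGB
```

The 3 cycles saved per pair add up to 12,288 cycles over 8 KiB, which more than covers the 15 extra cycles spent saving and restoring sp.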
Crucial note: because this method repurposes the stack pointer, interrupts must be disabled (using di) during the copy.
If an interrupt occurs, the CPU will attempt to push the program counter onto a "stack" that is actually our source data, leading to crashes or data corruption.
If you aren't already running in a context where interrupts are off (like inside the VBlank interrupt handler), be sure to sandwich the routine between di and ei.
In the sample, interrupts are already disabled when we perform the copy, because the code runs during the initialization of the sample, which is why we use neither di nor ei.
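When the copy does run with interrupts enabled, the guard could look like the following sketch. It reuses the CopyData macro and the arguments from Listing 2, and it assumes interrupts were enabled beforehand, since the final ei turns them back on unconditionally.

```asm
    di                                      ; no interrupt may fire while sp points at the source data
    CopyData graphics_data, _VRAM, 8 * 1024 ; pop slide version of the macro (Listing 5)
    ei                                      ; takes effect after the instruction that follows it
```

If the calling code may itself run with interrupts disabled, leave out the ei and let the caller decide when to re-enable them.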
In this article, we studied three methods to copy data on the Game Boy, with a focus on the pop slide. You should now have a good grasp of how to implement the pop slide, understand its benefits and drawbacks, and be able to use it properly in your games. Table 1 lists the copy methods along with their total cycle cost and cycles per byte.
| Method | Total Cycles | Cycles/Byte |
|---|---|---|
| Brute force | 106,504 | 13.00 |
| Optimized | 50,183 | 6.13 |
| Pop Slide | 37,910 | 4.63 |
The pop slide is an excellent tool for intensive vblank tasks. While it requires care regarding interrupts, the speed boost is worth the effort. Note, however, that if you are developing for the Game Boy Color, you should use hardware DMA for VRAM transfers, as it achieves an impressive 2 cycles per byte that cannot be matched by any other method.
The image used in the samples is from xCrossbite.