Pop Slide on the Game Boy

This article explores the pop slide, a high-performance technique for rapid data copying on the Game Boy. This trick was introduced to me by Dave VanEe, a reader of my book, Game Boy Coding Adventure.

We will begin with a standard "brute force" copy, which is easy to implement but relatively slow. From there, we will refine it into an optimized version. Finally, we will implement the pop slide, the fastest way to copy data on the hardware, while navigating a few of its unique trade-offs. By the end, you will have the knowledge to select the perfect copy method for your own projects.


The Setup

To demonstrate these techniques, we’ll use the simple sample shown in Figure 1, applying each of our three copy methods.

Figure 1: The test sample

Full commented source code for the samples is available on GitHub. You will also find pre-built ROMs there, allowing you to test the performance on the BGB emulator or real hardware immediately.

These samples are written in assembly for the RGBDS toolchain. If you are new to assembly or looking for a deep dive into Game Boy hardware, my book, Game Boy Coding Adventure, provides a comprehensive, beginner-friendly guide to everything you need to know.

To prove the efficiency of the pop slide, we will measure the cycles required to copy 8 KiB of data into VRAM. We use the two macros in Listing 1 to interface with BGB’s built-in profiler.

; Reset BGB's clock counter.
macro ResetClockCounter
    ld d, d
    jr .skip\@
    dw $6464
    dw $0000
    db "Clock counter reset %ZEROCLKS%", 0
    .skip\@
endm

; Print BGB's clock counter value.
macro PrintClockCounter
    ld d, d
    jr .skip\@
    dw $6464
    dw $0000
    db "Cycles: %-8+LASTCLKS%", 0
    .skip\@
endm
Listing 1: Performance measurement macros

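ResetClockCounter resets the clock counter to zero, while PrintClockCounter displays the elapsed clocks in the BGB debug messages window. Two things to keep in mind when interpreting the results: BGB prints the value in hexadecimal, and it counts clocks rather than cycles. As the measurements below show, the reported clocks are twice the cycle counts annotated in the listings, so we divide by 2 before comparing.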

Listing 2 shows how we measure our CopyData routine:

ResetClockCounter
CopyData graphics_data, _VRAM, 8 * 1024
PrintClockCounter
Listing 2: Measuring the copy function performance

We simply sandwich the CopyData macro between ResetClockCounter and PrintClockCounter. Each time the sample runs, the number of clocks taken by the copy is printed to BGB's debug messages window.


Brute Force Copy

Our baseline method consists of a simple loop to copy data byte-by-byte, as shown in Listing 3.

macro CopyData
    ld de, \1           ; 3
    ld hl, \2           ; 3
    ld bc, \3           ; 3
    .loop\@
        ; copy a byte from source to destination
        ld a, [de]      ; 2
        ld [hli], a     ; 2
        inc de          ; 2

        ; check if the copy is over
        dec bc          ; 2
        ld a, c         ; 1
        or a, b         ; 1
        jr nz, .loop\@  ; 3
endm
Listing 3: Brute force approach

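Each iteration costs 13 cycles, so copying 8,192 bytes takes 106,496 cycles in the loop (13 * 8,192). Adding the 9 cycles that set up the registers and subtracting 1 for the final jr, which is not taken and therefore costs 2 cycles instead of 3, gives a total of 106,504 cycles. BGB reports 34010 (hex), or 213,008 clocks. Dividing by 2 yields 106,504, which confirms our computed cycle count.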


Optimized Copy

To improve performance, we can batch our copy operations. By unrolling the loop to handle 32 bytes at a time, we reduce the overhead caused by conditional jumps (jr) and simplify the loop counter by using an 8-bit register instead of a 16-bit one. While this increases the ROM footprint by 90 bytes, the cycle savings are significant.

macro CopyData
    assert \3 & $1F == 0, "The size must be a multiple of 32"
    assert \3 <= 32 * 256, "The size must be less than or equal to 8192"

    ld de, \1               ; 3
    ld hl, \2               ; 3

    ld c, (\3 / 32) & 255   ; 2
    .loop\@
        ; copy 32 bytes
        rept 32
        ld a, [de]          ; 2
        ld [hli], a         ; 2
        inc de              ; 2
        endr

        ; check if the copy is over
        dec c               ; 1
        jr nz, .loop\@      ; 3
endm
Listing 4: Optimized batch copy

The cost of a single iteration is 196 cycles (32 * 6 + 1 + 3) for the copy of 32 bytes. There are 256 iterations (8,192 / 32), which makes 50,175 cycles (196 * 256 - 1 for the last jr) for the loop and a grand total of 50,183 cycles when we add the 8 cycles to set the registers before the loop begins. BGB reports 1880E (hex), or 100,366 clocks. Dividing the clocks by 2 yields 50,183 cycles, which is again exactly what was expected. Performance-wise, this method is a massive improvement over the brute force approach.
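
The 90-byte increase in ROM footprint quoted above can be double-checked by adding up instruction sizes. The sketch below lets rgbasm do the arithmetic at build time; it emits no code, and the constant names are made up for this illustration. It assumes the usual instruction encodings: 3 bytes for a 16-bit immediate load, 2 bytes for an 8-bit immediate load or a jr, and 1 byte for every other instruction used in Listings 3 and 4.

```asm
; Hypothetical build-time check of the ROM footprint; names are illustrative only.
; Listing 3: three 16-bit loads, then an 8-byte loop body.
def BRUTE_FORCE_BYTES equ 3 * 3 + (1 + 1 + 1) + (1 + 1 + 1 + 2)
; Listing 4: two 16-bit loads, ld c, 32 unrolled copies, then the loop test.
def OPTIMIZED_BYTES   equ 2 * 3 + 2 + 32 * (1 + 1 + 1) + (1 + 2)

assert BRUTE_FORCE_BYTES == 17
assert OPTIMIZED_BYTES == 107
assert OPTIMIZED_BYTES - BRUTE_FORCE_BYTES == 90, "Unexpected footprint difference"
```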


The Pop Slide

The pop slide is a clever trick that leverages the stack pointer to bypass standard load instructions. By pointing the stack pointer (sp) at the source data and using pop instead of ld, we can retrieve two bytes at a time, significantly speeding up the transfer. Each pop de loads the byte at sp into e and the following byte into d, then advances sp by 2, which is why the code writes e before d. The code for the pop slide is shown in Listing 5.

macro CopyData
    assert \3 & $1F == 0, "The size must be a multiple of 32"
    assert \3 <= 32 * 256, "The size must be less than or equal to 8192"

    ; save the stack pointer
    ld [WRAM_STACK_POINTER], sp       ; 5

    ld sp, \1                         ; 3
    ld hl, \2                         ; 3

    ld c, (\3 / 32) & 255             ; 2
    .loop\@
        ; copy 32 bytes
        rept 16
        pop de                        ; 3
        ld a, e                       ; 1
        ld [hli], a                   ; 2
        ld a, d                       ; 1
        ld [hli], a                   ; 2
        endr

        ; check if the copy is over
        dec c                         ; 1
        jr nz, .loop\@                ; 3

    ; restore the stack pointer
    ld hl, WRAM_STACK_POINTER         ; 3
    ld a, [hli]                       ; 2
    ld h, [hl]                        ; 2
    ld l, a                           ; 1
    ld sp, hl                         ; 2
endm
Listing 5: The pop slide technique

Let's go through measuring the cycles again. The cost of a single iteration is 148 cycles (16 * 9 + 1 + 3) for the copy of 32 bytes. There are 256 iterations (8,192 / 32), which makes 37,887 cycles (148 * 256 - 1 for the last jr) for the loop and a grand total of 37,910 cycles when we add the 13 cycles to set the registers before the loop begins, and the 10 cycles to restore the stack pointer to its original value after the loop. BGB reports 1282C (hex), or 75,820 clocks. Dividing the clocks by 2 yields 37,910 cycles, which is yet again exactly what was expected.
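
Listing 5 saves sp to WRAM_STACK_POINTER, whose declaration is not shown in this article. Here is a minimal sketch of one way to reserve it, assuming a two-byte slot in work RAM. The section name is made up, and the sample on GitHub may declare it differently.

```asm
; Hypothetical declaration; the section name is illustrative.
section "stack_pointer_backup", wram0
; ld [n16], sp stores the low byte first, then the high byte,
; so two consecutive bytes are needed.
WRAM_STACK_POINTER:
    ds 2
```

This layout matches the restore sequence in Listing 5, which reads the low byte into a (later moved to l) and the high byte into h before executing ld sp, hl.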

Crucial Note: Because this method repurposes the stack pointer, interrupts must be disabled (using di) during the copy. If an interrupt fires while sp points into the source data, the CPU pushes the program counter just below the address sp currently holds rather than onto the real stack, leading to crashes or data corruption. If you aren't already running in a context where interrupts are off (such as inside the vblank handler), be sure to sandwich the routine between di and ei. In the sample, interrupts are already disabled during the copy because it runs as part of the sample's initialization, which is why we use neither di nor ei.
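
If the copy runs at a point where interrupts may be enabled, the sandwich described above could look like the sketch below. It reuses the arguments from Listing 2 and assumes interrupts were enabled beforehand, since ei turns them back on unconditionally.

```asm
    di                                       ; no interrupt may fire while sp points at the source data
    CopyData graphics_data, _VRAM, 8 * 1024  ; the pop slide macro from Listing 5
    ei                                       ; sp has been restored by now, so interrupts are safe again
```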


Conclusion

In this article, we studied three methods for copying data on the Game Boy, with a focus on the pop slide. You should now have a good grasp of how to implement it, understand its benefits and drawbacks, and be able to use it properly in your games. Table 1 lists the copy methods along with their total cycle cost and cycles per byte.

Table 1: Copy Method Performance

Method        Total Cycles   Cycles/Byte
Brute force   106,504        13.00
Optimized     50,183         6.13
Pop Slide     37,910         4.63
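
The totals in Table 1 can be re-derived from the per-instruction cycle counts annotated in Listings 3 to 5. The sketch below has rgbasm do the arithmetic at build time; it emits no code, and every constant name is invented for this illustration.

```asm
; Hypothetical build-time cross-check; all names are illustrative.
def COPY_SIZE   equ 8 * 1024
def ITERATIONS  equ COPY_SIZE / 32      ; the unrolled loops copy 32 bytes per pass

; setup + loop body * passes - 1 (the final jr is not taken: 2 cycles, not 3)
def BRUTE_FORCE equ 9 + 13 * COPY_SIZE - 1
def OPTIMIZED   equ 8 + (32 * 6 + 1 + 3) * ITERATIONS - 1
; the pop slide also pays 10 cycles to restore sp after the loop
def POP_SLIDE   equ 13 + (16 * 9 + 1 + 3) * ITERATIONS - 1 + 10

assert BRUTE_FORCE == 106504
assert OPTIMIZED == 50183
assert POP_SLIDE == 37910

; BGB reports twice these values, in hexadecimal.
assert BRUTE_FORCE * 2 == $34010
assert OPTIMIZED * 2 == $1880E
assert POP_SLIDE * 2 == $1282C
```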

The pop slide is an excellent tool for intensive vblank tasks. While it requires care regarding interrupts, the speed boost is worth the effort. Note, however, that if you are developing for the Game Boy Color, you should use hardware DMA for VRAM transfers, as it achieves an impressive 2 cycles per byte that cannot be matched by any other method.


Art Attribution

The image used in the samples is from xCrossbite.