Working in parallel — co-processors

Last lesson we hinted that there was one more chip to introduce — the co-processor. It’s the answer to a problem we keep running into: some kinds of work are important to do quickly but they’re the wrong shape for the CPU.

This lesson is about that chip: what it is for, how the CPU talks to it, and how the two run side by side.

What the CPU is bad at

The 6502 is a generalist. It has all the basics: load, store, add, subtract, AND, OR, XOR, branch, jump. With those primitives you can write any program. But “any program” doesn’t mean “any program at a useful speed.” A few things the 6502 is famously slow at:

Multiplication. No MUL instruction. To multiply two 16-bit numbers, the CPU loops through 16 bits of one operand, conditionally adding shifted copies of the other. That costs a few hundred clock cycles per multiply; this lesson uses about 300 as a round figure.
Division. Even worse. Loops, conditional subtracts, bookkeeping. Hundreds of cycles.
Floating-point anything. A 6502 has no concept of fractions (other than the stuff you simulate by hand). Software floating point is thousands of cycles per operation.
A 4×4 matrix multiplied by a 4-element vector (the kind of arithmetic 3D graphics requires) is sixteen multiplies and twelve adds. Even counting each add at only 5 cycles, that’s 16 × 300 + 12 × 5 = 4,860 cycles for a single vector. At 1 MHz, that’s about 5 ms. A 60-fps frame has about 16.7 ms total, so one vector eats nearly a third of the frame and a fourth vector no longer fits. Forget rendering a scene.

So if you need any of that done quickly, you need help.
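To make the multiplication cost concrete, here is a minimal sketch of the shift-and-add loop described above. It is one common way to write it, not the only one. The labels op_a_lo, op_a_hi, op_b_lo, op_b_hi, and result are the same names this lesson uses later; treating result+0 as the lowest byte, and running with decimal mode off, are assumptions of the sketch.

```asm
; Software 16x16 -> 32-bit multiply, shift-and-add.
; Inputs:  op_a (multiplicand), op_b (multiplier), each 16 bits.
; Output:  result+0 (lowest byte) .. result+3 (highest byte).
; Note:    op_b is shifted away to zero as a side effect.
; Assumes: decimal mode is off (D flag clear).
multiply_by_hand:
  LDA #$00
  STA result+2      ; only the upper half needs clearing; the lower
  STA result+3      ; half is filled by the 16 rotates below
  LDX #16           ; one pass per bit of op_b
mul_loop:
  LSR op_b_hi       ; shift op_b right by one...
  ROR op_b_lo       ; ...its lowest bit lands in carry
  BCC mul_no_add    ; bit was 0: nothing to add this pass
  CLC
  LDA result+2      ; bit was 1: add op_a into the upper half
  ADC op_a_lo
  STA result+2
  LDA result+3
  ADC op_a_hi
  STA result+3      ; carry now holds the 17th bit of the sum
mul_no_add:
  ROR result+3      ; rotate the whole 32-bit result right by one,
  ROR result+2      ; pulling in carry at the top
  ROR result+1
  ROR result+0
  DEX
  BNE mul_loop
  RTS
```

Every one of the 16 passes runs at least the two shifts, four rotates, and loop bookkeeping, and about half of them also run the seven-instruction add. That is where the hundreds of cycles go.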

A co-processor — built for one thing

A co-processor is a second chip on the same board, sharing the same address bus, data bus, and R/W line. From the CPU’s point of view, it looks like another peripheral in the I/O neighborhood. But where the CPU is general-purpose, a co-processor is a specialist.

Suppose the co-processor is built specifically to do 16-bit multiplications and divisions. Give it two operands and a command. It returns the answer in 4 clock cycles. Same answer the CPU would have gotten in ~300 cycles, but ~75× faster, and (this is the important part) the CPU doesn’t have to wait for it. The CPU hands off the work and goes back to the main program. When the co-processor is done, it pulls the IRQ line — same trick we met in lesson 4 — and the CPU picks up the result.

Where you've seen this pattern before

The cleanest historical example is the Intel 8087, a math co-processor designed to sit alongside an 8086 or 8088. The 8086 had no hardware floating point. The 8087 was hardware floating point. Programs using floating-point operations would emit special escape (ESC) instructions. The 8087 watched the same instruction stream, picked those instructions out, and crunched them, while the 8086 did little more than work out any memory address involved. The result sometimes took 100× less time than software emulation. The two chips ran in parallel. Same shape, just packaged differently.

When the 80486 shipped in 1989, Intel had merged the floating-point unit onto the same die as the CPU, and the standalone math co-processor (the 8087 and its successors, the 80287 and 80387) began to disappear. But the pattern, “this work belongs to a specialist running in parallel,” never went away. It just kept moving:

GPUs are specialists for “do this same operation on millions of pixels at once.”
DMA controllers are specialists for “move this block of bytes from here to there without bothering the CPU.”
Sound chips (the SID on a Commodore 64, modern audio DSPs) are specialists for “generate audio samples on a deadline.”
Network controllers are specialists for “deal with the packet protocol, drop the bytes in this buffer, raise IRQ when ready.”
TPUs / NPUs in modern phones are specialists for “do the matrix multiplications neural networks are made of.”

Every one of those is the same co-processor pattern in different clothes. The principle — give the specialist the work and have it tap you when done — hasn’t changed since 1980.

How you talk to a co-processor

The co-processor is memory-mapped, like the I/O controller. It owns a chunk of addresses — say $E000–$E00F — and uses each address for a different purpose. The shape might look like:

$E000  Operand A, low byte    ── CPU writes ──┐
$E001  Operand A, high byte                   │  inputs
$E002  Operand B, low byte                    │
$E003  Operand B, high byte   ── CPU writes ──┘

$E004  Command                ── CPU writes ──   $01 = MUL, $02 = DIV, ...
                                                 writing here also kicks the co-processor off
$E005  Status                 ── CPU reads  ──   bit 7: 1 = busy, 0 = ready

$E006  Result, byte 0         ── CPU reads  ──┐
$E007  Result, byte 1                         │  outputs
$E008  Result, byte 2                         │  (32-bit answer)
$E009  Result, byte 3         ── CPU reads  ──┘

This is a real pattern. A great many memory-mapped accelerators from the 1980s onward look structurally like this: a few “input” registers, a “command” register that doubles as a “go” trigger, a status flag, and “output” registers. The CPU writes the inputs, writes the command, then either polls the status or waits for an interrupt.
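If your assembler supports named constants, the register map can be written down once and reused. The sketch below assumes the common NAME = value syntax (some assemblers spell it EQU or .equ instead). Every name in it is invented for illustration; only the addresses and command values come from the map above.

```asm
; Hypothetical names for the co-processor's registers.
COPROC_A_LO   = $E000   ; operand A, low byte
COPROC_A_HI   = $E001   ; operand A, high byte
COPROC_B_LO   = $E002   ; operand B, low byte
COPROC_B_HI   = $E003   ; operand B, high byte
COPROC_CMD    = $E004   ; writing a command here starts the work
COPROC_STATUS = $E005   ; bit 7 set = busy
COPROC_RESULT = $E006   ; four bytes, $E006 through $E009

; Command values from the register map.
CMD_MUL = $01
CMD_DIV = $02

; With the names in place, "fire a multiply" reads like this:
  LDA #CMD_MUL
  STA COPROC_CMD
```

The assembled bytes are the same as with raw addresses. The names only make the code easier to read and the addresses easier to change in one place.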

The conversation, step by step

A 16×16 multiply via the co-processor looks like:

multiply_via_coproc:
  ; CPU writes the operands into the co-processor's input registers.
  LDA op_a_lo
  STA $E000
  LDA op_a_hi
  STA $E001
  LDA op_b_lo
  STA $E002
  LDA op_b_hi
  STA $E003

  ; CPU writes the MUL command. This also kicks the co-processor off.
  LDA #$01
  STA $E004

  ; ... do other useful work here while the co-processor churns ...

  ; Eventually, poll the status (or get IRQ'd):
wait:
  LDA $E005
  BMI wait        ; bit 7 set means "busy" — keep waiting

  ; Done. Read out the 32-bit result.
  LDA $E006
  STA result+0
  LDA $E007
  STA result+1
  LDA $E008
  STA result+2
  LDA $E009
  STA result+3
  RTS

Same LDA and STA we’ve used in every other lesson — the co-processor doesn’t need any new instructions. The STA $E004 line that fires the command is structurally identical to “turn on an LED at $D000” from lesson 4. The co-processor just happens to do something when you write to its command register, instead of just lighting a lamp.

Two ways to wait

The example above shows polling: the CPU reads the status register in a loop until the co-processor says “done.” That’s fine if the CPU has nothing better to do, but it defeats the point of the co-processor, which is that the CPU could be doing other useful work while it runs.

The better pattern uses interrupts:

CPU writes the operands.
CPU writes the command. The co-processor starts churning.
CPU goes back to the main loop and does whatever else the program needs to do — drawing a frame, handling input, running game logic.
When the co-processor finishes, it asserts IRQ.
The CPU’s IRQ handler reads the co-processor’s result registers, stashes the answer somewhere the main loop will see it (queue it up — see lesson 4), and returns.
The main loop picks up the answer when it’s ready for it.

That’s the whole reason for the IRQ-driven design we built up to. The CPU and the co-processor run in parallel.
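The interrupt-driven steps above can be sketched roughly as follows. Several things here are assumptions, not facts from this lesson: coproc_result (4 bytes) and coproc_done (1 byte) are hypothetical RAM locations, the co-processor is taken to be the only IRQ source, and reading its status register at $E005 is assumed to make it release the IRQ line. Real chips acknowledge interrupts in different ways.

```asm
; Setup done elsewhere (not shown): the IRQ vector at $FFFE/$FFFF
; points at irq_handler, and the main program has executed CLI.

irq_handler:
  PHA                 ; save A; this handler touches nothing else
  LDA $E005           ; ASSUMPTION: reading status acknowledges the IRQ
  LDA $E006           ; copy the 32-bit result out of the co-processor
  STA coproc_result+0
  LDA $E007
  STA coproc_result+1
  LDA $E008
  STA coproc_result+2
  LDA $E009
  STA coproc_result+3
  LDA #$01
  STA coproc_done     ; tell the main loop an answer is waiting
  PLA
  RTI

main_loop:
  ; ... draw a frame, handle input, run game logic ...
  LDA coproc_done
  BEQ main_loop       ; no answer yet: carry on with other work
  LDA #$00
  STA coproc_done     ; consume the flag
  ; ... use coproc_result+0 .. coproc_result+3 here ...
  JMP main_loop
```

One design choice to note: the main loop should finish using coproc_result before it dispatches the next command. Otherwise a second interrupt could overwrite the answer while it is still being read.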

[Figure: timeline with two lanes, CPU and CO-PROC. The CPU writes operands, fires the GO command, then keeps working on the main program. The co-processor churns in parallel. When it's done, it raises IRQ; the CPU briefly drops in, grabs the result, and keeps going. Start marker: CPU writes operands → fires GO. End marker: co-proc raises IRQ → CPU grabs result.]

The green lane is the CPU running the main program — uninterrupted, no pause. The purple lane is the co-processor doing its job. The dashed line at the start marks “CPU writes the operands and fires GO.” The line at the end marks “co-processor raises IRQ → CPU grabs the result.” The gap between them is parallel time — the CPU kept working the whole time the co-processor was working. That’s the whole point.

Compare with the lesson 4 timeline, where the CPU paused during the ISR. Here the ISR is just a tiny dip at the very end to grab the co-processor’s output. The big work happened in parallel.

Other co-processors

The mechanism is so general that almost every interesting peripheral on any system uses it. Some examples worth knowing about:

DMA (Direct Memory Access). A specialist whose entire job is “copy this block of memory from here to there as fast as the bus can carry it.” Game consoles use DMA to copy a freshly-prepared framebuffer into video memory during vblank. The CPU writes “copy N bytes from $2000 to $8000,” fires GO, and walks away. DMA raises IRQ when done. The CPU spends a few dozen cycles on dispatch, however large N is, instead of the dozen or more cycles per byte that a hand-written copy loop costs.
Sound chips (SID, NES APU, AY-3-8910, modern audio DSPs). The CPU writes “play note F4 on channel 1 with this envelope.” The chip generates the actual audio signal on its own. The CPU visits maybe 60 times a second to update the parameters. Between visits the chip steps its oscillators and envelopes thousands of times, with no help from the CPU.
Network interface controllers. A modern Ethernet chip is the same pattern with a much fancier wardrobe. Packets arrive over the wire, the chip drops them into a ring buffer in RAM via DMA, raises an IRQ. The CPU’s interrupt handler queues the packet for the network stack and returns. The CPU never touched the actual wire-level signaling.
GPUs. The most extreme version: hundreds to thousands of arithmetic units running in parallel, each doing the same operation on different data. The CPU writes “draw this triangle with this texture” and the GPU does millions of pixel-level decisions before raising the equivalent of IRQ.

In every case, the shape is identical to what you just read: memory-mapped registers, write the inputs, fire a command, signal when done, read the output. The 6502 era taught us this pattern. The industry has never stopped using it.
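As a sketch of how the DMA case maps onto the same shape, here is a dispatch routine next to the copy loop it replaces. The DMA register block at $E010–$E016 is invented for this example and does not describe any real chip. The 1,024-byte count is also an arbitrary choice; the source and destination addresses are the ones from the DMA description above.

```asm
; Hypothetical DMA controller registers:
;   $E010/$E011  source address, low/high
;   $E012/$E013  destination address, low/high
;   $E014/$E015  byte count, low/high
;   $E016        command: writing $01 starts the copy
dma_copy:
  LDA #$00
  STA $E010         ; source = $2000
  LDA #$20
  STA $E011
  LDA #$00
  STA $E012         ; destination = $8000
  LDA #$80
  STA $E013
  LDA #$00
  STA $E014         ; count = $0400 (1,024 bytes)
  LDA #$04
  STA $E015
  LDA #$01
  STA $E016         ; GO: the controller takes it from here
  RTS

; The same job by hand, for one 256-byte page only.
; The CPU is busy for every single byte.
copy_page_by_hand:
  LDX #$00
copy_loop:
  LDA $2000,X
  STA $8000,X
  INX
  BNE copy_loop     ; X wraps to 0 after 256 bytes
  RTS
```

The dispatch routine costs the same handful of stores whether the count is 16 bytes or 16,000. The hand-written loop costs four instructions per byte and needs extra code to go past one page.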

What’s next

That’s the whole system.

CPU — the chip in charge, the conductor.
Co-processor — the specialist running alongside the CPU.
RAM — the working memory that holds values.
ROM — the program memory that survives a power cycle.
Clock — the heartbeat that never stops.
I/O controller — the chip that talks to the outside world.
Video chip — the display, showing it all off.

Six chips and a clock signal. That’s it. From here you can build a working computer that runs real programs. Every concept stacks on the ones below it: bits → bytes → words → bit ops → arithmetic → shifts → CPU registers → instructions → memory → ROM → I/O → interrupts → display → co-processors.

Next, the series turns from what each piece is into how they all work together to run real code. That’s where your 6502 simulator takes over — same vocabulary, real silicon-accurate execution, and the actual programs that ran on the actual chips that shaped the industry.