Modeling what makes paper-folding puzzles hard

I built a daily paper-folding puzzle game and needed a way to generate puzzles with a consistent difficulty curve. Simulating the folds was straightforward. Figuring out why some puzzles feel impossible while others click instantly was not.

Mert Aslan · March 2026

You've probably done this as a kid. Take a piece of paper, fold it a couple of times, punch a hole through the layers, then unfold it and see where all the holes ended up. It's fun. It's also, as it turns out, a standardized cognitive test.

The paper folding test shows up in real psychometric testing. The most direct version is Ekstrom et al.'s VZ-2 “Paper Folding Test” from their 1976 Kit of Factor-Referenced Cognitive Tests. It measures spatial visualization, your ability to mentally manipulate 2D objects through sequences of transformations. When I started building, I assumed difficulty would scale with grid size and fold count. After a lot of playtesting, that turned out to be only part of the story.

I turned this into Daily Unfold, a daily puzzle game. There are three difficulties each day: easy is a 4x4 grid with one fold, medium is 6x6 with two folds, and hard is 6x6 with three folds and two punches. The game generates puzzles deterministically from the date, so everyone in the world gets the same three puzzles. Making that difficulty curve feel right was the real problem.

Simulating the folds

The core engine runs a forward simulation. For every cell in the original unfolded grid, it traces that cell through each fold one by one. If a cell is on the folding side, its coordinate gets mirrored across the fold line:

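// For a horizontal fold at position p (fold line between rows p and p + 1),
// in the case where the top side is the smaller one and folds down:
if (row <= p) {
  row = 2 * p + 1 - row;  // mirror across the fold line
}
row = row - (p + 1);        // shift so the remaining paper starts at row 0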

The smaller side always folds onto the larger side. After mirroring, coordinates get shifted so the remaining visible portion starts at zero. Each fold shrinks the effective grid, and the next fold operates on whatever is left.

Once all folds are applied, the engine checks if the cell's final position matches any punch location. If so, that cell has a hole in the unfolded paper. With three folds, a single punch can go through up to 2³ = 8 layers, producing up to 8 holes.
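Here is a minimal sketch of that forward simulation, covering both cases of "the smaller side folds onto the larger side". The data shapes are my assumptions for illustration, not the game's actual code: a fold is `{ axis: "h" | "v", pos }` with the fold line between index `pos` and `pos + 1` of the current (already folded) grid, punches are given in the coordinates of the fully folded paper, and on a tie the top or left side is the one that folds over.

```javascript
// Fold one coordinate along a single axis.
// `size` is the current extent along that axis; the fold line sits between
// index p and p + 1. Returns the new coordinate and the new extent.
function foldCoord(x, size, p) {
  const near = p + 1;       // cells before the fold line
  const far = size - near;  // cells after the fold line
  if (near <= far) {
    // Near side is smaller (or equal): mirror it onto the far side,
    // then shift so the remaining paper starts at zero.
    const mirrored = x <= p ? 2 * p + 1 - x : x;
    return { x: mirrored - near, size: far };
  }
  // Far side is smaller: mirror it back onto the near side. No shift needed.
  const mirrored = x > p ? 2 * p + 1 - x : x;
  return { x: mirrored, size: near };
}

// Trace every cell of the unfolded grid through all folds and collect the
// cells that end up under a punch.
function unfoldedHoles(rows, cols, folds, punches) {
  const holes = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      let row = r, col = c, h = rows, w = cols;
      for (const fold of folds) {
        if (fold.axis === "h") {
          ({ x: row, size: h } = foldCoord(row, h, fold.pos));
        } else {
          ({ x: col, size: w } = foldCoord(col, w, fold.pos));
        }
      }
      if (punches.some((p) => p.row === row && p.col === col)) {
        holes.push({ row: r, col: c });
      }
    }
  }
  return holes;
}

// 4x4 grid, one horizontal center fold, one punch in the folded paper.
const exampleHoles = unfoldedHoles(
  4, 4,
  [{ axis: "h", pos: 1 }],
  [{ row: 0, col: 1 }]
);
console.log(exampleHoles); // holes at (1,1) and (2,1) in the unfolded grid
```

Tracing forward from every unfolded cell is O(cells x folds), which is trivial at 6x6, and it avoids having to invert the folds to "unfold" a punch.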

[Figure: three panels showing a fold at center, a punch through the layers, and the unfolded result with 2 holes.]

A center fold creates a clean mirror. The punch at (2,1) maps to (1,1) on the other side.

How I model difficulty

My first approach was obvious: bigger grid, more folds, harder puzzle. That produced a lot of puzzles that felt wrong. A 6x6 with two center folds could feel trivial, while a 4x4 with one weird fold stumped people. I needed something better.

After reading some spatial reasoning research and doing a lot of playtesting, I landed on a scoring function that weights six factors. The weights are hand-tuned, not learned from data. Here they are, ordered by weight:

1. Off-center folds (weight: 4x)

When a fold runs through the center of the grid, the result is perfectly symmetric. You can just see the mirror and you're done. In playtesting, center-fold puzzles got solved fast and with few errors.

Shift the fold line off-center and that shortcut disappears. Now the two sides are different sizes. One side wraps around the other unevenly. Some cells end up stacking, others don't. You can't just “mirror it” anymore. You have to trace each cell through the fold individually. Humans are very good at symmetry detection, and off-center folds take that tool away.

[Figure: a center fold, where holes mirror evenly and are easy to reason about, next to an asymmetric off-center fold, which breaks the symmetry shortcut.]

2. Hole spread (weight: 4x)

When all the holes cluster in one corner, you can kind of focus your attention there and ignore the rest. When they scatter across every row and every column, you can't chunk the problem anymore. Every single cell in the grid needs its own evaluation.

The engine measures this as the fraction of rows and columns that contain at least one hole. A spread of 1.0 means holes touch every row and every column. High spread forces you to think about the entire grid instead of just a region.
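One way to read that definition is "distinct rows with a hole plus distinct columns with a hole, divided by total rows plus total columns". The sketch below implements that reading. The function name `holeSpread` and the `{ row, col }` hole shape are hypothetical, and the game may combine the row and column fractions differently.

```javascript
// Spread = (rows containing a hole + columns containing a hole)
//          / (total rows + total columns).
// ASSUMPTION: this is one interpretation of the definition in the text.
function holeSpread(holes, rows, cols) {
  const usedRows = new Set(holes.map((h) => h.row));
  const usedCols = new Set(holes.map((h) => h.col));
  return (usedRows.size + usedCols.size) / (rows + cols);
}

// Two holes stacked in one column of a 4x4 grid: 2 rows + 1 column out of 8.
const clustered = holeSpread([{ row: 1, col: 1 }, { row: 2, col: 1 }], 4, 4);
console.log(clustered); // 0.375
```

Under this reading, four holes along a diagonal of a 4x4 grid already reach a spread of 1.0, so the measure rewards scatter rather than raw hole count.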

3. Mixed axes (weight: 3x)

If every fold is horizontal, you're basically doing 1D reasoning. Rows mirror up and down, columns stay fixed. Easy enough. Throw in a vertical fold and suddenly you're tracking transformations in two directions at once. Working memory fills up fast.

For hard puzzles, the engine actually forces mixed axes during generation. If all the randomly generated folds happen to land on the same axis, it flips one of them so you're guaranteed to need 2D spatial reasoning.

4. Punch count (weight: 2.5x)

Each punch creates its own independent set of holes. Two punches means two separate tracing tasks through the same fold sequence. Easy and medium puzzles get one punch. Hard gets two, which roughly doubles the work without adding any new mechanical complexity.

5. Fold count (weight: 2x)

More folds means more transformations to simulate in your head. In playtesting, adding a second center fold felt less disruptive than making one fold off-center. That said, at three folds the chaining alone gets taxing regardless of position.

6. Hole count & grid size (weight: 1x / 2x)

More holes means more to find, and a 6x6 grid gets a flat 2-point bonus over 4x4 for the larger search space. These are the obvious difficulty knobs, but in practice a puzzle with many holes from a center fold still felt easier than one with few holes from an off-center fold.

The general pattern: factors that disrupt mental shortcuts (off-center folds, scattered holes) got higher weights than factors that just add more stuff to process (hole count, grid size). The scoring function cares more about how a puzzle is structured than how much stuff is in it.

The scoring formula

score = holes      x 1.0
      + folds      x 2.0
      + offCenter  x 4.0
      + mixedAxes  x 3.0
      + punches    x 2.5
      + spread     x 4.0
      + gridBonus  x 2.0

Easy puzzles typically score 8-12. Medium lands around 13-22. Hard puzzles score 22+.
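As a worked example, here is the formula as a small function applied to a typical easy puzzle. The feature object and the name `scoreFromFeatures` are assumptions for illustration. In particular, the text above doesn't pin down whether `offCenter` is a count of off-center folds or a 0/1 flag, so the sketch just multiplies whatever number it is given; `gridBonus` is taken as 1 for 6x6 and 0 for 4x4 so that the 2.0 weight yields the flat 2-point bonus.

```javascript
// Weights copied from the scoring formula above.
const WEIGHTS = {
  holes: 1.0,
  folds: 2.0,
  offCenter: 4.0,
  mixedAxes: 3.0,
  punches: 2.5,
  spread: 4.0,
  gridBonus: 2.0,
};

// Weighted sum over the feature values. Every key in WEIGHTS must be present.
function scoreFromFeatures(features) {
  return Object.keys(WEIGHTS).reduce(
    (sum, key) => sum + WEIGHTS[key] * features[key],
    0
  );
}

// Easy puzzle: 4x4, one center fold, one punch, two holes in one column.
const easyFeatures = {
  holes: 2,
  folds: 1,
  offCenter: 0,
  mixedAxes: 0,
  punches: 1,
  spread: 0.375, // (2 rows + 1 column) / 8, under one reading of "spread"
  gridBonus: 0,
};
console.log(scoreFromFeatures(easyFeatures)); // 2 + 2 + 0 + 0 + 2.5 + 1.5 + 0 = 8
```

That lands at the bottom of the easy range, which fits: a single center fold with clustered holes is about as gentle as the generator gets.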

Generating puzzles that feel right

Every puzzle is generated deterministically from the date. Today's hard puzzle comes from hashing the string UNFOLD_HARD_2026-03-25. Same date, same seed, same puzzle, no matter what device you're on.

But random generation doesn't always produce good puzzles. Sometimes a “hard” puzzle rolls lucky folds and ends up feeling easier than the medium. Sometimes the hole count is boring. So there's a validation step after generation.

After generating all three puzzles for the day, the engine checks that the cognitive scores are properly ordered: hard > medium > easy. If something is out of order, it regenerates the offending puzzle with an offset seed: UNFOLD_HARD_2026-03-25_v1, then _v2, up to 10 retries per difficulty.

The rerolling itself is deterministic. The version suffix is part of the seed, so the search for a good puzzle follows the exact same path every time. No server, no database. Just a hash function and a PRNG.

// The full generation pipeline
const seed = hash("UNFOLD_HARD_2026-03-25");
const rng  = new SeededRandom(seed);

let puzzle = generate(rng, "hard");
let score  = cognitiveScore(puzzle);

// Reroll if the difficulty curve is wrong: hard must score strictly above
// mediumScore, the score of the medium puzzle already generated for this date
for (let v = 1; score <= mediumScore && v <= 10; v++) {
  const offsetSeed = hash("UNFOLD_HARD_2026-03-25_v" + v);
  puzzle = generate(new SeededRandom(offsetSeed), "hard");
  score  = cognitiveScore(puzzle);
}

Hard puzzles also have structural constraints baked in. The engine guarantees at least one off-center fold (to break symmetry) and folds on both axes (to require 2D reasoning). If the random generation doesn't satisfy these, it mutates the folds, flipping an axis or shifting a position, then revalidates.
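The validation half of that step can be sketched as follows. The subtle part is that "off-center" has to be judged against the grid as it exists at the moment of each fold, since earlier folds shrink it. The fold shape `{ axis, pos }` and both function names are assumptions for illustration, and the mutation step (flipping an axis or shifting a position) is left out because its exact rules aren't described here.

```javascript
// Walk the fold sequence, tracking the shrinking grid, and report how many
// folds are off-center and whether both axes are used.
function analyzeFolds(rows, cols, folds) {
  let h = rows;
  let w = cols;
  let offCenter = 0;
  const axes = new Set();
  for (const { axis, pos } of folds) {
    const size = axis === "h" ? h : w;
    const near = pos + 1;       // cells before the fold line
    const far = size - near;    // cells after the fold line
    if (near !== far) offCenter++;
    axes.add(axis);
    // The smaller side folds onto the larger, so the larger side remains.
    if (axis === "h") h = Math.max(near, far);
    else w = Math.max(near, far);
  }
  return { offCenter, mixedAxes: axes.size > 1 };
}

// Hard puzzles need at least one off-center fold and folds on both axes.
function meetsHardConstraints(rows, cols, folds) {
  const { offCenter, mixedAxes } = analyzeFolds(rows, cols, folds);
  return offCenter >= 1 && mixedAxes;
}

const candidate = [
  { axis: "h", pos: 2 }, // center fold on 6 rows, leaves 3
  { axis: "v", pos: 2 }, // center fold on 6 columns, leaves 3
  { axis: "h", pos: 0 }, // 1 row onto 2: off-center on the shrunken grid
];
console.log(meetsHardConstraints(6, 6, candidate)); // true
```

Note that the third fold in the example would be invisible to a check that only compared `pos` against the original 6-row grid.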

Level    Grid   Folds   Punches   Off-center   Mixed axes   Score range
Easy     4x4    1       1         no           n/a          ≤ 13
Medium   6x6    2       1         possible     possible     13 - 22
Hard     6x6    3       2         forced       forced       ≥ 22

Where this falls apart

The weights are hand-tuned. There's no regression, no training data, no formal validation. The spatial cognition literature gave me a starting direction (asymmetry and mixed-axis transformations seem disproportionately hard), and playtesting gave me the specific numbers.

There are definitely puzzles the model scores as “medium” that players find brutal. This usually happens when the fold sequence produces a hole pattern that doesn't match any simple mental template. Template-matching (when you recognize a pattern and skip the reasoning entirely) is a cognitive strategy the model doesn't capture at all.

It also ignores practice effects. Someone who has played 50 days builds fold-specific intuitions that a new player doesn't have. What feels hard on day 1 might be routine by day 30. You could build an adaptive system that adjusts to individual skill, but that breaks the “everyone gets the same puzzle” property that makes daily games fun to share.

With enough data (thousands of solves with timing), you could train a real model. Maybe a regression on solve time, or something that predicts error patterns per fold configuration. For now the hand-tuned heuristic gets the job done. The difficulty curve feels smooth for most players, and that's what matters.

One PRNG, no server

The whole thing runs client-side. No server generates puzzles, no database stores them. A date string gets hashed into a 32-bit integer, which seeds a mulberry32 PRNG. It's tiny (8 lines of JavaScript), fast, and has good distribution. Every random decision during generation (axis choice, fold position, punch location) comes from this one sequence.
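For reference, this is roughly what that looks like. The `mulberry32` function below follows the commonly circulated version of the generator. The string hash is an assumption on my part for this sketch (32-bit FNV-1a, named `hashString` here); any stable string-to-32-bit hash would serve the same role.

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units.
// ASSUMPTION: illustrative choice of hash. For ASCII seed strings like
// "UNFOLD_HARD_2026-03-25" this matches byte-wise FNV-1a.
function hashString(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force unsigned 32-bit
}

// mulberry32: returns a function that yields floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seed string, same sequence, on any device.
const rngA = mulberry32(hashString("UNFOLD_HARD_2026-03-25"));
const rngB = mulberry32(hashString("UNFOLD_HARD_2026-03-25"));
console.log(rngA() === rngB()); // true

// Example of drawing a decision from the stream: a fold axis.
const axis = rngA() < 0.5 ? "h" : "v";
console.log(axis);
```

Because every decision is drawn from one stream in a fixed order, changing the order of draws in the generator changes every puzzle, so the generation code effectively becomes part of the seed.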

The result is a fully deterministic system. Same date, same puzzles, every device, every browser, no network request. The rerolling mechanism works because appending _v1 to the seed string doesn't retry randomly. It follows a specific, reproducible alternative path through the generation space.

Further reading

If spatial cognition interests you:

  • Ekstrom et al. (1976), Kit of Factor-Referenced Cognitive Tests. The VZ-2 “Paper Folding Test” is essentially this game. It's the direct ancestor and the reason I knew the task was a real measure of spatial ability.
  • Shepard & Metzler (1971), “Mental Rotation of Three-Dimensional Objects.” Not about paper folding specifically, but the foundational work on mental spatial transformations. Their finding that rotation time scales linearly with angle is what got me thinking about fold count as a linear cost factor.
