
When Manual Testing Becomes Unsustainable

Who's this for?

If you've ever had to manually verify dozens (or hundreds) of component variants, this is for you. You should be comfortable with testing frameworks; the rest we'll figure out together.

There’s a moment when you realize your testing process is broken. For me, it was staring at 120 button variations just to verify a change to a single line of CSS.

How did I get here? At work, I maintain multiple design systems for our clients that we use in our products. Each one kept growing — more components, more variants, more states to verify. What started as simple manual checks eventually became that moment.

One of the main challenges I faced was making sure no one could ever break them through a side effect of their changes. This required a rigid testing structure: if the pipeline is red, the PR cannot be merged. If the PR touches any test, it needs careful review: no “LGTM 🚀” drive-bys.

This worked well for behavior testing — tests that evaluate a component’s behavior in response to various inputs and interactions. For example, something like this:

it(
  'does not call `onClick` when button is in loading state',
  async ({ expect }) => {
    const spy = vi.fn()
    const screen = render(<Button status="loading" onClick={spy} />)

    await userEvent.click(screen.getButton())

    expect(spy).not.toHaveBeenCalled()
  }
)

But how do we check that nothing broke visually? This can be solved with visual regression testing, but that methodology is infamous for being painful to set up and manage, so we set it aside and tested everything manually…

Everything was fine for a while, but the design systems kept growing. This happened organically — turns out you can’t build apps with just buttons and accordions — so after a couple of months our initially simple <Button /> ended up with this interface:

interface ButtonProps {
  variant?: "primary" | "secondary" | "destructive" | "link" | "icon"
  size?: "large" | "normal" | "small"
  status?: "disabled" | "loading"
}

Now let’s say we have to change some styling for the "primary" variant: what do we have to check to be sure everything is still fine? Well, all available sizes and states (3 sizes ⨯ 2 statuses), for a total of 6 possible variations. This might not sound like much, but you have to mentally track each one while also checking hover, active, and focused states. Thankfully Storybook simplifies this, but it’s still a lot of cognitive work and it’s extremely easy to miss something.

What if we have to change a base style, though? Then we have to check all possible variations of this button and their pseudo-states: 120 manual checks, one by one, slowly changing props, making sure not to skip any of them, hovering the button, clicking on it, focusing it with the keyboard. All while continuously switching focus between how the button is supposed to look and the controls and interactions.
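Where does that 120 come from? Counting against the interface above:

5 variants ⨯ 3 sizes ⨯ 2 statuses = 30 combinations
30 combinations ⨯ 4 states (rest, hover, active, focus) = 120 checks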

As you can imagine, manual testing quickly became unsustainable. Checking our changes was exhausting — we spent more time and mental energy verifying than implementing.

I couldn’t expect other developers to be this thorough for what’s essentially a one-line change, so I decided visual regression testing had to become part of our testing strategy. After evaluating options, we landed on Azure App Testing and Playwright, as they let us store screenshots directly in the repo, making them part of code review.

So now that everything was in place, the only thing left to do was to write the tests.

Like every good developer, I started writing them. The first one was simple enough:

const screen = render(<Button variant="primary" size="small" />)

await expect(screen.getButton()).toMatchScreenshot()

await userEvent.hover(screen.getButton())
await expect(screen.getButton()).toMatchScreenshot()

await userEvent.click(screen.getButton())
await expect(screen.getButton()).toMatchScreenshot()

await userEvent.tab()
await expect(screen.getButton()).toMatchScreenshot()

For the remaining combinations, it was enough to copy-paste this block over and over, changing only the props:

const screen = render(<Button variant="primary" size="normal" />)
/* ... */
 
const screen = render(<Button variant="primary" size="large" />)
/* ... */

After the third one, I stopped. This was too time-consuming, error-prone, and fragile — what happens if we add or remove a prop? Even worse, it was mechanical work… There had to be a better way; otherwise this would turn into a maintenance burden instead of simplifying anything.

And then it finally clicked.

This wasn’t a testing problem anymore — it was a combinatorics problem.

My internal monologue went something like: “Wait… I’ve already seen this pattern in school. Is this just a Cartesian product?”

Once that clicked, everything fell into place.

For those who don’t know, the Cartesian product is the set of all possible ordered tuples (pairs, in the two-set case) formed by taking one element from each of the sets involved.

set A = { 🌶️, 🥬 }
set B = { 👍, 👎 }
 
A ⨯ B = { 🌶️👍, 🌶️👎, 🥬👍, 🥬👎 }
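In code, the two-set case is nothing more than two nested loops:

const A = ['🌶️', '🥬'] as const
const B = ['👍', '👎'] as const

// Pair every element of A with every element of B.
const product: string[] = []
for (const a of A) {
  for (const b of B) {
    product.push(a + b)
  }
}
// product: ['🌶️👍', '🌶️👎', '🥬👍', '🥬👎']

The general version simply repeats that step once per set, which is exactly what the implementation at the end of this post does.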

I was onto something. Ideas started flowing: Can this be automated? How much of it? Can I build guardrails to prevent misuse?

I prototyped a framework around it, and it’s been running in production for nearly two years. Which brings me to an important point: I often see developers arguing against building custom solutions because they require maintenance. I disagree.

Since that first implementation, the code has been touched exactly three times: first when we switched from Playwright on Storybook to Playwright Component Testing, second when I mocked Date to return a fixed value, and third when we switched to Vitest’s browser mode. The first change took around 4 hours (because of Playwright Component Testing’s complexity), while the third took less than half an hour. No test had to change, at all. If I had to introduce those changes in “raw tests”, I believe the estimate would be in weeks.

For the curious, this is what our tests look like:

visualRegressionTest(
  'Button',
  ({
    generators: { variations },
    modifiers: {
      enableHoverTesting,
      enableClickTesting,
      enableFocusTesting,
    },
  }) => [
    variations(
      {
        variant: buttonProperties.variants,
        size: buttonProperties.sizes,
        status: buttonProperties.states,
      },
      {
        modifiers: [
          enableHoverTesting,
          enableClickTesting,
          enableFocusTesting,
        ],
      },
    ),
  ],
)

Conceptually, this function is straightforward: we specify the Button component from all available design systems, define the props we want to test, and add modifiers that capture different states (hover, active, and focused). If we remove all the noise, in just a handful of lines we declaratively generate 30 tests, each producing 4 screenshots. Automatically, and with minimal maintenance required.
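If you’re curious what something like this might do under the hood, here’s a minimal sketch, assuming the same browser-mode setup (including the toMatchScreenshot matcher) as the earlier snippets, and assuming the package exports its core function as variations. To be clear: Screen, Modifier, and generateVisualTests are hypothetical names invented for illustration, not the real framework’s API.

import { expect, it } from 'vitest'
import { variations } from '@wluwd/variations' // or the simplified version below

// Assumed shape of what `render` returns in the earlier snippets.
type Screen = { getButton: () => Element }

// A modifier drives the subject into a state (hover, click, focus, …)
// so the next screenshot captures it.
type Modifier = (screen: Screen) => Promise<void>

const generateVisualTests = <Props extends Record<string, readonly unknown[]>>(
  name: string,
  renderSubject: (props: { [K in keyof Props]: Props[K][number] }) => Screen,
  props: Props,
  modifiers: Modifier[],
) => {
  // The Cartesian product does the heavy lifting: one `it` per combination.
  const combinations = variations(
    props as Record<string, readonly unknown[]>, // cast to keep the sketch's generics simple
  ) as Array<{ [K in keyof Props]: Props[K][number] }>

  for (const combination of combinations) {
    it(`${name} / ${JSON.stringify(combination)}`, async () => {
      const screen = renderSubject(combination)

      // Baseline screenshot…
      await expect(screen.getButton()).toMatchScreenshot()

      // …plus one per modifier.
      for (const modifier of modifiers) {
        await modifier(screen)
        await expect(screen.getButton()).toMatchScreenshot()
      }
    })
  }
}

With that shape in place, the declarative call above is reduced to data: which props to combine, and which states to capture.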

I feel like this fully embraces the “make the right thing easy, and the wrong thing hard” idea. You can do a lot with it, but only within the allowed boundaries; everything else should probably live in a different test.

What can you take from this?

  • Even experienced developers sometimes miss the obvious.
  • Your testing strategy should evolve with your requirements and product.
  • Recognizing patterns you already know can unlock better solutions.

You don’t need to build something as abstract as my meta-framework. Even just the variations function helps tremendously.

For your convenience, I’ve published it as @wluwd/variations. It’s performant, well tested, and includes the core function plus extras like lazy generators and filtering. You can install it directly, or get this simplified version:

/**
 * Represents all possible variations of key-value pairs from an object, where each key has an array of possible values.
 *
 * Each combination consists of one value from each key's array.
 *
 * @example
 * type Input = {
 *   color: ['red', 'blue'];
 *   size: ['small', 'large'];
 * };
 *
 * type Result = Variations<Input>;
 * //   ^? => { color: "red" | "blue"; size: "small" | "large"; }[]
 */
export type Variations<
	BaseObject extends Record<string, readonly unknown[]>,
> = {
	[key in keyof BaseObject]: BaseObject[key][number]
}[]
 
/**
 * Generates all possible variations of key-value pairs from the given object, where each key has an array of possible values.
 *
 * @param baseObject - The input object where each key has an array of possible values.
 * @returns An array of objects, each representing a unique combination of key-value pairs.
 *
 * @example
 * const input = { color: ['red', 'blue'], size: ['small', 'large'] };
 * const result = variations(input);
 * // result: [
 * //   { color: 'red', size: 'small' },
 * //   { color: 'red', size: 'large' },
 * //   { color: 'blue', size: 'small' },
 * //   { color: 'blue', size: 'large' }
 * // ]
 */
export const variations = <
	const BaseObject extends object,
>(
	baseObject: { [k in keyof BaseObject]: readonly BaseObject[k][] },
): Variations<typeof baseObject> => {
	const entries = Object.entries(baseObject)
 
	// Start with one empty combination and extend it one key at a time.
	let variations: [string, unknown][][] = [[]]

	for (const [key, values] of entries) {
		const tmp: [string, unknown][][] = []

		for (const v1 of variations) {
			if (Array.isArray(values)) {
				// Branch every partial combination once per possible value.
				for (const v2 of values) {
					tmp.push([...v1, [key, v2]])
				}
			}
		}

		variations = tmp
	}
 
	return variations.map(Object.fromEntries) as Variations<typeof baseObject>
}

Even with only this function, your tests become so much simpler:

it.for(variations<ButtonProps>({
  variant: ['primary', 'secondary', 'link'],
  size: ['small', 'normal', 'large'],
}))(
  "Button / %o",
  (props) => {
    const screen = render(<Button {...props} />)
    /* now take your screenshots! */
  },
)

A word of caution: this approach makes it tempting to test everything in one place — visual regressions, behaviors, interactions, all together. Don’t. Keep visual regression testing separate from behavior testing. They’re two worlds apart. Both should run in your PR pipeline, but locally you’ll want to be selective. When you’ve changed behavior, skip the visual suite. When you’ve changed styling, skip the behavior suite. You wouldn’t mix end-to-end tests with unit tests, right? Same principle applies here.
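If you’re on Vitest, one convenient way to enforce that split is to declare the suites as separate projects and pick one from the CLI. A minimal sketch (the project names and globs are just examples):

// vitest.config.ts (sketch only: adjust the globs to your layout)
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    projects: [
      { test: { name: 'behavior', include: ['**/*.behavior.test.tsx'] } },
      { test: { name: 'visual', include: ['**/*.visual.test.tsx'] } },
    ],
  },
})

Locally, vitest --project behavior or vitest --project visual runs just the suite you need; CI runs both.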

You know, looking back, the solution was hiding in plain sight — it just took me time to notice it. Sometimes the best abstractions come from recognizing patterns you already know, even if they’re buried in that math lecture from 10 years ago.