Hypothesis Property Testing: Find Edge Cases Automatically

Property-based testing with Hypothesis lets you define what your code must do. One classic rule: “encode, then decode, and you get the same input back.” Hypothesis then makes up hundreds of random inputs and hunts for cases that break the rule. You don’t write test cases by hand. You sketch the shape of valid inputs. The tool finds the off-by-one bugs, the odd Unicode strings, and the edge cases hiding in your code.
Install it with pip install hypothesis. Write strategies to sketch your input space, tag your test with @given(), and let the fuzzer do the work. Hypothesis is on version 6.151.x. It plugs into pytest and unittest, runs on any CI, and has found bugs in CPython’s standard library, NumPy, and dozens of other open-source projects.
Why Example-Based Tests Leave Gaps
A typical unit test for a sort function might check that sort([3, 1, 2]) returns [1, 2, 3] and sort([]) returns []. That covers two cases. But what about negative numbers? Duplicates? Very large lists? Lists with float('nan')? Lists of one element? These are the inputs that cause real bugs. They are also the inputs most developers forget to test.
The core problem with example-based testing is selection bias. You pick inputs that match your mental model of how the code works. If you wrote the code, you already have blind spots: the same blind spots that produced the bug in the first place.
Property-based testing flips the approach. Instead of listing input-output pairs, you list rules that must hold for all valid inputs. For a sort function, those rules might be:
- The output has the same length as the input
- Each item at index i is no greater than the item at i+1
- The output holds the same items as the input (same multiset)
Hypothesis then makes up random inputs: ints, strings, nested data, whatever you describe. It checks each one against your rules. When it finds an input that breaks a rule, it does not just print a random failing case. It shrinks the input to the smallest example it can find. You get a minimal reproducing case, and debugging gets much easier.
Four ideas sit at the core of Hypothesis. Strategies like st.integers(), st.text(), and st.lists() make up random data. Properties are test functions wrapped with @given(). Shrinking cuts a failing input down to a small example. And the Hypothesis database (the .hypothesis/ folder) keeps every failing example it has found and replays them on later runs. Once a bug is found, it stays in your suite even after the random seed moves on.
Core Strategies and the @given Decorator
Strategies are the building blocks of Hypothesis. Each one describes a space of possible values. Hypothesis samples from that space while the test runs.
Basic Strategies
The most common built-in strategies cover Python’s primitive types:
from hypothesis import given, strategies as st

# Integers with optional bounds
@given(st.integers())
def test_abs_non_negative(n):
    assert abs(n) >= 0

# Floats, often excluding NaN for numeric properties
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_float_round_trip(x):
    assert float(str(x)) == x or x != x

# Text with constraints on alphabet and length
@given(st.text(min_size=1, alphabet=st.characters(categories=("L", "N"))))
def test_non_empty_string(s):
    assert len(s) > 0

# Booleans, binary data, None
@given(st.booleans())
def test_bool_is_int(b):
    assert isinstance(b, int)

Collection Strategies
Collections compose basic strategies into more complex structures:
# Lists of integers, bounded in size
@given(st.lists(st.integers(), min_size=0, max_size=100))
def test_sort_preserves_length(lst):
    assert len(sorted(lst)) == len(lst)

# Dictionaries with string keys and integer values
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_dict_keys(d):
    for key in d:
        assert isinstance(key, str)

# Tuples with mixed types
@given(st.tuples(st.integers(), st.text()))
def test_tuple_unpacking(t):
    n, s = t
    assert isinstance(n, int)
    assert isinstance(s, str)

Composing and Transforming Strategies
You can combine strategies in several ways:
- st.one_of(st.integers(), st.text()) draws from either strategy (union types)
- strategy.map(func) transforms drawn values. For example, st.integers().map(abs) gives only zero or positive ints
- strategy.filter(predicate) trims values. Use it sparingly. Filters that reject too much raise Unsatisfied errors
- st.builds(MyClass, name=st.text(), age=st.integers(min_value=0)) builds class instances by passing strategies as args (combined sketch below)
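A short sketch combining these combinators. The User dataclass is a hypothetical stand-in, not part of Hypothesis:

from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class User:  # hypothetical domain class for illustration
    name: str
    age: int

# one_of: either an int or a string
int_or_text = st.one_of(st.integers(), st.text())

# map: non-negative ints without any filtering
non_negative = st.integers().map(abs)

# filter: use sparingly; this one rejects only a single value
non_zero = st.integers().filter(lambda x: x != 0)

# builds: construct User instances from per-field strategies
users = st.builds(User, name=st.text(min_size=1),
                  age=st.integers(min_value=0, max_value=150))

@given(users)
def test_user_age_bounds(user):
    assert 0 <= user.age <= 150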
Controlling Test Volume
The @settings decorator controls how many inputs Hypothesis tries:
from hypothesis import given, settings, strategies as st

@given(st.lists(st.integers()))
@settings(max_examples=500)
def test_sort_idempotent(lst):
    assert sorted(sorted(lst)) == sorted(lst)

The default is 100 examples per test. You can pin key inputs alongside the random ones with @example:
from collections import Counter
from hypothesis import given, example, strategies as st

@given(st.lists(st.integers()))
@example([])  # always test empty list
@example([0, 0, 0])  # always test duplicates
def test_sort_preserves_elements(lst):
    result = sorted(lst)
    assert Counter(result) == Counter(lst)

Writing Effective Properties
Picking which rules to assert is where most people get stuck. The tools are simple. The hard part is turning “my code should be correct” into a concrete, testable rule. The patterns below cover the most common cases.
Round-Trip (Inverse) Property
If you have an encode/decode pair, the round-trip property is the most natural fit:
import json
from hypothesis import given, strategies as st
@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data):
    assert json.loads(json.dumps(data)) == data

This pattern fits serializers like JSON, msgpack, and protobuf. It fits encoders like base64 and URL encoding. It fits compression and crypto. Any two functions that undo each other are a fit.
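The same shape works for binary encoders. A sketch for base64 over arbitrary bytes:

import base64
from hypothesis import given, strategies as st

@given(st.binary())
def test_base64_round_trip(data):
    # encode then decode must return the original bytes
    assert base64.b64decode(base64.b64encode(data)) == data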
Idempotence Property
A function is idempotent if applying it twice gives the same result as applying it once:
@given(st.text())
def test_strip_idempotent(s):
    assert s.strip().strip() == s.strip()

This works for formatting, Unicode normalization (NFC), dedup, URL clean-up, and HTML sanitization. If f(f(x)) == f(x) doesn’t hold, you have a bug.
Invariant Preservation
After an operation, some structural properties must still hold:
@given(st.lists(st.integers(), min_size=1))
def test_sort_ordered(lst):
    result = sorted(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

More cases: after a push into a balanced BST, the tree stays balanced. After a database migration, row counts are preserved. After adding an item to a cart, the total stays at zero or above.
Oracle (Differential) Testing
Compare your code against a known-correct reference:
import statistics
from hypothesis import given, strategies as st
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_custom_median(data):
    assert my_fast_median(data) == statistics.median(data)

Use this to check fast code against slower but trusted reference code. Use it to test C extensions against their pure-Python fallbacks.
Commutativity and Associativity
For operations that should be order-independent:
from hypothesis import given, strategies as st
@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a

Handy for custom numeric types, merge ops, and set-like data structures.
Using assume() for Preconditions
When inputs must meet a condition, use assume() to skip the bad ones:
from hypothesis import given, assume, strategies as st
@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)
    assert (a // b) * b + (a % b) == a

Prefer to constrain the strategy itself when you can (e.g., st.integers().filter(lambda x: x != 0) or st.integers(min_value=1)). assume() throws inputs out after they are drawn, which is wasteful.
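For comparison, the same property with the zero divisor excluded in the strategy itself, so no draws are thrown away:

from hypothesis import given, strategies as st

@given(st.integers(), st.integers().filter(lambda x: x != 0))
def test_division_constrained(a, b):
    # b is never zero here, so no assume() call is needed
    assert (a // b) * b + (a % b) == a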
Custom Strategies with @st.composite
Some domain data can’t be built from basic strategy combinators. For those, the @st.composite decorator lets you write plain Python code that draws from other strategies:
from hypothesis import strategies as st
@st.composite
def ordered_pairs(draw):
    """Generate a tuple (a, b) where a <= b."""
    a = draw(st.integers())
    b = draw(st.integers(min_value=a))
    return (a, b)

@st.composite
def user_records(draw, min_age=0, max_age=150):
    """Generate realistic user records."""
    name = draw(st.text(min_size=1, max_size=50,
                        alphabet=st.characters(categories=("L",))))
    age = draw(st.integers(min_value=min_age, max_value=max_age))
    email = draw(st.emails())
    return {"name": name, "age": age, "email": email}

The draw function pulls a value from any strategy. You can use if-statements, loops, and ties between drawn values. This is how you build strategies for complex domain objects: database rows, API payloads, graph structures.
Shrinking, Reproducing, and Debugging Failures
When Hypothesis finds a failing input, it goes one step further. It shrinks that input to the smallest counterexample it can find.
How Shrinking Works
Say st.lists(st.integers()) draws [483, -29, 0, 17, -6, 92] and the test fails. Hypothesis then tries smaller lists and simpler integers. It keeps going until it lands on something like [0, -1] that still triggers the bug. The shrink is deterministic and cached. Running the test again replays the exact minimal failing input from .hypothesis/examples/ before any new random inputs.
The payoff is bigger than it sounds. A raw failing input of [483, -29, 0, 17, -6, 92] tells you almost nothing about the root cause. A shrunk input of [0, -1] points straight at zero or negative numbers. That is a very different debugging experience.
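A sketch of shrinking in action; bad_abs is a deliberately buggy helper invented for this example:

from hypothesis import given, strategies as st

def bad_abs(n):
    # hypothetical bug: handles large negatives but forgets small ones
    return -n if n <= -10 else n

@given(st.integers())
def test_bad_abs_non_negative(n):
    assert bad_abs(n) >= 0

Any n between -9 and -1 fails this test. Whatever failing value Hypothesis draws first, it shrinks toward zero and reports n=-1, the simplest counterexample.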
Reproducing Failures
Hypothesis provides several mechanisms for reproduction:
- The .hypothesis/ database folder keeps every failing example it has found and replays them on its own
- The @reproduce_failure(...) decorator is printed in the error output. Paste it into the test file to pin the exact failing input
- @settings(verbosity=Verbosity.verbose) shows every input Hypothesis tries, both while drawing and while shrinking (sketch below)
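A minimal sketch of the verbosity setting on an ordinary property test:

from hypothesis import Verbosity, given, settings, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.lists(st.integers()))
def test_sort_verbose(lst):
    # every drawn list is printed, then every shrink attempt on failure
    assert sorted(sorted(lst)) == sorted(lst)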
Stateful Testing with RuleBasedStateMachine
Some systems hold state: APIs, databases, data structures with many ops. For those, Hypothesis ships RuleBasedStateMachine. Instead of testing one function at a time, you define each op as a rule. Hypothesis then runs random sequences of those rules:
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st
class StackMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.model = []  # reference implementation

    @rule(value=st.integers())
    def push(self, value):
        self.stack.append(value)
        self.model.append(value)

    @rule()
    def pop(self):
        if self.model:
            expected = self.model.pop()
            actual = self.stack.pop()
            assert actual == expected

    @invariant()
    def lengths_match(self):
        assert len(self.stack) == len(self.model)

TestStack = StackMachine.TestCase

Hypothesis runs random sequences of push and pop calls. When a sequence fails, it shrinks both the inputs and the call order. You get a minimal reproducing trace. This catches bugs that only show up after a specific sequence of calls. That is the kind of bug you can almost never cover by hand.
Integrating Hypothesis into a Real Project
Hypothesis pays off most when it runs on every commit. The usual setup uses profiles: a light one for local dev and a heavy one for CI.
Setting Up Profiles
Create a conftest.py with one profile per environment:
# conftest.py
import os
from hypothesis import settings, HealthCheck, Verbosity

settings.register_profile(
    "ci",
    max_examples=1000,
    deadline=None,  # CI runners are slow, avoid spurious timeouts
)
settings.register_profile(
    "dev",
    max_examples=50,
    deadline=200,  # fast feedback during development
)
settings.register_profile(
    "debug",
    max_examples=10,
    verbosity=Verbosity.verbose,
)

# Activate whichever profile the environment names, defaulting to dev
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

Select the profile via environment variable: HYPOTHESIS_PROFILE=ci pytest.
CI Configuration
In GitHub Actions or GitLab CI:
- Run with the ci profile and deadline=None so slow runners don’t trigger spurious fails
- Cache the .hypothesis/ folder as a CI artifact so found bugs stick around across runs
- Hypothesis is safe to run in parallel with pytest-xdist (-n auto). Each worker gets its own random seed
# .github/workflows/test.yml (relevant excerpt)
- name: Run tests
  env:
    HYPOTHESIS_PROFILE: ci
  run: pytest -n auto --hypothesis-seed=0
- uses: actions/cache@v4
  with:
    path: .hypothesis
    key: hypothesis-${{ runner.os }}-${{ hashFiles('tests/**') }}

Auto-Generating Test Stubs
The hypothesis write CLI builds property-based test stubs from type hints:
hypothesis write mymodule.my_function

It reads the function’s type hints and spits out a test that uses guessed strategies. Handy for bootstrapping property tests across an old codebase without writing every strategy by hand.
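The CLI also has modes for common property shapes. A couple of hedged examples; mymodule.normalize is a hypothetical target:

# round-trip test for an encode/decode pair
hypothesis write --roundtrip json.dumps json.loads > tests/test_json_props.py

# idempotence test for a normalizer
hypothesis write --idempotent mymodule.normalize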
Health Checks and Performance
Hypothesis ships health checks that warn you when a strategy is too slow or filters too much. Turn them off only when you truly need to:
@settings(suppress_health_check=[HealthCheck.too_slow])
@given(expensive_strategy())
def test_with_slow_setup(data):
    ...

For coverage reports, pair Hypothesis with pytest-cov. pytest --cov --hypothesis-seed=0 pins the seed so coverage numbers stay stable. In practice, property tests tend to lift branch coverage by 10-20% over example tests alone. They reach input combos you would never write by hand.
Property-Based Testing Across Languages
Hypothesis is the leading property-based testing library for Python. The idea started with QuickCheck in Haskell and has since spread to most major languages:
| Language | Library | Key Difference |
|---|---|---|
| Haskell | QuickCheck | The original. Generates and shrinks based on type alone. |
| Python | Hypothesis | Strategy-based generation with integrated shrinking and database. |
| JavaScript/TypeScript | fast-check | Inspired by QuickCheck and Hypothesis, strong TypeScript support. |
| Rust | proptest | Inspired by Hypothesis. Strategies are aware of constraints, avoiding rejected inputs. |
QuickCheck ties draws and shrinks to each type. That is simpler but less flexible. Hypothesis and proptest tie them to each strategy. You can bake limits into the strategy rather than filter after the draw. fast-check uses a similar per-strategy model with strong TypeScript support.
If you work in many languages, the mental model carries over. Round-trip tests, idempotence checks, and oracle tests work the same way in every library. Only the syntax and the strategy names change.
When Property-Based Testing Fits and When It Does Not
Property-based testing fits best when a function has clear mathematical rules, a well-defined input domain, and deterministic output for a given input. Pure functions, data transforms, parsers, serializers, and algorithms are prime fits.
It is less useful for code that is mostly side effects, like sending emails or writing to a database. It is also weak when the “right” output is hard to pin down without rebuilding the function. The same goes for GUI flows. For those, plain example-based tests or integration tests work better.
A good rule: start with property-based tests for your core logic. That is, the code that transforms data, validates inputs, or runs algorithms. Keep example-based tests for end-to-end flows. Even a handful of @given tests bolted onto a mature codebase tends to find bugs that years of example-based testing missed.