How to Use Property-Based Testing in Python with Hypothesis

Property-based testing with Hypothesis lets you define the properties your code must satisfy - such as “encoding then decoding always returns the original input” - and then automatically generates hundreds or thousands of randomized test inputs to find counterexamples. Instead of writing individual test cases by hand, you describe the shape of valid inputs and let the framework discover the off-by-one errors, Unicode edge cases, and boundary conditions hiding in your code.
Install it with `pip install hypothesis`, write strategies to describe your input space, decorate your test function with `@given()`, and let the fuzzer do the work. Hypothesis (currently at version 6.151.x as of early 2026) integrates natively with pytest and unittest, runs on any CI system, and has found bugs in CPython’s standard library, NumPy, and dozens of other major open-source projects.
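That whole workflow fits in a handful of lines. A minimal first property test might look like the sketch below; the reversal property is just an illustration:

```python
from hypothesis import given, strategies as st

# Property: reversing a string twice returns the original string
@given(st.text())
def test_reverse_twice_is_identity(s):
    assert s[::-1][::-1] == s
```

Running `pytest` on a file containing this makes Hypothesis generate 100 strings by default - including empty strings and awkward Unicode - and check the assertion against each one.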
Why Example-Based Tests Leave Gaps
A typical unit test for a sorting function might check that `sort([3, 1, 2])` returns `[1, 2, 3]` and `sort([])` returns `[]`. That covers two cases. But what about negative numbers? Duplicates? Very large lists? Lists containing `float('nan')`? Lists with a single element? These are exactly the inputs that cause real-world bugs, and they are exactly the inputs most developers forget to test.
The fundamental problem with example-based testing is selection bias. You pick inputs that match your mental model of how the code works. If you wrote the code, you already have blind spots - the same blind spots that produced the bug in the first place.
Property-based testing inverts the approach. Instead of specifying input-output pairs, you specify invariants that must hold for all valid inputs. For a sort function, those invariants might be:
- The output has the same length as the input
- Every element at index `i` is less than or equal to the element at index `i+1`
- The output contains exactly the same elements as the input (same multiset)
Hypothesis then generates random inputs - integers, strings, nested data structures, whatever you describe - and checks whether those invariants hold. When it finds an input that violates a property, it does not just report a random failing case. It shrinks the input to the smallest possible counterexample, giving you a minimal reproducing case that makes debugging straightforward.
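The three sort invariants above translate directly into a single test. Here is a sketch using Python’s built-in `sorted` as the function under test:

```python
from collections import Counter
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_invariants(lst):
    result = sorted(lst)
    # Same length as the input
    assert len(result) == len(lst)
    # Each element is <= its successor
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    # Same multiset of elements as the input
    assert Counter(result) == Counter(lst)
```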
Four concepts make up the core of Hypothesis. Strategies like `st.integers()`, `st.text()`, and `st.lists()` generate random data. Properties are test functions decorated with `@given()`. Shrinking automatically reduces a failing input to the minimal reproducing example. And the Hypothesis database (the `.hypothesis/` directory) stores previously found failing examples and replays them in future test runs, so once a bug is found it stays in your regression suite even after the randomness moves on.
Core Strategies and the @given Decorator
Strategies are the building blocks of Hypothesis. Each one describes a space of possible values, and Hypothesis samples from that space during test execution.
Basic Strategies
The most common built-in strategies cover Python’s primitive types:
```python
from hypothesis import given, strategies as st

# Integers with optional bounds
@given(st.integers())
def test_abs_non_negative(n):
    assert abs(n) >= 0

# Floats, excluding NaN and infinity for numeric properties
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_float_round_trip(x):
    assert float(str(x)) == x

# Text with constraints on alphabet and length
@given(st.text(min_size=1, alphabet=st.characters(categories=("L", "N"))))
def test_non_empty_string(s):
    assert len(s) > 0

# Booleans (st.binary() and st.none() work similarly)
@given(st.booleans())
def test_bool_is_int(b):
    assert isinstance(b, int)
```
Collection Strategies
Collections compose basic strategies into more complex structures:
```python
from hypothesis import given, strategies as st

# Lists of integers, bounded in size
@given(st.lists(st.integers(), min_size=0, max_size=100))
def test_sort_preserves_length(lst):
    assert len(sorted(lst)) == len(lst)

# Dictionaries with string keys and integer values
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_dict_keys(d):
    for key in d:
        assert isinstance(key, str)

# Tuples with mixed types
@given(st.tuples(st.integers(), st.text()))
def test_tuple_unpacking(t):
    n, s = t
    assert isinstance(n, int)
    assert isinstance(s, str)
```
Composing and Transforming Strategies
You can combine strategies in several ways:
- `st.one_of(st.integers(), st.text())` generates values from either strategy (union types)
- `strategy.map(func)` transforms generated values - for example, `st.integers().map(abs)` generates non-negative integers
- `strategy.filter(predicate)` constrains values, though use this sparingly since filters that reject too many values raise `Unsatisfied` errors
- `st.builds(MyClass, name=st.text(), age=st.integers(min_value=0))` constructs class instances by mapping strategies to constructor parameters
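To make these combinators concrete, here is a small sketch; the `User` class is invented for this example:

```python
from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class User:  # hypothetical example class
    name: str
    age: int

# map: turn arbitrary integers into even integers
evens = st.integers().map(lambda n: n * 2)

# one_of: values drawn from either strategy
int_or_text = st.one_of(st.integers(), st.text())

# builds: construct User instances field by field
users = st.builds(User, name=st.text(min_size=1),
                  age=st.integers(min_value=0, max_value=120))

@given(evens, int_or_text, users)
def test_combinators(n, v, user):
    assert n % 2 == 0
    assert isinstance(v, (int, str))
    assert 0 <= user.age <= 120
```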
Controlling Test Volume
The @settings decorator controls how many inputs Hypothesis tries:
```python
from hypothesis import given, settings, strategies as st

@given(st.lists(st.integers()))
@settings(max_examples=500)
def test_sort_idempotent(lst):
    assert sorted(sorted(lst)) == sorted(lst)
```

The default is 100 examples per test. You can pin specific important inputs alongside the generated ones using `@example`:
```python
from hypothesis import given, example, strategies as st

@given(st.lists(st.integers()))
@example([])         # always test the empty list
@example([0, 0, 0])  # always test duplicates
def test_sort_preserves_elements(lst):
    result = sorted(lst)
    assert sorted(result) == result
```
Writing Effective Properties
Choosing which properties to assert is where most people get stuck. The tooling is straightforward, but translating “my code should be correct” into a concrete, testable invariant takes practice. These patterns cover the most common scenarios.
Round-Trip (Inverse) Property
If you have an encode/decode pair, the round-trip property is the most natural fit:
```python
import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data):
    assert json.loads(json.dumps(data)) == data
```

This pattern works for serializers (JSON, msgpack, protobuf), encoders (base64, URL encoding), compression, and encryption/decryption. Any pair of functions where one is the inverse of the other is a candidate.
Idempotence Property
A function is idempotent if applying it twice gives the same result as applying it once:
```python
from hypothesis import given, strategies as st

@given(st.text())
def test_strip_idempotent(s):
    assert s.strip().strip() == s.strip()
```

This applies to formatting functions, Unicode normalization (NFC), deduplication, URL canonicalization, and HTML sanitization. If `f(f(x)) == f(x)` does not hold, you have a bug.
Invariant Preservation
After an operation, certain structural properties must still hold:
```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers(), min_size=1))
def test_sort_ordered(lst):
    result = sorted(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]
```

Other examples: after inserting into a balanced BST, the tree is still balanced. After a database migration, row counts are preserved. After adding an item to a cart, the total is non-negative.
Oracle (Differential) Testing
Compare your implementation against a known-correct reference:
```python
import statistics
from hypothesis import given, strategies as st

# my_fast_median stands in for the implementation under test
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_custom_median(data):
    assert my_fast_median(data) == statistics.median(data)
```

You can use this to validate optimized implementations against slower but trusted reference code, or to test C extensions against pure-Python fallbacks.
Commutativity and Associativity
For operations that should be order-independent:
```python
from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a
```

Useful for custom numeric types, merge operations, and set-like data structures.
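The same shape of test applies beyond plain integers. For example, set union should also be commutative; a small sketch:

```python
from hypothesis import given, strategies as st

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_commutative(a, b):
    # The union of two sets should not depend on operand order
    assert a | b == b | a
```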
Using assume() for Preconditions
When inputs must satisfy certain conditions, use assume() to skip invalid ones:
```python
from hypothesis import given, assume, strategies as st

@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)
    assert (a // b) * b + (a % b) == a
```

Prefer constraining the strategy itself when possible (e.g., `st.integers().filter(lambda x: x != 0)` or `st.integers(min_value=1)`), since `assume` discards inputs after generation.
Custom Strategies with @st.composite
For domain-specific data that cannot be expressed with basic strategy combinators, the @st.composite decorator lets you write arbitrary Python code that draws from other strategies:
```python
from hypothesis import strategies as st

@st.composite
def ordered_pairs(draw):
    """Generate a tuple (a, b) where a <= b."""
    a = draw(st.integers())
    b = draw(st.integers(min_value=a))
    return (a, b)

@st.composite
def user_records(draw, min_age=0, max_age=150):
    """Generate realistic user records."""
    name = draw(st.text(min_size=1, max_size=50,
                        alphabet=st.characters(categories=("L",))))
    age = draw(st.integers(min_value=min_age, max_value=max_age))
    email = draw(st.emails())
    return {"name": name, "age": age, "email": email}
```

The `draw` function pulls a value from any strategy, and you can use conditional logic, loops, and data dependencies between drawn values. This is how you build strategies for complex domain objects like database rows, API request payloads, or graph structures.
Shrinking, Reproducing, and Debugging Failures
When Hypothesis finds a failing input, it goes a step further and systematically shrinks that input to the smallest possible counterexample.
How Shrinking Works
If `st.lists(st.integers())` generates `[483, -29, 0, 17, -6, 92]` as a failing case, Hypothesis will try progressively smaller lists and simpler integers until it finds something like `[0, -1]` that still triggers the bug. The shrink is deterministic and cached - running the test again replays the exact minimal failing input from `.hypothesis/examples/` before generating new random inputs.
This matters more than it sounds. A raw failing input of `[483, -29, 0, 17, -6, 92]` tells you almost nothing about the root cause. A shrunk input of `[0, -1]` immediately suggests the bug involves zero or negative numbers - that is a completely different debugging experience.
Reproducing Failures
Hypothesis provides several mechanisms for reproduction:
- The `.hypothesis/` database directory stores all discovered failing examples and replays them automatically
- The `@reproduce_failure(...)` decorator is printed in the error output and can be pasted directly into the test file to pin the exact failing input
- `@settings(verbosity=Verbosity.verbose)` shows every input Hypothesis tries during both generation and shrinking
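Turning on verbose output for a single test is just a settings decorator; a small sketch:

```python
from hypothesis import given, settings, Verbosity, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.integers())
def test_with_verbose_output(n):
    # With verbose output, each generated input is printed as it is tried
    assert isinstance(n, int)
```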
Stateful Testing with RuleBasedStateMachine
For testing stateful systems - APIs, databases, data structures with multiple operations - Hypothesis provides RuleBasedStateMachine. Instead of testing individual functions, you define a set of operations (rules) and Hypothesis generates sequences of those operations:
```python
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st

class StackMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.model = []  # reference implementation

    @rule(value=st.integers())
    def push(self, value):
        self.stack.append(value)
        self.model.append(value)

    @rule()
    def pop(self):
        if self.model:
            expected = self.model.pop()
            actual = self.stack.pop()
            assert actual == expected

    @invariant()
    def lengths_match(self):
        assert len(self.stack) == len(self.model)

TestStack = StackMachine.TestCase
```

Hypothesis generates random sequences of push and pop calls and, when it finds a failing sequence, shrinks both the individual inputs and the operation order to produce a minimal reproducing trace. This catches bugs that only surface after specific sequences of operations - the kind of thing that is nearly impossible to cover with handwritten tests.
Integrating Hypothesis into a Real Project
Hypothesis pays off most when it runs automatically on every commit. The typical setup uses profiles: a lightweight one for local development and a thorough one for CI.
Setting Up Profiles
Create a conftest.py with environment-specific profiles:
```python
# conftest.py
from hypothesis import settings, Verbosity

settings.register_profile(
    "ci",
    max_examples=1000,
    deadline=None,  # CI runners are slow; avoid spurious timeouts
)
settings.register_profile(
    "dev",
    max_examples=50,
    deadline=200,  # fast feedback during development
)
settings.register_profile(
    "debug",
    max_examples=10,
    verbosity=Verbosity.verbose,
)
```

Select the profile via environment variable: `HYPOTHESIS_PROFILE=ci pytest`.
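If you prefer to select the profile explicitly in code rather than rely on the pytest plugin’s environment-variable handling, a common pattern is an explicit `settings.load_profile` call in `conftest.py`. A self-contained sketch (the profiles are re-registered here so the snippet runs standalone):

```python
import os
from hypothesis import settings

settings.register_profile("dev", max_examples=50)
settings.register_profile("ci", max_examples=1000, deadline=None)

# Fall back to the fast "dev" profile when no profile is requested
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```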
CI Configuration
In GitHub Actions or GitLab CI:
- Run with the `ci` profile and `deadline=None` to avoid failures from slow runners
- Cache the `.hypothesis/` directory as a CI artifact so discovered bugs persist across runs
- Hypothesis is safe to run in parallel with `pytest-xdist` (`-n auto`) since each worker gets an independent random seed
```yaml
# .github/workflows/test.yml (relevant excerpt)
- name: Run tests
  env:
    HYPOTHESIS_PROFILE: ci
  run: pytest -n auto --hypothesis-seed=0
- uses: actions/cache@v4
  with:
    path: .hypothesis
    key: hypothesis-${{ runner.os }}-${{ hashFiles('tests/**') }}
```
Auto-Generating Test Stubs
The `hypothesis write` CLI tool generates property-based test stubs from type annotations:

```shell
hypothesis write mymodule.my_function
```

This inspects the function’s type hints and produces a ready-to-run test using inferred strategies - useful for bootstrapping property-based tests across an existing codebase without writing every strategy by hand.
Health Checks and Performance
Hypothesis includes health checks that warn you when a strategy is too slow or filters too aggressively. Suppress them only when genuinely needed:
```python
from hypothesis import HealthCheck, given, settings

# expensive_strategy is a placeholder for your own slow strategy
@settings(suppress_health_check=[HealthCheck.too_slow])
@given(expensive_strategy())
def test_with_slow_setup(data):
    ...
```

For coverage reporting, combine with pytest-cov: `pytest --cov --hypothesis-seed=0` pins the seed for reproducible coverage numbers. In practice, property-based tests tend to increase branch coverage by 10-20% over example-based tests alone, because they explore input combinations you would never write by hand.
Property-Based Testing Across Languages
Hypothesis is the dominant library for Python, but the idea of property-based testing originated with QuickCheck in Haskell and has spread to most major languages:
| Language | Library | Key Difference |
|---|---|---|
| Haskell | QuickCheck | The original. Generates and shrinks based on type alone. |
| Python | Hypothesis | Strategy-based generation with integrated shrinking and database. |
| JavaScript/TypeScript | fast-check | Inspired by QuickCheck and Hypothesis, strong TypeScript support. |
| Rust | proptest | Inspired by Hypothesis. Strategies are aware of constraints, avoiding rejected inputs. |
QuickCheck defines generation and shrinking per-type, which is simpler but less flexible. Hypothesis and proptest define them per-strategy, which means you can express constraints directly in the generator rather than filtering after the fact. fast-check follows a similar strategy-based model with good TypeScript integration.
If you work across multiple languages, the mental model carries over. Round-trip testing, idempotence checks, and oracle testing work the same way regardless of the library. Only the syntax and strategy names differ.
When Property-Based Testing Fits and When It Does Not
Property-based testing works best for functions with clear mathematical properties, well-defined input domains, and deterministic behavior. Pure functions, data transformations, parsers, serializers, and algorithms are ideal candidates.
It is less useful for code that is primarily about side effects (sending emails, writing to databases), code where the “correct” output is hard to specify without reimplementing the function, or GUI interactions. For these, traditional example-based tests or integration tests are a better fit.
A practical approach is to start with property-based tests for your core logic - the functions that transform data, validate inputs, or implement algorithms - and use example-based tests for integration and end-to-end scenarios. Even adding a handful of @given tests to a mature codebase tends to uncover bugs that years of example-based testing missed.