How to Use Property-Based Testing in Python with Hypothesis

Property-based testing with Hypothesis lets you define the properties your code must satisfy - such as “encoding then decoding always returns the original input” - and then automatically generates hundreds or thousands of randomized test inputs to find counterexamples. Instead of writing individual test cases by hand, you describe the shape of valid inputs and let the framework discover the off-by-one errors, Unicode edge cases, and boundary conditions hiding in your code.
Install it with `pip install hypothesis`, write strategies to describe your input space, decorate your test function with `@given()`, and let the fuzzer do the work. Hypothesis (currently at version 6.151.x as of early 2026) integrates natively with pytest and unittest, runs on any CI system, and has found bugs in CPython’s standard library, NumPy, and dozens of other major open-source projects.
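That whole workflow fits in a handful of lines. A minimal first property test might look like the sketch below; the reversal property is just an illustration:

```python
from hypothesis import given, strategies as st

# Property: reversing a string twice returns the original string
@given(st.text())
def test_reverse_twice_is_identity(s):
    assert s[::-1][::-1] == s
```

Running `pytest` on a file containing this makes Hypothesis generate 100 strings by default - including empty strings and awkward Unicode - and check the assertion against each one.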
Why Example-Based Tests Leave Gaps
A typical unit test for a sorting function might check that `sort([3, 1, 2])` returns `[1, 2, 3]` and `sort([])` returns `[]`. That covers two cases. But what about negative numbers? Duplicates? Very large lists? Lists containing `float('nan')`? Lists with a single element? These are exactly the inputs that cause real-world bugs, and they are exactly the inputs most developers forget to test.
The fundamental problem with example-based testing is selection bias. You pick inputs that match your mental model of how the code works. If you wrote the code, you already have blind spots - the same blind spots that produced the bug in the first place.
Property-based testing inverts the approach. Instead of specifying input-output pairs, you specify invariants that must hold for all valid inputs. For a sort function, those invariants might be:
- The output has the same length as the input
- Every element at index `i` is less than or equal to the element at index `i+1`
- The output contains exactly the same elements as the input (same multiset)
Hypothesis then generates random inputs - integers, strings, nested data structures, whatever you describe - and checks whether those invariants hold. When it finds an input that violates a property, it does not just report a random failing case. It shrinks the input to the smallest possible counterexample, giving you a minimal reproducing case that makes debugging straightforward.
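The three sort invariants above translate directly into a single test. Here is a sketch using Python’s built-in `sorted` as the function under test:

```python
from collections import Counter
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_invariants(lst):
    result = sorted(lst)
    # Same length as the input
    assert len(result) == len(lst)
    # Each element is <= its successor
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    # Same multiset of elements as the input
    assert Counter(result) == Counter(lst)
```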
Four concepts make up the core of Hypothesis. Strategies like `st.integers()`, `st.text()`, and `st.lists()` generate random data. Properties are test functions decorated with `@given()`. Shrinking automatically reduces a failing input to the minimal reproducing example. And the Hypothesis database (the `.hypothesis/` directory) stores previously found failing examples and replays them in future test runs, so once a bug is found it stays in your regression suite even after the randomness moves on.
Core Strategies and the @given Decorator
Strategies are the building blocks of Hypothesis. Each one describes a space of possible values, and Hypothesis samples from that space during test execution.
Basic Strategies
The most common built-in strategies cover Python’s primitive types:
```python
from hypothesis import given, strategies as st

# Integers with optional bounds
@given(st.integers())
def test_abs_non_negative(n):
    assert abs(n) >= 0

# Floats, excluding NaN and infinity for numeric properties
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_float_round_trip(x):
    assert float(str(x)) == x

# Text with constraints on alphabet and length
@given(st.text(min_size=1, alphabet=st.characters(categories=("L", "N"))))
def test_non_empty_string(s):
    assert len(s) > 0

# Booleans (st.binary() and st.none() work similarly)
@given(st.booleans())
def test_bool_is_int(b):
    assert isinstance(b, int)
```
Collection Strategies
Collections compose basic strategies into more complex structures:
```python
from hypothesis import given, strategies as st

# Lists of integers, bounded in size
@given(st.lists(st.integers(), min_size=0, max_size=100))
def test_sort_preserves_length(lst):
    assert len(sorted(lst)) == len(lst)

# Dictionaries with string keys and integer values
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_dict_keys(d):
    for key in d:
        assert isinstance(key, str)

# Tuples with mixed types
@given(st.tuples(st.integers(), st.text()))
def test_tuple_unpacking(t):
    n, s = t
    assert isinstance(n, int)
    assert isinstance(s, str)
```
Composing and Transforming Strategies
You can combine strategies in several ways:
- `st.one_of(st.integers(), st.text())` generates values from either strategy (union types)
- `strategy.map(func)` transforms generated values - for example, `st.integers().map(abs)` generates non-negative integers
- `strategy.filter(predicate)` constrains values, though use this sparingly since filters that reject too many values raise `Unsatisfied` errors
- `st.builds(MyClass, name=st.text(), age=st.integers(min_value=0))` constructs class instances by mapping strategies to constructor parameters
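To make these combinators concrete, here is a small sketch; the `User` class is invented for this example:

```python
from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class User:  # hypothetical example class
    name: str
    age: int

# map: turn arbitrary integers into even integers
evens = st.integers().map(lambda n: n * 2)

# one_of: values drawn from either strategy
int_or_text = st.one_of(st.integers(), st.text())

# builds: construct User instances field by field
users = st.builds(User, name=st.text(min_size=1),
                  age=st.integers(min_value=0, max_value=120))

@given(evens, int_or_text, users)
def test_combinators(n, v, user):
    assert n % 2 == 0
    assert isinstance(v, (int, str))
    assert 0 <= user.age <= 120
```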
Controlling Test Volume
The @settings decorator controls how many inputs Hypothesis tries:
```python
from hypothesis import given, settings, strategies as st

@given(st.lists(st.integers()))
@settings(max_examples=500)
def test_sort_idempotent(lst):
    assert sorted(sorted(lst)) == sorted(lst)
```

The default is 100 examples per test. You can pin specific important inputs alongside the generated ones using `@example`:
```python
from hypothesis import given, example, strategies as st

@given(st.lists(st.integers()))
@example([])         # always test the empty list
@example([0, 0, 0])  # always test duplicates
def test_sort_preserves_elements(lst):
    result = sorted(lst)
    assert sorted(result) == result
```
Writing Effective Properties
Choosing which properties to assert is where most people get stuck. The tooling is straightforward, but translating “my code should be correct” into a concrete, testable invariant takes practice. These patterns cover the most common scenarios.
Round-Trip (Inverse) Property
If you have an encode/decode pair, the round-trip property is the most natural fit:
```python
import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data):
    assert json.loads(json.dumps(data)) == data
```

This pattern works for serializers (JSON, msgpack, protobuf), encoders (base64, URL encoding), compression, and encryption/decryption. Any pair of functions where one is the inverse of the other is a candidate.
Idempotence Property
A function is idempotent if applying it twice gives the same result as applying it once:
```python
from hypothesis import given, strategies as st

@given(st.text())
def test_strip_idempotent(s):
    assert s.strip().strip() == s.strip()
```

This applies to formatting functions, Unicode normalization (NFC), deduplication, URL canonicalization, and HTML sanitization. If `f(f(x)) == f(x)` does not hold, you have a bug.
Invariant Preservation
After an operation, certain structural properties must still hold:
```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers(), min_size=1))
def test_sort_ordered(lst):
    result = sorted(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]
```

Other examples: after inserting into a balanced BST, the tree is still balanced. After a database migration, row counts are preserved. After adding an item to a cart, the total is non-negative.
Oracle (Differential) Testing
Compare your implementation against a known-correct reference:
```python
import statistics
from hypothesis import given, strategies as st

# my_fast_median stands in for the implementation under test
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_custom_median(data):
    assert my_fast_median(data) == statistics.median(data)
```

You can use this to validate optimized implementations against slower but trusted reference code, or to test C extensions against pure-Python fallbacks.
Commutativity and Associativity
For operations that should be order-independent:
```python
from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a
```

Useful for custom numeric types, merge operations, and set-like data structures.
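The same shape of test applies beyond plain integers. For example, set union should also be commutative; a small sketch:

```python
from hypothesis import given, strategies as st

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_commutative(a, b):
    # The union of two sets should not depend on operand order
    assert a | b == b | a
```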
Using assume() for Preconditions
When inputs must satisfy certain conditions, use assume() to skip invalid ones:
```python
from hypothesis import given, assume, strategies as st

@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)
    assert (a // b) * b + (a % b) == a
```

Prefer constraining the strategy itself when possible (e.g., `st.integers().filter(lambda x: x != 0)` or `st.integers(min_value=1)`), since `assume` discards inputs after generation.
Custom Strategies with @st.composite
For domain-specific data that cannot be expressed with basic strategy combinators, the @st.composite decorator lets you write arbitrary Python code that draws from other strategies:
```python
from hypothesis import strategies as st

@st.composite
def ordered_pairs(draw):
    """Generate a tuple (a, b) where a <= b."""
    a = draw(st.integers())
    b = draw(st.integers(min_value=a))
    return (a, b)

@st.composite
def user_records(draw, min_age=0, max_age=150):
    """Generate realistic user records."""
    name = draw(st.text(min_size=1, max_size=50,
                        alphabet=st.characters(categories=("L",))))
    age = draw(st.integers(min_value=min_age, max_value=max_age))
    email = draw(st.emails())
    return {"name": name, "age": age, "email": email}
```

The `draw` function pulls a value from any strategy, and you can use conditional logic, loops, and data dependencies between drawn values. This is how you build strategies for complex domain objects like database rows, API request payloads, or graph structures.
Shrinking, Reproducing, and Debugging Failures
When Hypothesis finds a failing input, it goes a step further and systematically shrinks that input to the smallest possible counterexample.
How Shrinking Works
If `st.lists(st.integers())` generates `[483, -29, 0, 17, -6, 92]` as a failing case, Hypothesis will try progressively smaller lists and simpler integers until it finds something like `[0, -1]` that still triggers the bug. The shrink is deterministic and cached - running the test again replays the exact minimal failing input from `.hypothesis/examples/` before generating new random inputs.
This matters more than it sounds. A raw failing input of `[483, -29, 0, 17, -6, 92]` tells you almost nothing about the root cause. A shrunk input of `[0, -1]` immediately suggests the bug involves zero or negative numbers - that is a completely different debugging experience.
Reproducing Failures
Hypothesis provides several mechanisms for reproduction:
- The `.hypothesis/` database directory stores all discovered failing examples and replays them automatically
- The `@reproduce_failure(...)` decorator is printed in the error output and can be pasted directly into the test file to pin the exact failing input
- `@settings(verbosity=Verbosity.verbose)` shows every input Hypothesis tries during both generation and shrinking
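Turning on verbose output for a single test is just a settings decorator; a small sketch:

```python
from hypothesis import given, settings, Verbosity, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.integers())
def test_with_verbose_output(n):
    # With verbose output, each generated input is printed as it is tried
    assert isinstance(n, int)
```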
Stateful Testing with RuleBasedStateMachine
For testing stateful systems - APIs, databases, data structures with multiple operations - Hypothesis provides RuleBasedStateMachine. Instead of testing individual functions, you define a set of operations (rules) and Hypothesis generates sequences of those operations:
```python
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st

class StackMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.model = []  # reference implementation

    @rule(value=st.integers())
    def push(self, value):
        self.stack.append(value)
        self.model.append(value)

    @rule()
    def pop(self):
        if self.model:
            expected = self.model.pop()
            actual = self.stack.pop()
            assert actual == expected

    @invariant()
    def lengths_match(self):
        assert len(self.stack) == len(self.model)

TestStack = StackMachine.TestCase
```

Hypothesis generates random sequences of push and pop calls and, when it finds a failing sequence, shrinks both the individual inputs and the operation order to produce a minimal reproducing trace. This catches bugs that only surface after specific sequences of operations - the kind of thing that is nearly impossible to cover with handwritten tests.
Integrating Hypothesis into a Real Project
Hypothesis pays off most when it runs automatically on every commit. The typical setup uses profiles: a lightweight one for local development and a thorough one for CI.
Setting Up Profiles
Create a conftest.py with environment-specific profiles:
```python
# conftest.py
from hypothesis import settings, Verbosity

settings.register_profile(
    "ci",
    max_examples=1000,
    deadline=None,  # CI runners are slow; avoid spurious timeouts
)
settings.register_profile(
    "dev",
    max_examples=50,
    deadline=200,  # fast feedback during development
)
settings.register_profile(
    "debug",
    max_examples=10,
    verbosity=Verbosity.verbose,
)
```

Select the profile via environment variable: `HYPOTHESIS_PROFILE=ci pytest`.
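If you prefer to select the profile explicitly in code rather than rely on the pytest plugin’s environment-variable handling, a common pattern is an explicit `settings.load_profile` call in `conftest.py`. A self-contained sketch (the profiles are re-registered here so the snippet runs standalone):

```python
import os
from hypothesis import settings

settings.register_profile("dev", max_examples=50)
settings.register_profile("ci", max_examples=1000, deadline=None)

# Fall back to the fast "dev" profile when no profile is requested
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```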
CI Configuration
In GitHub Actions or GitLab CI:
- Run with the `ci` profile and `deadline=None` to avoid failures from slow runners
- Cache the `.hypothesis/` directory as a CI artifact so discovered bugs persist across runs
- Hypothesis is safe to run in parallel with `pytest-xdist` (`-n auto`) since each worker gets an independent random seed
```yaml
# .github/workflows/test.yml (relevant excerpt)
- name: Run tests
  env:
    HYPOTHESIS_PROFILE: ci
  run: pytest -n auto --hypothesis-seed=0
- uses: actions/cache@v4
  with:
    path: .hypothesis
    key: hypothesis-${{ runner.os }}-${{ hashFiles('tests/**') }}
```
Auto-Generating Test Stubs
The `hypothesis write` CLI tool generates property-based test stubs from type annotations:

```shell
hypothesis write mymodule.my_function
```

This inspects the function’s type hints and produces a ready-to-run test using inferred strategies - useful for bootstrapping property-based tests across an existing codebase without writing every strategy by hand.
Health Checks and Performance
Hypothesis includes health checks that warn you when a strategy is too slow or filters too aggressively. Suppress them only when genuinely needed:
```python
from hypothesis import HealthCheck, given, settings

# expensive_strategy is a placeholder for your own slow strategy
@settings(suppress_health_check=[HealthCheck.too_slow])
@given(expensive_strategy())
def test_with_slow_setup(data):
    ...
```

For coverage reporting, combine with pytest-cov: `pytest --cov --hypothesis-seed=0` pins the seed for reproducible coverage numbers. In practice, property-based tests tend to increase branch coverage by 10-20% over example-based tests alone, because they explore input combinations you would never write by hand.
Property-Based Testing Across Languages
Hypothesis is the dominant library for Python, but the idea of property-based testing originated with QuickCheck in Haskell and has spread to most major languages:
| Language | Library | Key Difference |
|---|---|---|
| Haskell | QuickCheck | The original. Generates and shrinks based on type alone. |
| Python | Hypothesis | Strategy-based generation with integrated shrinking and database. |
| JavaScript/TypeScript | fast-check | Inspired by QuickCheck and Hypothesis, strong TypeScript support. |
| Rust | proptest | Inspired by Hypothesis. Strategies are aware of constraints, avoiding rejected inputs. |
QuickCheck defines generation and shrinking per-type, which is simpler but less flexible. Hypothesis and proptest define them per-strategy, which means you can express constraints directly in the generator rather than filtering after the fact. fast-check follows a similar strategy-based model with good TypeScript integration.
If you work across multiple languages, the mental model carries over. Round-trip testing, idempotence checks, and oracle testing work the same way regardless of the library. Only the syntax and strategy names differ.
When Property-Based Testing Fits and When It Does Not
Property-based testing works best for functions with clear mathematical properties, well-defined input domains, and deterministic behavior. Pure functions, data transformations, parsers, serializers, and algorithms are ideal candidates.
It is less useful for code that is primarily about side effects (sending emails, writing to databases), code where the “correct” output is hard to specify without reimplementing the function, or GUI interactions. For these, traditional example-based tests or integration tests are a better fit.
A practical approach is to start with property-based tests for your core logic - the functions that transform data, validate inputs, or implement algorithms - and use example-based tests for integration and end-to-end scenarios. Even adding a handful of @given tests to a mature codebase tends to uncover bugs that years of example-based testing missed.