Hypothesis Property Testing: Find Edge Cases Automatically

Property-based testing with Hypothesis lets you define what your code must do. One classic rule: “encode, then decode, and you get the same input back.” Hypothesis then makes up hundreds of random inputs and hunts for cases that break the rule. You don’t write test cases by hand. You sketch the shape of valid inputs. The tool finds the off-by-one bugs, the odd Unicode strings, and the edge cases hiding in your code.
Install it with pip install hypothesis. Write strategies to sketch your input space, tag your test with @given(), and let the fuzzer do the work. Hypothesis is on version 6.151.x. It plugs into pytest and unittest, runs on any CI, and has found bugs in CPython’s standard library, NumPy, and dozens of other open-source projects.
Why Example-Based Tests Leave Gaps
A typical unit test for a sort function might check that sort([3, 1, 2]) returns [1, 2, 3] and sort([]) returns []. That covers two cases. But what about negative numbers? Duplicates? Very large lists? Lists with float('nan')? Lists of one element? These are the inputs that cause real bugs. They are also the inputs most developers forget to test.
The core problem with example-based testing is selection bias. You pick inputs that match your mental model of how the code works. If you wrote the code, you already have blind spots: the same blind spots that produced the bug in the first place.
Property-based testing flips the approach. Instead of listing input-output pairs, you list rules that must hold for all valid inputs. For a sort function, those rules might be:
- The output has the same length as the input
- Each item at index i is no greater than the item at i+1
- The output holds the same items as the input (same multiset)
Hypothesis then makes up random inputs: ints, strings, nested data, whatever you describe. It checks each one against your rules. When it finds an input that breaks a rule, it does not just print a random failing case. It shrinks the input to the smallest example it can find. You get a minimal reproducing case, and debugging gets much easier.
Four ideas sit at the core of Hypothesis. Strategies like st.integers(), st.text(), and st.lists() make up random data. Properties are test functions wrapped with @given(). Shrinking cuts a failing input down to a small example. And the Hypothesis database (the .hypothesis/ folder) keeps every failing example it has found and replays them on later runs. Once a bug is found, it stays in your suite even after the random seed moves on.
Core Strategies and the @given Decorator
Strategies are the building blocks of Hypothesis. Each one describes a space of possible values. Hypothesis samples from that space while the test runs.
Basic Strategies
The most common built-in strategies cover Python’s primitive types:
from hypothesis import given, strategies as st

# Integers with optional bounds
@given(st.integers())
def test_abs_non_negative(n):
    assert abs(n) >= 0

# Floats, often excluding NaN for numeric properties
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_float_round_trip(x):
    assert float(str(x)) == x or x != x

# Text with constraints on alphabet and length
@given(st.text(min_size=1, alphabet=st.characters(categories=("L", "N"))))
def test_non_empty_string(s):
    assert len(s) > 0

# Booleans, binary data, None
@given(st.booleans())
def test_bool_is_int(b):
    assert isinstance(b, int)

Collection Strategies
Collections compose basic strategies into more complex structures:
# Lists of integers, bounded in size
@given(st.lists(st.integers(), min_size=0, max_size=100))
def test_sort_preserves_length(lst):
    assert len(sorted(lst)) == len(lst)

# Dictionaries with string keys and integer values
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_dict_keys(d):
    for key in d:
        assert isinstance(key, str)

# Tuples with mixed types
@given(st.tuples(st.integers(), st.text()))
def test_tuple_unpacking(t):
    n, s = t
    assert isinstance(n, int)
    assert isinstance(s, str)

Composing and Transforming Strategies
You can combine strategies in several ways:
- st.one_of(st.integers(), st.text()) draws from either strategy (union types)
- strategy.map(func) transforms drawn values. For example, st.integers().map(abs) gives only zero or positive ints
- strategy.filter(predicate) trims values. Use it sparingly. Filters that reject too much raise Unsatisfied errors
- st.builds(MyClass, name=st.text(), age=st.integers(min_value=0)) builds class instances by passing strategies as args (combined sketch below)
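A short sketch combining these combinators. The User dataclass is a hypothetical stand-in, not part of Hypothesis:

from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class User:  # hypothetical domain class for illustration
    name: str
    age: int

# one_of: either an int or a string
int_or_text = st.one_of(st.integers(), st.text())

# map: non-negative ints without any filtering
non_negative = st.integers().map(abs)

# filter: use sparingly; this one rejects only a single value
non_zero = st.integers().filter(lambda x: x != 0)

# builds: construct User instances from per-field strategies
users = st.builds(User, name=st.text(min_size=1),
                  age=st.integers(min_value=0, max_value=150))

@given(users)
def test_user_age_bounds(user):
    assert 0 <= user.age <= 150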
Controlling Test Volume
The @settings decorator controls how many inputs Hypothesis tries:
from hypothesis import given, settings, strategies as st

@given(st.lists(st.integers()))
@settings(max_examples=500)
def test_sort_idempotent(lst):
    assert sorted(sorted(lst)) == sorted(lst)

The default is 100 examples per test. You can pin key inputs alongside the random ones with @example:
from collections import Counter
from hypothesis import given, example, strategies as st

@given(st.lists(st.integers()))
@example([])  # always test empty list
@example([0, 0, 0])  # always test duplicates
def test_sort_preserves_elements(lst):
    result = sorted(lst)
    assert Counter(result) == Counter(lst)

Writing Effective Properties
Picking which rules to assert is where most people get stuck. The tools are simple. The hard part is turning “my code should be correct” into a concrete, testable rule. The patterns below cover the most common cases.
Round-Trip (Inverse) Property
If you have an encode/decode pair, the round-trip property is the most natural fit:
import json
from hypothesis import given, strategies as st
@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data):
    assert json.loads(json.dumps(data)) == data

This pattern fits serializers like JSON, msgpack, and protobuf. It fits encoders like base64 and URL encoding. It fits compression and crypto. Any two functions that undo each other are a fit.
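The same shape works for binary encoders. A sketch for base64 over arbitrary bytes:

import base64
from hypothesis import given, strategies as st

@given(st.binary())
def test_base64_round_trip(data):
    # encode then decode must return the original bytes
    assert base64.b64decode(base64.b64encode(data)) == data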
Idempotence Property
A function is idempotent if applying it twice gives the same result as applying it once:
@given(st.text())
def test_strip_idempotent(s):
    assert s.strip().strip() == s.strip()

This works for formatting, Unicode normalization (NFC), dedup, URL clean-up, and HTML sanitization. If f(f(x)) == f(x) doesn’t hold, you have a bug.
Invariant Preservation
After an operation, some structural properties must still hold:
@given(st.lists(st.integers(), min_size=1))
def test_sort_ordered(lst):
    result = sorted(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

More cases: after a push into a balanced BST, the tree stays balanced. After a database migration, row counts are preserved. After adding an item to a cart, the total stays at zero or above.
Oracle (Differential) Testing
Compare your code against a known-correct reference:
import statistics
from hypothesis import given, strategies as st
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_custom_median(data):
    assert my_fast_median(data) == statistics.median(data)

Use this to check fast code against slower but trusted reference code. Use it to test C extensions against their pure-Python fallbacks.
Commutativity and Associativity
For operations that should be order-independent:
from hypothesis import given, strategies as st
@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a

Handy for custom numeric types, merge ops, and set-like data structures.
Using assume() for Preconditions
When inputs must meet a condition, use assume() to skip the bad ones:
from hypothesis import given, assume, strategies as st
@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)
    assert (a // b) * b + (a % b) == a

Prefer to constrain the strategy itself when you can (e.g., st.integers().filter(lambda x: x != 0) or st.integers(min_value=1)). assume() throws inputs out after they are drawn, which is wasteful.
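For comparison, the same property with the zero divisor excluded in the strategy itself, so no draws are thrown away:

from hypothesis import given, strategies as st

@given(st.integers(), st.integers().filter(lambda x: x != 0))
def test_division_constrained(a, b):
    # b is never zero here, so no assume() call is needed
    assert (a // b) * b + (a % b) == a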
Custom Strategies with @st.composite
Some domain data can’t be built from basic strategy combinators. For those, the @st.composite decorator lets you write plain Python code that draws from other strategies:
from hypothesis import strategies as st
@st.composite
def ordered_pairs(draw):
    """Generate a tuple (a, b) where a <= b."""
    a = draw(st.integers())
    b = draw(st.integers(min_value=a))
    return (a, b)

@st.composite
def user_records(draw, min_age=0, max_age=150):
    """Generate realistic user records."""
    name = draw(st.text(min_size=1, max_size=50,
                        alphabet=st.characters(categories=("L",))))
    age = draw(st.integers(min_value=min_age, max_value=max_age))
    email = draw(st.emails())
    return {"name": name, "age": age, "email": email}

The draw function pulls a value from any strategy. You can use if-statements, loops, and ties between drawn values. This is how you build strategies for complex domain objects: database rows, API payloads, graph structures.
Shrinking, Reproducing, and Debugging Failures
When Hypothesis finds a failing input, it goes one step further. It shrinks that input to the smallest counterexample it can find.
How Shrinking Works
Say st.lists(st.integers()) draws [483, -29, 0, 17, -6, 92] and the test fails. Hypothesis then tries smaller lists and simpler integers. It keeps going until it lands on something like [0, -1] that still triggers the bug. The shrink is deterministic and cached. Running the test again replays the exact minimal failing input from .hypothesis/examples/ before any new random inputs.
The payoff is bigger than it sounds. A raw failing input of [483, -29, 0, 17, -6, 92] tells you almost nothing about the root cause. A shrunk input of [0, -1] points straight at zero or negative numbers. That is a very different debugging experience.
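A sketch of shrinking in action; bad_abs is a deliberately buggy helper invented for this example:

from hypothesis import given, strategies as st

def bad_abs(n):
    # hypothetical bug: handles large negatives but forgets small ones
    return -n if n <= -10 else n

@given(st.integers())
def test_bad_abs_non_negative(n):
    assert bad_abs(n) >= 0

Any n between -9 and -1 fails this test. Whatever failing value Hypothesis draws first, it shrinks toward zero and reports n=-1, the simplest counterexample.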
Reproducing Failures
Hypothesis provides several mechanisms for reproduction:
- The .hypothesis/ database folder keeps every failing example it has found and replays them on its own
- The @reproduce_failure(...) decorator is printed in the error output. Paste it into the test file to pin the exact failing input
- @settings(verbosity=Verbosity.verbose) shows every input Hypothesis tries, both while drawing and while shrinking (sketch below)
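A minimal sketch of the verbosity setting on an ordinary property test:

from hypothesis import Verbosity, given, settings, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.lists(st.integers()))
def test_sort_verbose(lst):
    # every drawn list is printed, then every shrink attempt on failure
    assert sorted(sorted(lst)) == sorted(lst)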
Stateful Testing with RuleBasedStateMachine
Some systems hold state: APIs, databases, data structures with many ops. For those, Hypothesis ships RuleBasedStateMachine. Instead of testing one function at a time, you define each op as a rule. Hypothesis then runs random sequences of those rules:
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st
class StackMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.model = []  # reference implementation

    @rule(value=st.integers())
    def push(self, value):
        self.stack.append(value)
        self.model.append(value)

    @rule()
    def pop(self):
        if self.model:
            expected = self.model.pop()
            actual = self.stack.pop()
            assert actual == expected

    @invariant()
    def lengths_match(self):
        assert len(self.stack) == len(self.model)

TestStack = StackMachine.TestCase

Hypothesis runs random sequences of push and pop calls. When a sequence fails, it shrinks both the inputs and the call order. You get a minimal reproducing trace. This catches bugs that only show up after a specific sequence of calls. That is the kind of bug you can almost never cover by hand.
Integrating Hypothesis into a Real Project
Hypothesis pays off most when it runs on every commit. The usual setup uses profiles: a light one for local dev and a heavy one for CI.
Setting Up Profiles
Create a conftest.py with one profile per environment:
# conftest.py
import os
from hypothesis import settings, HealthCheck, Verbosity

settings.register_profile(
    "ci",
    max_examples=1000,
    deadline=None,  # CI runners are slow, avoid spurious timeouts
)
settings.register_profile(
    "dev",
    max_examples=50,
    deadline=200,  # fast feedback during development
)
settings.register_profile(
    "debug",
    max_examples=10,
    verbosity=Verbosity.verbose,
)

# Activate whichever profile the environment names, defaulting to dev
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

Select the profile via environment variable: HYPOTHESIS_PROFILE=ci pytest.
CI Configuration
In GitHub Actions or GitLab CI:
- Run with the ci profile and deadline=None so slow runners don’t trigger spurious fails
- Cache the .hypothesis/ folder as a CI artifact so found bugs stick around across runs
- Hypothesis is safe to run in parallel with pytest-xdist (-n auto). Each worker gets its own random seed
# .github/workflows/test.yml (relevant excerpt)
- name: Run tests
  env:
    HYPOTHESIS_PROFILE: ci
  run: pytest -n auto --hypothesis-seed=0
- uses: actions/cache@v4
  with:
    path: .hypothesis
    key: hypothesis-${{ runner.os }}-${{ hashFiles('tests/**') }}

Auto-Generating Test Stubs
The hypothesis write CLI builds property-based test stubs from type hints:
hypothesis write mymodule.my_function

It reads the function’s type hints and spits out a test that uses guessed strategies. Handy for bootstrapping property tests across an old codebase without writing every strategy by hand.
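The CLI also has modes for common property shapes. A couple of hedged examples; mymodule.normalize is a hypothetical target:

# round-trip test for an encode/decode pair
hypothesis write --roundtrip json.dumps json.loads > tests/test_json_props.py

# idempotence test for a normalizer
hypothesis write --idempotent mymodule.normalize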
Health Checks and Performance
Hypothesis ships health checks that warn you when a strategy is too slow or filters too much. Turn them off only when you truly need to:
@settings(suppress_health_check=[HealthCheck.too_slow])
@given(expensive_strategy())
def test_with_slow_setup(data):
    ...

For coverage reports, pair Hypothesis with pytest-cov. pytest --cov --hypothesis-seed=0 pins the seed so coverage numbers stay stable. In practice, property tests tend to lift branch coverage by 10-20% over example tests alone. They reach input combos you would never write by hand.
Property-Based Testing Across Languages
Hypothesis is the leading property-based testing library for Python. The idea started with QuickCheck in Haskell and has since spread to most major languages:
| Language | Library | Key Difference |
|---|---|---|
| Haskell | QuickCheck | The original. Generates and shrinks based on type alone. |
| Python | Hypothesis | Strategy-based generation with integrated shrinking and database. |
| JavaScript/TypeScript | fast-check | Inspired by QuickCheck and Hypothesis, strong TypeScript support. |
| Rust | proptest | Inspired by Hypothesis. Strategies are aware of constraints, avoiding rejected inputs. |
QuickCheck ties draws and shrinks to each type. That is simpler but less flexible. Hypothesis and proptest tie them to each strategy. You can bake limits into the strategy rather than filter after the draw. fast-check uses a similar per-strategy model with strong TypeScript support.
If you work in many languages, the mental model carries over. Round-trip tests, idempotence checks, and oracle tests work the same way in every library. Only the syntax and the strategy names change.
When Property-Based Testing Fits and When It Does Not
Property-based testing fits best when a function has clear mathematical rules, a well-defined input domain, and deterministic output for a given input. Pure functions, data transforms, parsers, serializers, and algorithms are prime fits.
It is less useful for code that is mostly side effects, like sending emails or writing to a database. It is also weak when the “right” output is hard to pin down without rebuilding the function. The same goes for GUI flows. For those, plain example-based tests or integration tests work better.
A good rule: start with property-based tests for your core logic. That is, the code that transforms data, validates inputs, or runs algorithms. Keep example-based tests for end-to-end flows. Even a handful of @given tests bolted onto a mature codebase tends to find bugs that years of example-based testing missed.