Rust for Python Developers: Rewrite Your Hot Paths for 10x Speed

Python is excellent for most of what developers throw at it - API servers, data pipelines, automation scripts, machine learning glue code. But CPU-bound work is a different story. When you’re parsing 500MB log files, running simulation loops, or crunching millions of rows in a tight inner loop, you’re going to hit a wall. Not always, but often enough that it becomes a real problem.

The solution is not to rewrite your entire application in Rust. That’s dramatic and usually unnecessary. The better approach is to profile your code, find the 5-10% that consumes most of the CPU time, and rewrite just that part in Rust. The rest of your codebase stays Python. Your interfaces stay Python. You just swap out the slow function for a fast one.

This is what powers Polars (a Rust-backed DataFrame library that beats pandas on most benchmarks), ruff (a Python linter written in Rust that’s 10-100x faster than alternatives), and the cryptography package. You don’t always see the Rust underneath, but it’s there doing the heavy lifting.

Finding What’s Actually Slow

Before writing any Rust, you need to know which functions are actually the problem. Guessing is a waste of time - the bottleneck is rarely where you think it is.

The standard starting point is cProfile, which ships with Python:

python -m cProfile -s cumtime my_script.py

This outputs a table sorted by cumulative time, showing which functions consume the most CPU. Something like:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    8.312    0.000   11.204    0.000 parser.py:42(parse_line)
        1    0.001    0.001   11.205   11.205 main.py:10(process_file)

If parse_line is called a million times and accumulates 11 seconds, that’s your target.

For finer-grained work, line_profiler lets you profile a specific function line by line. Decorate the function with @profile and run kernprof -l -v my_script.py. This tells you exactly which line is the bottleneck, which matters when the function is complex and non-obvious.
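One wrinkle: kernprof injects `profile` as a builtin at runtime, so a script decorated with `@profile` fails with a NameError when run normally. A common workaround - sketched here with a hypothetical `parse_line` - is a no-op fallback:

```python
# Fallback so the script also runs without kernprof.
try:
    profile  # injected as a builtin by `kernprof -l`
except NameError:
    def profile(func):
        # no-op decorator when line_profiler isn't active
        return func

@profile
def parse_line(line):
    # toy stand-in for the hot function you'd actually profile
    parts = line.split(",")
    return parts[0].strip()
```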

For production systems where you can’t easily instrument code, py-spy is a sampling profiler that attaches to a running Python process without any code changes:

py-spy record -o profile.svg --pid 12345

It generates a flamegraph you can open in a browser. No restarts, no instrumentation, and negligible overhead for a brief profiling session.

A py-spy flame graph — wider bars mean more time spent in that function, making hot paths immediately visible. py-spy can also show a live, top-style view of which functions are consuming CPU in a running process. (Image: py-spy)

Once you’ve found your hot path, benchmark it with timeit or pytest-benchmark and record the baseline. You’ll need that number later to verify your speedup is real. If your profiler also points to memory pressure alongside CPU time, see our guide on profiling Python memory usage — tools like memray and tracemalloc can reveal whether your hot path is also allocating heavily.
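A minimal timeit baseline might look like this - the `parse_line` body is a stand-in for your real hot function, and the absolute numbers will vary by machine:

```python
import timeit

def parse_line(line):
    # stand-in for the real hot function being benchmarked
    return line.split(",")[0].strip()

# repeat() returns one total per run; the minimum is the least noisy estimate
times = timeit.repeat(
    lambda: parse_line("GET /api/users, 200, 12ms"),
    number=100_000,
    repeat=5,
)
baseline = min(times)
print(f"baseline: {baseline:.3f}s per 100k calls")
```

Record that number; running the same harness against the Rust version later gives you an honest before/after comparison.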

Setting Up Rust and PyO3

If you don’t have Rust installed, the official way is through rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

The key tool for bridging Rust and Python is Maturin, which handles everything from compiling the Rust code to building a proper Python wheel. Install it with pip:

pip install maturin

To start a new Rust extension project:

maturin new --bindings pyo3 my_fast_lib
cd my_fast_lib

This creates a project with this structure:

my_fast_lib/
  src/
    lib.rs          # Your Rust code
  my_fast_lib/
    __init__.py     # Python wrapper
  Cargo.toml        # Rust dependencies and config
  pyproject.toml    # Python package config

During development, run maturin develop after each change. It compiles the extension and installs it into your current virtual environment in one step. For benchmarking, use maturin develop --release to get fully optimized builds - debug builds can be 5-10x slower than release builds.

If you use VS Code, the rust-analyzer extension gives you inline type annotations, autocomplete, and error highlighting. It makes working in Rust considerably more approachable when you’re coming from a dynamically typed language.

Writing Your First PyO3 Function

PyO3 is the library that makes all of this work. It provides Rust bindings for the CPython API and a set of macros that handle the boilerplate of registering functions, converting types, and managing reference counts. You don’t need to understand all of CPython’s internals to use it - the macros take care of the error-prone parts.

Here’s a simple PyO3 extension - a function that counts word frequencies in a string:

use pyo3::prelude::*;
use std::collections::HashMap;

#[pyfunction]
fn word_count(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let word = word.to_lowercase();
        let word = word.trim_matches(|c: char| !c.is_alphabetic());
        if !word.is_empty() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

#[pymodule]
fn my_fast_lib(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(word_count, m)?)?;
    Ok(())
}

The #[pyfunction] macro handles type conversions between Python and Rust automatically. &str in Rust maps to a Python string. HashMap<String, usize> maps to a Python dict. On the Python side, you call it like any other function:

import my_fast_lib

counts = my_fast_lib.word_count(large_text)

For a 50MB text file, a pure Python version of this function runs in roughly 400ms. The Rust version runs in about 40ms. Same interface, drop-in replacement.
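For reference, the pure Python version behind that comparison would look roughly like this - a sketch mirroring the Rust logic, not a measured implementation:

```python
def word_count(text: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for word in text.split():
        word = word.lower()
        # mirror Rust's trim_matches: strip non-alphabetic chars from both ends
        start, end = 0, len(word)
        while start < end and not word[start].isalpha():
            start += 1
        while end > start and not word[end - 1].isalpha():
            end -= 1
        word = word[start:end]
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts
```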

For operations that take a non-trivial amount of time, you should release the GIL so other Python threads can run concurrently:

#[pyfunction]
fn heavy_computation(py: Python, data: Vec<u8>) -> PyResult<Vec<u8>> {
    py.allow_threads(|| {
        // Pure Rust computation here - GIL is released
        Ok(process_data(data))
    })
}

This is important for multithreaded applications. Without releasing the GIL, your Rust function holds it for its entire duration, blocking every other Python thread.

Error handling uses PyResult<T>. To raise a Python exception from Rust:

use pyo3::exceptions::PyValueError;

#[pyfunction]
fn parse_thing(input: &str) -> PyResult<String> {
    if input.is_empty() {
        return Err(PyValueError::new_err("Input cannot be empty"));
    }
    Ok(input.to_uppercase())
}

On the Python side, this raises a standard ValueError that you catch with a normal try/except.

A Real Example: Log Processing

Here’s a concrete scenario. A Python service processes a 10-million-line log file, extracts fields with a regex, and aggregates request counts per endpoint. In Python it takes about 12 seconds.

The Python version:

import re
from collections import defaultdict

def aggregate_log(filepath):
    pattern = re.compile(r'GET (/[^\s]+) HTTP')
    counts = defaultdict(int)
    with open(filepath, 'rb') as f:
        for line in f:
            m = pattern.search(line.decode('utf-8', errors='replace'))
            if m:
                counts[m.group(1)] += 1
    return dict(counts)

Runtime: ~12 seconds on a 500MB log file.

The Rust rewrite skips the regex and parses the relevant fields manually using byte searching. For a predictable log format, this is considerably faster:

use pyo3::prelude::*;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

#[pyfunction]
fn aggregate_log(py: Python, filepath: &str) -> PyResult<HashMap<String, u64>> {
    let result = py.allow_threads(|| -> Result<HashMap<String, u64>, std::io::Error> {
        let file = File::open(filepath)?;
        let reader = BufReader::new(file);
        let mut counts: HashMap<String, u64> = HashMap::new();

        for line in reader.lines() {
            let line = line?;
            if let Some(start) = line.find("GET /") {
                let rest = &line[start + 4..];
                if let Some(end) = rest.find(' ') {
                    let path = &rest[..end];
                    *counts.entry(path.to_string()).or_insert(0) += 1;
                }
            }
        }
        Ok(counts)
    });

    result.map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))
}

Runtime: ~0.9 seconds. A 13x speedup with an identical call signature on the Python side.

For even more throughput, add rayon to your Cargo.toml:

[dependencies]
rayon = "1.10"
pyo3 = { version = "0.21", features = ["extension-module"] }

Rayon provides data-parallel iterators. You split the file into chunks and process them across all available CPU cores. Combined with py.allow_threads(), Python’s GIL is completely out of the picture for the duration of the computation. On an 8-core machine, the same 10-million-line file can come down to under 200ms.

Testing Your Extension

PyO3 extensions are tested like any other Python code. Once you’ve run maturin develop, import the module in your test file and write pytest tests for it:

from my_fast_lib import word_count, aggregate_log

def test_word_count_basic():
    counts = word_count("hello world hello")
    assert counts["hello"] == 2
    assert counts["world"] == 1

def test_word_count_empty():
    counts = word_count("")
    assert counts == {}

This is one of the practical advantages of the PyO3 approach over subprocess: you get Python’s full testing ecosystem working against your Rust code. pytest fixtures, parametrize, coverage tools - all of it works. The Rust code is just another Python module from the test runner’s perspective. For functions that handle varied or edge-case input, property-based testing with Hypothesis is worth adding — it generates hundreds of randomized inputs automatically and is particularly good at finding the boundary cases that handwritten examples miss.
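The differential idea behind that can be sketched without any extension installed: compare a fast implementation against an obviously-correct reference on randomized inputs. Here both sides are pure Python stand-ins (in a real suite, `fast_word_count` would be the import from `my_fast_lib`), so the shape of the test is the point:

```python
import random
import string
from collections import Counter

def reference_word_count(text):
    # deliberately simple "obviously correct" baseline
    return Counter(w.lower() for w in text.split())

# In a real suite: from my_fast_lib import word_count as fast_word_count
fast_word_count = reference_word_count

def test_matches_reference():
    rng = random.Random(42)  # seeded for reproducibility
    for _ in range(200):
        words = [
            "".join(rng.choices(string.ascii_letters, k=rng.randint(1, 8)))
            for _ in range(rng.randint(0, 30))
        ]
        text = " ".join(words)
        assert fast_word_count(text) == reference_word_count(text)
```

Hypothesis automates the input generation and, crucially, shrinks any failing input down to a minimal reproduction.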

For testing the Rust side independently, you can write standard Rust unit tests inside lib.rs using #[cfg(test)] blocks. These run with cargo test and don’t involve Python at all, which makes them faster to run during Rust development.

When a CLI Tool is Simpler

Not every hot path needs PyO3. If the operation is essentially one-way - data flows in, processed data flows out, and there’s no need to pass Python objects back and forth - a standalone Rust binary called via subprocess is often simpler to build and maintain:

import subprocess
import json

result = subprocess.run(
    ["./my_rust_tool", "--format=json"],
    input=raw_data,
    capture_output=True,
    timeout=30,
)
output = json.loads(result.stdout)

The advantages are real: no Python packaging complexity, works from any language, easier to test the Rust side in isolation, and you can distribute or update the binary independently. A CI pipeline that runs a Rust binary is simpler than one that builds a PyO3 wheel for multiple platforms.

The tradeoff is process startup overhead - typically 10-50ms on Linux depending on the binary size. That makes subprocess unsuitable for functions called in a loop or many times per second, but completely reasonable for operations that run once per batch job and take more than a few hundred milliseconds anyway.
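You can measure that overhead directly. The sketch below uses the Python interpreter as a stand-in binary, which is pessimistic - a small Rust binary starts faster - but it puts a concrete number on the fixed per-call cost:

```python
import subprocess
import sys
import time

runs = 10
start = time.perf_counter()
for _ in range(runs):
    # spawn a process that does nothing, measuring pure startup cost
    subprocess.run([sys.executable, "-c", "pass"], check=True)
elapsed = time.perf_counter() - start
print(f"average process startup: {elapsed / runs * 1000:.1f} ms")
```

If that per-call cost is small relative to the work the tool does, subprocess is fine.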

To ship a Rust binary alongside a Python package, include it in package_data in pyproject.toml and write a thin Python wrapper that locates the binary relative to the package directory. importlib.resources is the cleanest approach in modern Python (3.9+).
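A minimal locator along those lines - the package and binary names in the usage comment are hypothetical:

```python
from importlib import resources

def tool_path(package: str, binary: str) -> str:
    """Return the filesystem path of a binary shipped inside a package."""
    # files() gives a Traversable rooted at the installed package directory
    return str(resources.files(package).joinpath(binary))

# usage: subprocess.run([tool_path("my_package", "my_rust_tool"), "--format=json"], ...)
```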

Packaging and Distribution

When you’re ready to distribute your PyO3 extension, maturin build --release produces a wheel:

maturin build --release

Always use the --release flag for benchmarking and production - it enables full Rust optimizations. Debug builds skip these and can be many times slower.

Publishing to PyPI:

maturin publish

The tricky part with compiled extensions is that you need wheels for each target platform: Linux x86-64, Linux ARM64, macOS Intel, macOS Apple Silicon, Windows. The PyO3/maturin-action GitHub Actions workflow handles this by building on each platform in CI:

- uses: PyO3/maturin-action@v1
  with:
    command: build
    args: --release --out dist
    manylinux: auto

The manylinux target builds Linux wheels compatible with any glibc-based Linux distribution from roughly the past decade, which covers essentially all realistic Linux deployment targets.

One practical note: Maturin uses the Python package version from pyproject.toml for both the Python package and the Rust crate by default. Keep a single source of truth there and you won’t have to think about version sync.

What to Expect

The speedup from PyO3 varies significantly by workload. For CPU-bound work with tight loops, text parsing, or manual data transformations, 10-20x improvements are realistic. For operations that are mostly I/O-bound - network calls, database queries, disk reads - Rust won’t help much because the bottleneck isn’t the interpreter.

The development cost is also real. Rust has a learning curve, compilation adds time to your dev loop, and debugging across the FFI boundary is more involved than pure Python. It makes most sense when the performance gain is substantial and the hot path is stable - something you write once and don’t touch often. Rust’s standing as a systems language is also growing beyond application code: it is now an officially supported second language for Linux kernel development, which reflects the broader ecosystem’s confidence in its safety and performance guarantees.

Start small: pick your one slowest function, rewrite it as a PyO3 extension, and measure the result. If the speedup justifies the maintenance overhead, expand from there. Projects that benefit from this approach typically end up with a handful of Rust functions - not a Rust codebase with a thin Python wrapper on top.

It’s also worth being honest about when this isn’t the right tool. If your bottleneck is a third-party library that you don’t control (say, a NumPy operation or a database driver), rewriting the caller in Rust does nothing - the bottleneck is still inside that library. If your bottleneck is network latency or database query time, no amount of faster parsing or aggregation on your end helps. Profile first, and make sure you understand why something is slow before deciding how to fix it.

The PyO3 documentation is thorough and covers terrain this post doesn’t - async Python/Rust interop, working with NumPy arrays directly via the numpy crate, shared memory patterns, and more advanced GIL management. It’s worth reading once you have a basic extension working.