How to structure pytest-geo for large shapefiles

When validating multi-gigabyte .shp datasets in automated pipelines, naive pytest configurations routinely trigger memory exhaustion, I/O bottlenecks, and non-deterministic timeout failures. The core challenge is not merely running spatial assertions, but architecting a test harness that respects filesystem constraints, enforces strict validation boundaries, and scales across parallel CI runners. This guide provides a production-ready blueprint for GIS QA engineers, data engineers, and platform teams deploying geospatial QA automation at scale.

Architectural Blueprint & Directory Layout

A scalable spatial test repository must isolate heavy binaries from version control, enforce deterministic fixture lifecycles, and align with established Geospatial QA Fundamentals & Architecture. The following structure decouples test logic from raw data while maintaining strict reproducibility:

project-root/
├── conftest.py                 # Root-level fixtures, session-scoped setup
├── pyproject.toml              # pytest config, xdist, timeout, markers
├── tests/
│   ├── geo/
│   │   ├── conftest.py         # Spatial-specific fixtures, lazy loaders
│   │   ├── test_topology.py
│   │   ├── test_schema.py
│   │   └── test_crs_alignment.py
│   └── unit/
│       └── test_transforms.py
├── data/                       # .gitignored, mounted via CI artifact/cache
│   └── large_dataset/
│       ├── boundaries.shp
│       ├── boundaries.shx
│       ├── boundaries.dbf
│       └── boundaries.prj
└── ci/
    └── spatial_cache_policy.yaml

This layout enforces a clear boundary between test execution and data provisioning. Large shapefiles should never be committed to Git; instead, they are provisioned via CI artifact storage, cloud object mounts, or deterministic synthetic generators. The tests/geo/conftest.py layer becomes the single source of truth for spatial fixture injection, while root conftest.py handles session-level resource pooling and teardown.
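Because the data directory is provisioned outside Git, a test run can start with an incomplete bundle. The following is a minimal sketch of a pre-flight check, not part of any library: `missing_sidecars` and `REQUIRED_SUFFIXES` are hypothetical names, and treating `.prj` as mandatory is an assumption based on the layout above.

```python
from pathlib import Path

# Sidecar extensions taken from the directory layout above. Treating .prj
# as required is an assumption of this sketch, not a format rule.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf", ".prj")


def missing_sidecars(shp_path):
    """Return the expected component files that are absent next to shp_path."""
    shp_path = Path(shp_path)
    return [
        shp_path.with_suffix(suffix)
        for suffix in REQUIRED_SUFFIXES
        if not shp_path.with_suffix(suffix).is_file()
    ]
```

A path-resolving fixture could call this helper and skip or fail early when the returned list is non-empty, instead of letting each spatial test fail with a lower-level I/O error.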

Lazy-Loading Fixtures & Memory Management

Loading a 500 MB shapefile into memory for every test function is unsustainable. The fixture layer should load data lazily, build a spatial index once, and iterate in chunks. Production-grade setups use fiona for low-level streaming and reach for geopandas only when vectorized operations are strictly necessary. Consult the official Fiona documentation for driver configuration and streaming best practices.

# tests/geo/conftest.py
import pytest
import fiona
from pathlib import Path
from shapely.geometry import shape
from shapely import STRtree

@pytest.fixture(scope="session")
def large_shapefile_path():
    """Resolve path to large shapefile from CI artifact cache."""
    return Path("/opt/ci/data/large_dataset/boundaries.shp")

@pytest.fixture(scope="session")
def spatial_index(large_shapefile_path):
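    """
    Build a Shapely STRtree once per session.

    All features are read into memory a single time here, so tests share
    one copy instead of re-reading the file.
    Returns (tree, features) where features is a list of fiona features.
    """
    with fiona.open(large_shapefile_path) as src:
        features = list(src)
    # Geometry order matches feature order, so the integer indices returned
    # by tree.query() in Shapely 2.x can be used to look up features.
    geoms = [shape(f["geometry"]) for f in features]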
    tree = STRtree(geoms)
    return tree, features

By building the index once per session, you avoid re-reading the file and paying the $O(N)$ load cost in every test. Note that the fixture still holds every geometry in memory for the lifetime of the session: STRtree indexes bounding envelopes but keeps references to the input geometries. When that one-time footprint is itself too large, or when full DataFrame operations are unavoidable, use pyogrio.read_dataframe with skip_features / max_features for windowed reads rather than loading the entire file.
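The chunked iteration mentioned in this section can be sketched with the standard library alone. The helper below is a hypothetical illustration (`iter_chunks` is not a fiona or pytest API); it works on any iterable, which includes an open fiona collection.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(records)
    while True:
        # islice pulls only the next chunk_size items, so at most one
        # chunk is materialized at a time.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Inside a test this might be used as `for chunk in iter_chunks(src, 10_000): ...` within a `with fiona.open(path) as src:` block, so that peak memory is bounded by the chunk size rather than the file size. The chunk size of 10,000 is an arbitrary placeholder to tune against your feature complexity.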

Spatial Assertion Strategy & Test Pyramid Alignment

Structuring spatial tests requires strict adherence to validation tiers. Aligning your suite with Understanding the GIS Test Pyramid ensures that heavy integration tests only execute after fast, deterministic unit checks pass.

At the base, unit tests validate coordinate transformations, projection math, and geometry constructors using lightweight mocking strategies. Mid-tier integration tests apply spatial assertion patterns to verify topology rules, CRS alignment, and attribute schema compliance. Heavy end-to-end validations run against cached production snapshots, enforcing strict scoping rules to prevent redundant full-dataset scans and assertion drift.

# tests/geo/test_topology.py
from shapely.geometry import shape
from shapely.validation import explain_validity

def test_polygon_validity(spatial_index):
    """Validate topology for each feature in the session-scoped index."""
    tree, features = spatial_index
    invalid = []
    for feat in features:
        geom = shape(feat["geometry"])
        if not geom.is_valid:
            invalid.append((feat["id"], explain_validity(geom)))
    assert not invalid, f"Invalid geometries detected: {invalid[:5]}"

CI/CD Orchestration & Parallel Execution

Parallelizing spatial tests introduces race conditions around shared file handles and temporary index files. Use pytest-xdist with explicit worker isolation, and keep in mind that each xdist worker is a separate process: a session-scoped fixture runs once per worker, not once per run, so an in-memory index is duplicated across workers. Have such fixtures open the data read-only or copy it into a per-worker directory. Configure pyproject.toml to enforce strict timeouts and disable flaky network-dependent assertions. Refer to the official pytest-xdist documentation for worker distribution strategies and --dist modes.

# pyproject.toml
[tool.pytest.ini_options]
# -n requires the pytest-xdist plugin; --timeout requires pytest-timeout
addopts = "-n auto --timeout=300 --strict-markers"
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: requires full dataset cache"
]
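The per-worker isolation described in this section can be sketched as follows. This relies on the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in each worker process; the function name `worker_scratch_dir` and the `geo-tests-` directory prefix are hypothetical choices for illustration.

```python
import os
import tempfile
from pathlib import Path


def worker_scratch_dir(base_dir=None):
    """Create and return a scratch directory unique to this xdist worker."""
    # pytest-xdist sets PYTEST_XDIST_WORKER to "gw0", "gw1", ... in each
    # worker process. It is unset when tests run without -n, in which
    # case this sketch falls back to the label "main".
    worker = os.environ.get("PYTEST_XDIST_WORKER", "main")
    base = Path(base_dir) if base_dir else Path(tempfile.gettempdir())
    scratch = base / f"geo-tests-{worker}"
    scratch.mkdir(parents=True, exist_ok=True)
    return scratch
```

A session-scoped fixture could call this once and write temporary index files or dataset copies there, so that no two workers share a writable path. Inside fixtures, pytest's built-in `tmp_path_factory` is an alternative that handles cleanup for you.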

Implement a deterministic cache policy in your CI runner to mount read-only shapefile bundles. When cache misses occur, fall back to synthetic generators that produce statistically representative geometries rather than downloading multi-gigabyte archives. Isolate worker environments using ephemeral containers or isolated temp directories to prevent cross-test state pollution.
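As a rough illustration of the synthetic fallback, the sketch below yields seeded, reproducible GeoJSON-like polygon features. It is deliberately simplistic: uniform small squares are not statistically representative of real boundaries, and `synthetic_squares`, its parameters, and the `name` property are all hypothetical. A production generator would sample sizes, vertex counts, and attributes from the real dataset's distribution.

```python
import random


def synthetic_squares(count, seed=0, extent=(-180.0, -90.0, 180.0, 90.0), size=0.01):
    """Yield GeoJSON-like square polygon features with reproducible coordinates."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    min_x, min_y, max_x, max_y = extent
    for feature_id in range(count):
        # Lower-left corner, kept far enough from the edge that the
        # square stays inside the extent.
        x = rng.uniform(min_x, max_x - size)
        y = rng.uniform(min_y, max_y - size)
        # Closed exterior ring: the first and last vertex are identical.
        ring = [(x, y), (x + size, y), (x + size, y + size), (x, y + size), (x, y)]
        yield {
            "type": "Feature",
            "id": str(feature_id),
            "properties": {"name": f"synthetic-{feature_id}"},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        }
```

Because the same seed always produces the same features, a cache miss yields identical test inputs on every runner, which keeps failures reproducible.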

Security & Scoping Boundaries

Processing untrusted spatial data in automated pipelines introduces path traversal, CRS injection, and malformed geometry risks. Enforce strict security boundaries by validating file signatures before ingestion, restricting which drivers Fiona and GDAL are allowed to use, and sandboxing test runners in ephemeral containers. Always normalize CRS inputs to a canonical EPSG code before running spatial joins, and strip external metadata that could leak pipeline secrets.
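Two of these checks, signature validation and path containment, can be sketched with the standard library. The header constants follow the ESRI Shapefile main-file header (file code 9994 stored big-endian in bytes 0-3, version 1000 stored little-endian in bytes 28-31, 100-byte header). The function names are hypothetical, and passing this check only shows that a file looks like a .shp; it does not prove the contents are safe.

```python
import struct
from pathlib import Path

SHP_FILE_CODE = 9994   # bytes 0-3, big-endian
SHP_VERSION = 1000     # bytes 28-31, little-endian
SHP_HEADER_SIZE = 100  # fixed main-file header length in bytes


def has_shapefile_signature(path):
    """Return True if the file starts with a plausible .shp main header."""
    with open(path, "rb") as handle:
        header = handle.read(SHP_HEADER_SIZE)
    if len(header) < SHP_HEADER_SIZE:
        return False
    (file_code,) = struct.unpack(">i", header[0:4])
    (version,) = struct.unpack("<i", header[28:32])
    return file_code == SHP_FILE_CODE and version == SHP_VERSION


def is_within(root, candidate):
    """Return True if candidate resolves to a location inside root."""
    root = Path(root).resolve()
    resolved = Path(candidate).resolve()
    # resolve() collapses ".." segments and symlinks before the comparison.
    return resolved == root or root in resolved.parents
```

Running both checks before handing a path to fiona rejects obviously mislabeled files and paths that escape the mounted data directory.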

By combining lazy fixture loading, strict assertion scoping, and hardened CI orchestration, teams can validate enterprise-scale shapefiles without compromising pipeline velocity or infrastructure stability.