
A Pants Plugin for Automatically Collecting Python Sources

Author
Kamil Muszyński

Background

Imagine a data science team running experiments on Databricks clusters, with a growing shared library - utils, ML models, feature transformations. The convenient way to get it onto a cluster is to package it as a Python wheel and install it via the Databricks UI: just spin up a new cluster, install a single wheel, and start experimenting.

I ran into this exact setup when migrating a codebase to a monorepo managed by Pants, and discovered it was surprisingly tricky to make it work as expected.


The Problem

Let’s take a look at a very simple package layout:

mypkg/
  __init__.py
  foo.py
  BUILD

When you use Pants to build mypkg as a Python wheel, you define a python_distribution target and list its dependencies. For a package that lives in a single directory, this is straightforward:

# mypkg/BUILD
python_sources()

python_distribution(
    name='wheel',
    dependencies=[':mypkg'],  # all .py files, as discovered by python_sources()
    provides=python_artifact(name='mypkg', version='0.0.1'),
)

The :mypkg address refers to the files discovered by the python_sources() target from the same BUILD file - the : prefix means “this directory”, and mypkg is the auto-assigned name (derived from the directory name).
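To make the shorthand concrete, here is a minimal sketch of how the two address forms used in this article expand. The resolve_address helper is hypothetical and written only for illustration; it is not Pants' own address parser and ignores everything beyond the `path:name` form.

```python
from pathlib import PurePosixPath


def resolve_address(spec: str, build_dir: str) -> str:
    """Expand a BUILD-file address into its full `path:name` form.

    Simplified model for illustration; it ignores generated targets,
    parametrization, and other parts of the real address syntax.
    """
    path, sep, name = spec.partition(":")
    if not path:
        # A leading ":" means "the directory of the current BUILD file".
        path = build_dir
    if not sep:
        # No explicit name: default to the last path component.
        name = PurePosixPath(path).name
    return f"{path}:{name}"


print(resolve_address(":mypkg", "mypkg"))        # mypkg:mypkg
print(resolve_address("mypkg/nested", "mypkg"))  # mypkg/nested:nested
```

So `:mypkg` written inside mypkg/BUILD and the full `mypkg:mypkg` name the same target.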

In this simple layout, everything looks nice and easy. The problem shows up the moment your package grows nested subpackages:

mypkg/
  __init__.py
  foo.py
  BUILD
  nested/
    __init__.py
    bar.py
    BUILD        ← its own python_sources() target, mypkg/nested

Now you have to manually list the nested target too, otherwise it won’t be included in the wheel:

python_distribution(
    name='wheel',
    dependencies=[
        ':mypkg',
        'mypkg/nested',  # new dependency
    ],
    ...
)

So you have to explicitly list every directory that contains source code. As the library grows, team members will add more subdirectories and forget to add them to the dependencies list. The result is a wheel that gets built and installed without the functionality they need.

Annoyingly, Pants doesn’t support globs or recursive specs (like mypkg::) in dependencies fields - that syntax only works on the command line.
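One stopgap, short of a plugin, is a check that flags forgotten subdirectories. The sketch below is a hypothetical helper, not a Pants feature: it assumes every source directory has its own BUILD file and that you keep the listed directories in a plain set.

```python
import tempfile
from pathlib import Path


def dirs_with_build_files(package_root: Path) -> set[str]:
    """Directories at or below package_root that contain a BUILD file."""
    base = package_root.parent
    return {p.parent.relative_to(base).as_posix() for p in package_root.rglob("BUILD")}


def forgotten_dirs(package_root: Path, listed: set[str]) -> list[str]:
    """Directories with a BUILD file that are absent from the listed set."""
    return sorted(dirs_with_build_files(package_root) - listed)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the article's layout in a throwaway directory.
    pkg = Path(tmp) / "mypkg"
    (pkg / "nested").mkdir(parents=True)
    (pkg / "BUILD").touch()
    (pkg / "nested" / "BUILD").touch()

    # Only the top-level directory is listed, as in the first BUILD file.
    found = forgotten_dirs(pkg, listed={"mypkg"})

print(found)  # ['mypkg/nested']
```

A check like this only reports the gap. Someone still has to edit the dependencies list by hand.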

Why Not Just Use sources=["**/*.py"]?

Pants’ built-in python_sources() accepts a sources glob, so you could write:

python_sources(sources=["**/*.py"], name='everything')

This makes a single target claim all .py files under the directory recursively, which means you can drop the nested BUILD files entirely. It solves the problem, but you lose the ability to target subdirectories independently (e.g. pants <goal> mypkg/nested:: stops working). You also lose the ability to have nested BUILD files with custom config, e.g. skipping mypy for subdirs.
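To see which files such a glob claims, here is a small sketch that uses pathlib as a stand-in. It assumes that, for this layout, Pants' `**` matches the same files as pathlib's recursive glob; Pants has its own glob implementation.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the nested layout from the article.
    pkg = Path(tmp) / "mypkg"
    (pkg / "nested").mkdir(parents=True)
    for rel in ("__init__.py", "foo.py", "nested/__init__.py", "nested/bar.py"):
        (pkg / rel).touch()

    # "**/*.py" matches .py files in the directory itself and in every subdirectory.
    matched = sorted(p.relative_to(pkg).as_posix() for p in pkg.glob("**/*.py"))

print(matched)
# ['__init__.py', 'foo.py', 'nested/__init__.py', 'nested/bar.py']
```

All four files end up owned by the one target in mypkg/, which is why the nested BUILD file has nothing left to own.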

So, what else can we do? It turns out Pants is quite extensible - with a bit of digging, you can add a custom plugin and create a new target type just for our use case.


The Idea

The goal is to have a single target that automatically aggregates all Python sources under a directory (equivalent of mydir::), and then use it as a dependency for python_distribution instead of listing every subdirectory by hand.

To achieve this, we use Pants’ dependency inference - a plugin point where you can register custom rules that add dependencies to a target at build time. Our rule receives the full list of targets in the repository and filters it down to the python_sources targets found under our specific root - those become the inferred dependencies of our python_library target.
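Stripped of everything Pants-specific, the idea is a filter over a list. The sketch below uses a hypothetical FakeTarget class in place of real Pants targets, so it only illustrates the selection logic, not the plugin API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTarget:
    """Hypothetical stand-in for a Pants target: a type alias plus a directory."""

    alias: str
    spec_path: str


def collect_sources(all_targets: list[FakeTarget], root: str) -> list[FakeTarget]:
    """Keep python_sources targets located in `root` or below it."""
    return [
        t
        for t in all_targets
        if t.alias == "python_sources"
        and (t.spec_path == root or t.spec_path.startswith(root + "/"))
    ]


repo = [
    FakeTarget("python_sources", "mypkg"),
    FakeTarget("python_sources", "mypkg/nested"),
    FakeTarget("python_distribution", "mypkg"),  # wrong type, skipped
    FakeTarget("python_sources", "otherpkg"),    # outside the root, skipped
]
selected = collect_sources(repo, "mypkg")
print([t.spec_path for t in selected])  # ['mypkg', 'mypkg/nested']
```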

The Solution: A python_library Plugin

Let’s first see what the final BUILD file looks like:

# mypkg/BUILD
python_sources()

python_library(
    name='lib',
    root='mypkg',  # points to `mypkg` root dir, fetching all nested python source targets
)

python_distribution(
    name='wheel',
    dependencies=[':lib'],   # only one dependency on our new custom target defined by `python_library`
    provides=python_artifact(name='mypkg', version='0.0.1'),
)

This way, the wheel building target automatically picks up any future source files added to mypkg. Below you can find detailed steps on how to create such a plugin.

Project Structure Overview

Here is a sample repository structure. The plugin code is located in pants-plugins/, at the repo root, alongside your application code. Pants loads it like any regular Python package.

pants-plugins/
  python_library/
    __init__.py
    target_types.py
    rules.py
    register.py
mypkg/
  BUILD
  __init__.py
  foo.py
  nested/
    BUILD
    __init__.py
    bar.py
pants.toml

Step 1: Define the Target Type

We define a python_library target that acts as a container for all source targets under root, which we then pass as a dependency to python_distribution:

# pants-plugins/python_library/target_types.py
from pants.engine.target import (
    COMMON_TARGET_FIELDS,
    Dependencies,
    StringField,
    Target,
)


class PythonLibraryRootField(StringField):
    alias = "root"
    required = True
    help = "Root directory to recursively collect python_sources targets from."


class PythonLibraryTarget(Target):
    alias = "python_library"
    core_fields = (*COMMON_TARGET_FIELDS, PythonLibraryRootField, Dependencies)
    help = (
        "Collects all python_sources targets under `root` recursively. "
        "Use as a dependency in python_distribution to avoid manually listing subdirectories."
    )

What this does:

  • PythonLibraryTarget registers the python_library symbol in BUILD files. core_fields declares what fields the target accepts. COMMON_TARGET_FIELDS provides standard fields like tags and description that all targets should support. Dependencies is the standard Pants field that holds explicitly listed deps (still useful if someone wants to mix manual and inferred deps). The alias is what appears in BUILD files: python_library(...).
  • PythonLibraryRootField defines the new root field, which we can pass as a keyword argument to our python_library target. We will read its value in rules.py.

Step 2: Write the Dependency Inference Rule

This is the core of the plugin. Here is how we hook into Pants’ inference system to make python_library discover its sources:

# pants-plugins/python_library/rules.py
from dataclasses import dataclass

from pants.backend.python.target_types import PythonSourcesGeneratorTarget, PythonSourceTarget
from pants.engine.rules import collect_rules, rule
from pants.engine.target import (
    AllTargets,
    FieldSet,
    InferDependenciesRequest,
    InferredDependencies,
)
from pants.engine.unions import UnionRule

from .target_types import PythonLibraryRootField


@dataclass(frozen=True)
class PythonLibraryFieldSet(FieldSet):
    required_fields = (PythonLibraryRootField,)
    root: PythonLibraryRootField


class InferPythonLibraryDependencies(InferDependenciesRequest):
    infer_from = PythonLibraryFieldSet


@rule
async def infer_python_library_deps(
    request: InferPythonLibraryDependencies,
    all_targets: AllTargets,
) -> InferredDependencies:
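    root = request.field_set.root.value

    # Keep only Python source targets located in `root` or in one of its subdirectories.
    # Comparing against `root + "/"` stops a sibling such as `mypkg_tools` from matching `mypkg`.
    addresses = [
        t.address
        for t in all_targets
        if (
            t.address.spec_path == root or t.address.spec_path.startswith(root + "/")
        )
        and isinstance(t, (PythonSourcesGeneratorTarget, PythonSourceTarget))
    ]

    return InferredDependencies(addresses)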


def rules():
    return [
        *collect_rules(),
        UnionRule(InferDependenciesRequest, InferPythonLibraryDependencies),
    ]

There are three things happening here:

Targeting the right targets. PythonLibraryFieldSet tells Pants which targets this rule applies to - a target is eligible only if it has all required_fields. Since only our python_library targets have PythonLibraryRootField, the rule won’t fire for anything else. InferPythonLibraryDependencies is the request class that connects the FieldSet to Pants’ inference mechanism.

The rule itself. The @rule-decorated function is where the work happens. Pants injects AllTargets (every target in the repo) as a rule parameter. We filter it down to python_sources/python_source targets located in root or below it, which skips distributions, tests, and anything else, then return their addresses as inferred dependencies.

Plugging it in. UnionRule registers our request into Pants’ dependency inference system. Without it, the rule function exists but is never triggered.
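One detail of the rule's path filter is easy to miss: it compares against root + "/" rather than root alone. The sketch below isolates that test in plain Python. The mypkg_tools directory is a hypothetical sibling, not part of the example repository.

```python
def is_under(spec_path: str, root: str) -> bool:
    """Same path test as the rule: the root itself, or anything below it."""
    return spec_path == root or spec_path.startswith(root + "/")


def naive_is_under(spec_path: str, root: str) -> bool:
    """Plain prefix test, shown only for contrast."""
    return spec_path.startswith(root)


# Columns: path, result of the rule's test, result of the naive prefix test.
for path in ("mypkg", "mypkg/nested", "mypkg_tools"):
    print(path, is_under(path, "mypkg"), naive_is_under(path, "mypkg"))
```

The naive test would pull mypkg_tools into the wheel just because its name starts with the same characters. The test used in the rule does not.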

Step 3: Register the Plugin

# pants-plugins/python_library/register.py
from . import rules as _rules_module
from .target_types import PythonLibraryTarget


def target_types():
    return [PythonLibraryTarget]


def rules():
    return _rules_module.rules()

register.py is the entry point Pants looks for in every backend listed in backend_packages in pants.toml. The target_types() and rules() hooks return lists that Pants merges with all other backends - this is how python_library becomes available in BUILD files and how the inference rule gets loaded into the engine.
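As a rough mental model of that merging step, consider the toy loader below. It is not how Pants is implemented: the load_backends function and the string placeholders are assumptions made for illustration, and the real loader imports each backend's register module and supports more hooks.

```python
from types import SimpleNamespace


def load_backends(backends):
    """Merge the target_types() and rules() hooks of every backend (toy model)."""
    target_types, rules = [], []
    for register in backends:
        # In this model both hooks are optional; a missing hook contributes nothing.
        target_types.extend(getattr(register, "target_types", list)())
        rules.extend(getattr(register, "rules", list)())
    return target_types, rules


# Hypothetical stand-ins for two `register` modules; strings replace real classes and rules.
core = SimpleNamespace(target_types=lambda: ["python_sources", "python_distribution"])
plugin = SimpleNamespace(
    target_types=lambda: ["python_library"],
    rules=lambda: ["infer_python_library_deps"],
)

merged_types, merged_rules = load_backends([core, plugin])
print(merged_types)  # ['python_sources', 'python_distribution', 'python_library']
print(merged_rules)  # ['infer_python_library_deps']
```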

Step 4: Configure pants.toml

Two new lines are needed - we tell Pants where our plugin source is, and we explicitly load it in the backends list:

[GLOBAL]
pants_version = "2.31.0"
pythonpath = ["%(buildroot)s/pants-plugins"]   # make pants-plugins/ importable

backend_packages = [
  "pants.backend.python",
  "pants.backend.python.lint.black",
  "pants.backend.python.lint.flake8",
  "pants.backend.python.lint.isort",
  "python_library",                             # load our plugin; matches pants-plugins/python_library dir
]

[python]
interpreter_constraints = ["CPython==3.13.*"]

Adding "python_library" to the backends list causes Pants to call python_library.register.target_types() and python_library.register.rules() at startup, registering our new target.
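The backend name and the directory name have to agree, and a typo in either is easy to make. Here is a minimal sketch of a sanity check. The backend_is_importable helper is hypothetical, and it only looks for register.py on disk rather than importing anything.

```python
import tempfile
from pathlib import Path


def backend_is_importable(plugin_root: Path, backend: str) -> bool:
    """True if `<plugin_root>/<backend>/register.py` exists."""
    package_dir = plugin_root.joinpath(*backend.split("."))
    return (package_dir / "register.py").is_file()


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the plugin layout from the article in a throwaway directory.
    plugin_root = Path(tmp) / "pants-plugins"
    (plugin_root / "python_library").mkdir(parents=True)
    (plugin_root / "python_library" / "register.py").touch()

    ok = backend_is_importable(plugin_root, "python_library")
    typo = backend_is_importable(plugin_root, "python_libary")  # deliberate typo

print(ok, typo)  # True False
```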

Step 5: Update the BUILD File

# mypkg/BUILD
python_sources()

python_library(
    name='lib',
    root='mypkg',
)

python_distribution(
    name='wheel',
    dependencies=[':lib'],
    provides=python_artifact(
        name='mypkg',
        version='0.0.1'
    ),
)

:lib is the address of the python_library target in the same BUILD file. When pants package mypkg:wheel runs, Pants resolves :lib’s dependencies by running the inference rule, which filters AllTargets and returns the addresses of the Python source targets belonging to mypkg:mypkg and mypkg/nested:nested. The wheel is then built with both directories included.

Every time a new subdirectory is added with a BUILD containing python_sources(), it appears in AllTargets and is automatically included. The distribution target never needs to change.
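Since a wheel is a zip archive, you can confirm that nested modules made it in by listing its contents. The sketch below builds a stand-in archive so that it runs anywhere. The missing_from_wheel helper and the wheel file name are assumptions; in a real run you would point it at the wheel Pants writes under dist/.

```python
import tempfile
import zipfile
from pathlib import Path


def missing_from_wheel(wheel_path: Path, expected: list[str]) -> list[str]:
    """Return the expected file paths that are absent from the wheel archive."""
    with zipfile.ZipFile(wheel_path) as wheel:
        present = set(wheel.namelist())
    return [name for name in expected if name not in present]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive that mimics a wheel built without the nested package.
    wheel_path = Path(tmp) / "mypkg-0.0.1-py3-none-any.whl"
    with zipfile.ZipFile(wheel_path, "w") as wheel:
        wheel.writestr("mypkg/__init__.py", "")
        wheel.writestr("mypkg/foo.py", "")

    missing = missing_from_wheel(wheel_path, ["mypkg/foo.py", "mypkg/nested/bar.py"])

print(missing)  # ['mypkg/nested/bar.py']
```

An empty list means every expected module is in the wheel.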

Summary

Writing this took more digging than I expected. First there’s the non-obvious limitation that the dependencies field of python_distribution doesn’t accept recursive specs. Then, coming from zero knowledge of Pants internals, the plugin system felt quite complex at first - the docs cover it well, but across several pages, and I couldn’t find a concrete example doing exactly this.

So I wrote this mostly for myself - to have it in one place. Hopefully someone else finds it useful too.

For production deployments where installation time and wheel size matter, you probably want something more targeted (like pex files with automatic dependency pruning) - but that is a different story…

Example Repository

The full working code from this article is available at github.com/kmuszyn/pants-lib-plugin.
