(I've been getting solid results recently from simply telling Claude Code and Codex "Test with uv run pytest, use red/green TDD".)
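For anyone unfamiliar with the red/green wording, here is a minimal sketch of one cycle. The `slugify` function and both test names are hypothetical, made up purely for illustration, and in a real project the tests and the implementation would live in separate files.

```python
# Step 1 (red): write the tests first. Running `uv run pytest` at this point
# fails, because `slugify` (a hypothetical function) does not exist yet.
def test_slugify_lowercases_and_joins_with_hyphens() -> None:
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_extra_whitespace() -> None:
    assert slugify("  Red   Green  ") == "red-green"


# Step 2 (green): add the simplest implementation that makes the tests pass.
def slugify(title: str) -> str:
    """Lowercase ``title`` and join its whitespace-separated words with hyphens."""
    return "-".join(title.lower().split())
```

The point of seeing the failure first is that it shows the test can actually fail, which guards against tests that pass no matter what the code does.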
# Python Tooling
- Use `uv` to manage Python environments and dependencies.
- Use `uv run` to execute Python scripts and commands.
- Use `pytest` for testing your code.
- Use the `hypothesis` library for property-based testing when you have complex input spaces or need to test edge cases.
- Don't edit `pyproject.toml` directly. Instead, use `uv add` and `uv add --dev` to manage dependencies.
- Use `ruff`, `ty`, `prek`, and `wily` for code quality and linting.
- Don't use excessive casting. If you find yourself needing to cast types frequently, consider refactoring your code to use more appropriate types. Casting should only be done in boundary layers where you are interfacing with external systems.
- Run appropriate tooling after making changes to your code to ensure it meets quality standards.
- When you come across a bug or regression, think hard about writing a test for it and about how to structure the code so that it cannot happen again in the future.
- When creating a command line interface, add a `--verbose` flag that provides logging output useful for debugging issues.
- Before creating code, brainstorm 5 different approaches to solve the problem and sort them by their probable effectiveness. Then, choose the best approach and implement it.
- Use Test Driven Development (TDD) for all code you write. Write tests before writing the implementation code.
- Collect pytest fixtures in a `conftest.py` file to avoid duplication.
- Prefer testing real code where possible. Use doubles and `monkeypatch` only when absolutely necessary. Try to avoid mocking as much as possible.
- Favor pytest's `monkeypatch` over `mock`.
- When a test fails, run the last failed test first using `uv run pytest --last-failed`.
- Use numpy-style docstrings for all functions and classes you create.
- Include doctests in the docstrings of your functions to provide examples.
- Use type hints for all function parameters and return types.
- Use logging to provide insight into failures. Don't use print for debugging. Don't use logging to hide stack traces.

As a personal anecdote, I find that a lot of big prompts and skills use up context window budget and in many cases agents will eagerly try to use a skill even if it isn't super relevant or necessary for the current task. So when I have too many skills I have to spend a bunch of time toggling the checkboxes to figure out which ones are needed for the task at hand before starting...
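For what it's worth, here is a minimal sketch of how several rules from the Python Tooling list above might look together: type hints, a numpy-style docstring with doctests, a `--verbose` flag, and logging that keeps the stack trace. `parse_ratio` and `main` are hypothetical names chosen for illustration, not something from the original list.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def parse_ratio(text: str) -> float:
    """Parse a ratio written as ``"a/b"`` into a float.

    Parameters
    ----------
    text : str
        Ratio in the form ``"numerator/denominator"``.

    Returns
    -------
    float
        The numerator divided by the denominator.

    Raises
    ------
    ValueError
        If ``text`` is not two numbers separated by a single ``/``.
    ZeroDivisionError
        If the denominator is zero.

    Examples
    --------
    >>> parse_ratio("3/4")
    0.75
    >>> parse_ratio("1/8")
    0.125
    """
    numerator, denominator = text.split("/")
    return float(numerator) / float(denominator)


def main(argv: list[str]) -> int:
    """Run the hypothetical CLI and return a process exit code."""
    parser = argparse.ArgumentParser(description="Parse a ratio such as 3/4.")
    parser.add_argument("ratio")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    args = parser.parse_args(argv)

    # --verbose switches the log level; debugging output goes through logging.
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.WARNING)
    logger.debug("parsing %r", args.ratio)

    try:
        print(parse_ratio(args.ratio))  # program output, not debug output
    except (ValueError, ZeroDivisionError):
        # logger.exception records the full traceback instead of hiding it.
        logger.exception("could not parse %r", args.ratio)
        return 1
    return 0
```

Wired up as a script, this would be called as `main(sys.argv[1:])`, and the doctests could be exercised with something like `uv run pytest --doctest-modules`.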
After trying out TDD, I find the waterfall approach is better, especially when you have a multi-agent setup. Also, I found that in some cases the tests were just superficial hallucinations that never actually tested the components written, or there was some context corruption that ultimately triggered a false positive and kicked off a completely unintentional refactoring.
Crazy times here in the development world. I'm always curious to watch others' best practices.
Almost all the breakages after a big refactor are stale assertions, but every time I catch a couple of critical problems that make the entire exercise well worth it.
The whole dev process is so fast compared to writing software manually that I find it absurd that I wouldn’t invest heavily in automated tests.
TL;DR: it found that test-writing volume only weakly correlates with success, and that encoding test-writing principles did not move resolution rates but _did_ materially change cost. Encouraging tests cost +19.8% output tokens for 0% gain; discouraging them saved 33–49% input tokens for ≤2.6pp accuracy loss. Separately, imposing the TDD procedure specifically seems like it can backfire: it actually _increased_ regressions from 6.08% to 9.94%.
IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.
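To make the "oracle" idea concrete, here is a rough sketch of a generate-then-verify loop. Everything in it is an assumption for illustration: `generate` stands in for whatever model call you use, the tests are plain `assert` statements run in a subprocess rather than a real pytest suite, and the retry limit is arbitrary.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_oracle(candidate: str, tests: str) -> tuple[bool, str]:
    """Run ``tests`` against ``candidate`` in a subprocess; return (passed, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate + "\n\n" + tests)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=30,
        )
    return result.returncode == 0, result.stderr


def generate_with_oracle(
    generate: Callable[[str], str],  # hypothetical model call: feedback -> code
    tests: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code until the tests pass, feeding failures back each round."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate(feedback)
        passed, output = run_oracle(candidate, tests)
        if passed:
            return candidate
        feedback = output  # the failing traceback is the self-correction signal
    return None
```

The tests here are fixed before generation starts, so the model only ever sees their failures and never gets to rewrite them to pass.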
I have to push back on the idea that token costs balloon when using TDD within the context of a strong framework such as Jason has laid out here.
If the feature is repurposed/removed/refactored....I'd argue the specification wasn't well thought out prior to burning into tokens.
We're so eager to do a lot of the wrong things quickly, when it may serve us better to do a more precise thing slowly.
And the code will be good.
If this is encoded in a skill, that skill essentially has to be loaded for everything your LLM is doing. This is probably one of the few areas where direct instructions via AGENTS.md are best, and I don't believe it requires much direction here to force the issue.
But I think the OP is just trying to have their agent work in a very specific way -- that is fine too.
> 5. Show me the test and ask for approval before continuing
But everybody is free to choose how they work and it may be required in ways that we can't know about.
The latest one is with "Uncle Bob Martin" who has some interesting takes on coding with AI from .... can I say an oldie?
Even more so when coding with agents. I think it is probably the biggest lever to keep AI in guardrails.
(It's also why I wrote my latest book, Effective Testing, because I routinely find that my clients are very poor at testing.)
All of this burns more tokens of course, but probably way less than coming back to the code later to fix bugs. It is also slower, but in the long run saves time.
Skills are literally just Markdown documents that get loaded into context when the /skill-name is invoked.
They are being sold as more powerful than they are. Like LLMs are intelligent blank slates that can be customized with mere markdown files.
Taken to the extreme, the attitude that there is some special incantation that will unlock all capabilities is silly, and a lot of the "prompt engineering" discourse is similarly kind of dumb, but in-context learning is clearly a real thing.
You are treating skills like a sure thing.
The token cost and tech debt introduced by tests are just not worth it. There are usually no bugs, and if there are, you can fix them quickly if and when it's needed.
Testing was and still is very important, as LLMs can still miss important points in business logic or other edge cases. I would argue that tests have become as important as code, if not more.