Model-Per-Mode Architecture
There is a default answer that is almost always wrong: "use the best model for everything."
It sounds responsible. Use the smartest, most capable model. Pay for the top tier. Be done with the decision.
It is wrong because "the best model" is not a property of a model. It is a property of a (model, task, mode) triple. The model that excels at code generation may be too eager in workshop mode. The model that holds workshop discipline beautifully may be too slow for action mode. The model with the largest context window may have the wrong gradient temperature for review work. The cheapest model may be exactly right for the mode you are actually in.
This workshop is about what model selection is really choosing between. The applied lab takes your real modes and your real available models and builds a routing matrix you actually consult at mode-switch time. You will not leave with the best model identified. You will leave with the understanding that "best" depends on the mode and the trade you are willing to make.
The outside name for this problem
The broader industry calls this LLM routing. Routing systems choose a model based on the task, cost, latency, quality requirement, and sometimes user intent. IBM calls this frugal routing when smaller or specialized models are tried before larger expensive ones. AWS, NVIDIA, and routing startups all describe the same basic direction: stop sending every request to one default model.
JKE's version is operator-mode routing. It is not just API economics. It asks: what mode are we in, what failure would hurt, how much context is needed, and what model temperament matches the work?
The five axes that matter
When you compare two models, you are comparing them along at least five axes. Pretending the comparison is one-dimensional is the root error.
Gradient temperature. How hard does the model push toward completion? Some models sit still naturally — they ask, they pause, they hold open questions. Others are hot — they want to act. Hot is great for execution; bad for workshop. Cool is great for workshop; sometimes too sluggish for action.
Cost per token. Input cost and output cost, separately. Some models are cheap to send tokens to but expensive to generate from. Some are the reverse. The mode determines which matters.
Context window. How many tokens can the model hold? A 1M-token window is wasted on a mode that only ever uses 20K tokens. A 128K-token window is too small for a synthesis mode that consumes 500K tokens of context.
Known strengths per mode. Code reasoning. Long-form drafting. Mathematical accuracy. Tool use. Multi-step planning. Each model has a profile across these.
Known weaknesses per mode. The flip side. Where does this model under-perform? In which modes does its gradient temperature mismatch the work?
A routing decision that weighs only one axis is a routing decision you will revisit when the mode changes.
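The five axes above can be written down as a small record per model, which makes the multi-dimensional comparison explicit. The following is a minimal sketch, not a definitive implementation: the `ModelProfile` fields, the helper names, the "oversized" threshold of ten times the need, and the example models and prices are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """One model described along the five axes (field names are illustrative)."""
    name: str
    temperature: str          # gradient temperature: "hot", "warm", or "cool"
    usd_per_m_input: float    # cost per million input tokens
    usd_per_m_output: float   # cost per million output tokens
    context_window: int       # tokens
    strengths: tuple = ()
    weaknesses: tuple = ()


def session_cost(model, input_tokens, output_tokens):
    """Input and output are priced separately, so both volumes matter."""
    total = (input_tokens * model.usd_per_m_input
             + output_tokens * model.usd_per_m_output)
    return total / 1_000_000


def context_fit(model, tokens_needed, oversize_factor=10):
    """Classify the window against what the mode actually consumes.

    The oversize_factor threshold is an arbitrary assumption.
    """
    if tokens_needed > model.context_window:
        return "too small"
    if tokens_needed * oversize_factor < model.context_window:
        return "oversized"
    return "fits"


# Hypothetical models with placeholder prices, not real quotes.
model_a = ModelProfile("model-a", "hot", 3.0, 15.0, 128_000,
                       strengths=("tool use",), weaknesses=("workshop",))
model_b = ModelProfile("model-b", "cool", 1.0, 5.0, 1_000_000)

cost_a = session_cost(model_a, 200_000, 10_000)   # 0.75 USD
fit_a = context_fit(model_a, 500_000)             # "too small"
fit_b = context_fit(model_b, 20_000)              # "oversized"
```

A read-heavy mode makes the input price dominate, while a drafting-heavy mode makes the output price dominate, which is why the two are kept as separate fields.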
What "mode" actually means
Mode is a band of work with shared assumptions about what the agent should and should not do.
Action mode. Execute. Ship. The operator has authorized the work. The agent's job is to complete it cleanly. Gradient temperature: hot is fine, even useful. Context: usually small. Cost: depends on volume.
Workshop mode. Think. Argue. Stay open. The operator is exploring; the agent is not yet authorized to act. Gradient temperature: cool. Hot models leak action behavior into workshop. Context: often large (because you're considering many possibilities).
Research mode. Search. Compare. Survey. The operator wants outside information assembled. Gradient temperature: medium. Context: large. Tool use: critical.
Drafting mode. Produce a creative or written artifact. Gradient temperature: cool with bursts of execution. Context: medium. The lineup test (Course 201) usually applies.
Review mode. Evaluate someone else's (or your own prior session's) work. Gradient temperature: cool. Context: medium. The validation trap (Course 202) usually applies.
Code-build mode. Heavy multi-file engineering. Gradient temperature: matters per sub-task. Context: usually large. Tool use: critical.
Your operation may have other modes. The point is not the list. The point is that modes have profiles, and profiles imply model preferences.
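The mode profiles listed above can be captured as data so they can be compared against model profiles later. This is a sketch under assumptions: the `ModeProfile` type and its field names are invented for illustration, research mode's "medium" is written as "warm" to line up with a hot / warm / cool scale, and code-build is marked "varies" because its temperature depends on the sub-task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeProfile:
    """A band of work and the profile it implies (field names are illustrative)."""
    name: str
    temperature: str              # preferred gradient temperature
    context: str                  # "small", "medium", or "large"
    tool_use_critical: bool = False


# Values transcribed from the mode descriptions; extend with your own modes.
MODES = {
    "action": ModeProfile("action", "hot", "small"),
    "workshop": ModeProfile("workshop", "cool", "large"),
    "research": ModeProfile("research", "warm", "large", tool_use_critical=True),
    "drafting": ModeProfile("drafting", "cool", "medium"),
    "review": ModeProfile("review", "cool", "medium"),
    "code-build": ModeProfile("code-build", "varies", "large", tool_use_critical=True),
}


def modes_where(predicate):
    """Return the names of modes whose profile satisfies the predicate, sorted."""
    return sorted(name for name, profile in MODES.items() if predicate(profile))


tool_modes = modes_where(lambda p: p.tool_use_critical)
cool_modes = modes_where(lambda p: p.temperature == "cool")
```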
The mechanism — why one model fails across modes
The gradient temperature of a model is set in training. RLHF reward functions push the model toward certain response patterns: eager, decisive, helpful, complete. Different training regimes produce different temperatures. The temperature is not a setting, and it is not the sampling temperature parameter you pass with a request; it is a property of the weights.
A hot model in workshop mode will keep trying to act. The operator will keep recalibrating. The recalibrations will hold for a few turns, then the temperature reasserts. You can write rules, gates, mode-switching protocols — they all fight the temperature, and the temperature is structural.
A cool model in action mode will pause, ask, hedge. The operator will keep saying "yes, go, just do it." The pauses hold the work back. You can write rules pushing for execution — they help, but they fight against the cool temperature, and the temperature is structural.
The right answer is not to fight the temperature. The right answer is to choose a model whose temperature matches the mode. That requires the matrix.
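Matching temperature to mode can be sketched as a distance on an ordinal scale. This is only an illustration: the three-step scale, the numeric values, and the function names are assumptions, and a real judgment of temperature comes from running the model in the mode, not from a label.

```python
# Ordinal scale for gradient temperature; the numeric values are an assumption.
TEMPERATURE_SCALE = {"cool": 0, "warm": 1, "hot": 2}


def temperature_mismatch(model_temp, mode_temp):
    """0 means the model's temperature matches the mode; 2 is the worst case."""
    return abs(TEMPERATURE_SCALE[model_temp] - TEMPERATURE_SCALE[mode_temp])


def closest_model(model_temps, mode_temp):
    """Pick the model whose temperature is closest to what the mode wants.

    model_temps maps a model name to its temperature label. Ties go to the
    first model in the mapping.
    """
    return min(model_temps,
               key=lambda name: temperature_mismatch(model_temps[name], mode_temp))


# Hypothetical models.
available = {"model-a": "hot", "model-b": "cool", "model-c": "warm"}

workshop_pick = closest_model(available, "cool")   # "model-b"
action_pick = closest_model(available, "hot")      # "model-a"
```

Temperature is one axis of five, so a pick like this is an input to the matrix, not a routing decision on its own.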
The scar tissue — model defaults misalign
The pattern: an operator picks one model as default. That model is great at one mode. The operator runs the agent in that mode and is delighted. The operator switches modes for a different task. The same model is now mismatched. The operator notices the agent feels "off" but cannot point to a file change.
Because there wasn't one. The model is the same. The mode is different. The mismatch is what the operator is feeling.
Without a routing matrix, the operator has two options: tolerate the mismatch, or rewrite the system prompt at every mode switch. Both are expensive over time. The matrix is the cheaper option once it exists.
The exercise: build your matrix
Take a sheet of paper or a fresh file. Two axes.
Rows: the modes you actually run. Be honest. If you only ever run action and code-build, that is two rows. Don't list aspirational modes.
Columns: the models you actually have access to. Again, honest. Don't list models you cannot afford to use at volume.
For each cell, fill in:
- Gradient temperature (hot / warm / cool).
- Cost per million tokens (in / out).
- Context window.
- One sentence: strength in this mode.
- One sentence: weakness in this mode.
The matrix will reveal something. Some cells will be obviously right. Some will be obviously wrong. The interesting cells are the ones that are "tolerable" — where you've been routing by default and the choice was almost right but not quite.
Those are the cells where the matrix changes your behavior.
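If the matrix lives in a file rather than on paper, it can be a mapping from (mode, model) to a cell. The sketch below assumes a `Cell` record with the five entries from the exercise plus a `fit` label ("right", "tolerable", "wrong") that the operator assigns by hand; that label, the model names, and every number are placeholders, not data from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    temperature: str       # "hot", "warm", or "cool"
    usd_per_m_in: float    # cost per million input tokens
    usd_per_m_out: float   # cost per million output tokens
    context_window: int    # tokens
    strength: str          # one sentence: strength in this mode
    weakness: str          # one sentence: weakness in this mode
    fit: str               # assumption: operator's verdict for this cell


# Rows are modes you actually run; columns are models you actually have.
matrix = {
    ("action", "model-a"): Cell("hot", 3.0, 15.0, 200_000,
                                "Finishes tasks without stalling.",
                                "Can overreach the stated scope.", "right"),
    ("workshop", "model-a"): Cell("hot", 3.0, 15.0, 200_000,
                                  "Generates options quickly.",
                                  "Drifts into acting on them.", "tolerable"),
    ("action", "model-b"): Cell("cool", 1.0, 5.0, 128_000,
                                "Cheap at volume.",
                                "Pauses to ask when it should execute.", "wrong"),
    ("workshop", "model-b"): Cell("cool", 1.0, 5.0, 128_000,
                                  "Holds questions open.",
                                  "Smaller window for large explorations.", "right"),
}


def cells_to_revisit(matrix):
    """The 'tolerable' cells: routed by default, almost right but not quite."""
    return sorted(key for key, cell in matrix.items() if cell.fit == "tolerable")


revisit = cells_to_revisit(matrix)   # [("workshop", "model-a")]
```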
The new-model test plan
When a new model becomes available, do not adopt it as default.
Run it through your actual modes. The same prompts you'd use in real work, run in each mode, with the new model.
- Action sample: does it execute cleanly without leaking workshop behavior?
- Workshop sample: does it hold workshop without drifting into action?
- Research sample: does it search rather than fabricate?
- Review sample: does it surface real issues, or does it pleasantly agree?
Log results. Update the matrix. Propose routing changes if any cell improved.
The test plan is the discipline that prevents "new model is shiny, adopt as default" from corrupting your routing.
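The test plan can be sketched as a small harness that runs one real prompt per mode and logs the output next to the question the operator has to answer. Everything here is an assumption for illustration: `call_model` is a hypothetical callable you would replace with your own client, `echo_model` is a stand-in so the sketch runs, and the verdict is deliberately left empty because judging the output is the operator's job.

```python
# The question to answer for each mode sample, taken from the test plan.
CHECKS = {
    "action": "Does it execute cleanly without leaking workshop behavior?",
    "workshop": "Does it hold workshop without drifting into action?",
    "research": "Does it search rather than fabricate?",
    "review": "Does it surface real issues, or does it pleasantly agree?",
}


def run_test_plan(model_name, call_model, samples):
    """Run each mode's sample prompt through the candidate model and log it.

    call_model(mode, prompt) -> str is a hypothetical hook for a real client.
    samples maps a mode name to a prompt you would use in real work.
    """
    results = []
    for mode, prompt in samples.items():
        results.append({
            "model": model_name,
            "mode": mode,
            "question": CHECKS.get(mode, ""),
            "prompt": prompt,
            "output": call_model(mode, prompt),
            "verdict": None,   # filled in by the operator after reading the output
        })
    return results


def echo_model(mode, prompt):
    """Stand-in for a real model call so the sketch is runnable."""
    return f"[{mode}] response to: {prompt}"


samples = {
    "action": "Rename the config key and update every caller.",
    "review": "Review yesterday's migration plan for gaps.",
}
log = run_test_plan("new-model", echo_model, samples)
```

Only after the verdicts are filled in does the matrix get updated, which keeps the candidate from becoming the default by novelty alone.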
The tinkering questions
- Pick a recent mode switch you made. Did you switch models too? Should you have?
- Find a mode you operate in but rarely think about model choice for. What is your current default doing in that mode?
- Imagine the cheapest model on your list. Which modes could it actually handle? You might be over-paying for some routine work.
- Imagine the most expensive model on your list. Which modes does it not actually improve? You might be over-paying for non-improvement.
- Notice the gradient temperature of your current session. Does it match the mode you're in?
Where the human owns the judgment
The agent builds the matrix. The agent fills the cells. The agent proposes a routing rule per mode. The operator confirms the routing — because the operator knows the actual stakes per mode that the matrix cannot encode.
The agent does not autonomously swap models mid-session. Model swaps are operator decisions, made deliberately, at mode-switch time. The agent surfaces the recommendation; the operator authorizes.
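The split between recommending and authorizing can be made explicit in code. In this sketch the agent looks up a routing rule at mode-switch time, but the model only changes if an operator callback approves. `RoutingRule`, `Session`, `switch_mode`, and the `operator_approves` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingRule:
    mode: str
    primary: str
    fallback: Optional[str] = None   # used if the primary turns out wrong


@dataclass
class Session:
    mode: str
    model: str


def switch_mode(session, new_mode, rules, operator_approves):
    """Change mode; change model only with explicit operator approval.

    operator_approves(session, rule) -> bool stands in for however the
    operator confirms (a prompt, a flag, a review step).
    """
    rule = rules[new_mode]
    session.mode = new_mode
    if rule.primary != session.model and operator_approves(session, rule):
        session.model = rule.primary
    return rule   # the recommendation is surfaced either way


# Hypothetical rules and model names.
rules = {
    "action": RoutingRule("action", primary="model-a"),
    "workshop": RoutingRule("workshop", primary="model-b", fallback="model-a"),
}

declined = Session(mode="action", model="model-a")
switch_mode(declined, "workshop", rules, lambda s, r: False)
# declined.mode == "workshop", declined.model == "model-a"

approved = Session(mode="action", model="model-a")
rule = switch_mode(approved, "workshop", rules, lambda s, r: True)
# approved.model == "model-b", rule.fallback == "model-a"
```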
The essay as a context download
Model-per-mode should become a short essay the operator can point the agent at when the session changes shape.
Create work/model-routing-essay.md: a concise explanation of LLM routing, mode, cost, latency, quality, context, and failure consequence. When the agent is using the wrong model by habit, the operator can say: "Read the model-routing essay. What mode are we in, and what model does this mode deserve?"
The essay turns model choice from preference into routing discipline.
What to track
Keep a routing postmortem journal. Each time the matrix is consulted or a model is swapped, record:
- Mode entered.
- Model chosen.
- Routing rule that informed the choice.
- What worked.
- What did not.
- Whether the model felt mismatched to the mode.
- What the operator would change next time.
- Any new model that came up for evaluation.
The journal will show you over time which routings actually hold and which were aspirational.
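One low-friction way to keep such a journal is an append-only file with one JSON record per line. The sketch below is an assumption about format, not a prescribed one: the field names mirror the items to record, and the file path, function names, and added timestamp are illustrative choices.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

# One field per item in the journal list above (names are illustrative).
FIELDS = (
    "mode_entered", "model_chosen", "routing_rule", "what_worked",
    "what_did_not", "felt_mismatched", "change_next_time",
    "new_models_to_evaluate",
)


def append_entry(journal_path, **entry):
    """Append one record as a JSON line; refuse incomplete entries."""
    missing = [field for field in FIELDS if field not in entry]
    if missing:
        raise ValueError(f"missing journal fields: {missing}")
    record = {"timestamp": datetime.now(timezone.utc).isoformat()}
    record.update({field: entry[field] for field in FIELDS})
    with Path(journal_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def mismatch_counts(journal_path):
    """Count entries flagged as mismatched, per (mode, model) pair."""
    counts = Counter()
    for line in Path(journal_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["felt_mismatched"]:
            counts[(record["mode_entered"], record["model_chosen"])] += 1
    return counts
```

A call such as `append_entry("work/routing-journal.jsonl", mode_entered="workshop", ...)` (the path is hypothetical) adds one line per consultation, and `mismatch_counts` shows which routings keep feeling wrong over time.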
Working conclusion
There is no "best model" without a mode. There is only the right model for this mode at this stake. The matrix surfaces the trade. The operator decides. The new-model test plan keeps the matrix honest.
After this course, when you switch modes, you ask: what does this mode need? What model fits? Do I have a fallback if the primary is wrong? The default of "use the best model" is replaced with a routing decision you can actually defend.
Your Agent PDF
Your agent executes the PDF. You read the page. No copying. No manual setup.
Download Agent PDF: Course 208
Your agent PDF is sent to the email used at checkout. If you have not received it, contact [email protected] with your order confirmation.