# Reasoning

In language models, reasoning (also known as model thinking) refers to a chain-of-thought (CoT) technique that mirrors human problem-solving through step-by-step analysis. SillyTavern provides several features that make the use of reasoning models more efficient and consistent across supported backends.

## Common issues

  1. When using reasoning models, the model's internal reasoning process consumes part of your response token allowance, even if this reasoning isn't shown in the final output (e.g. o3-mini or Gemini Thinking). If your responses come back incomplete or empty, try raising the Max Response Length setting in the AI Response Configuration panel. Reasoning models typically need significantly higher token limits than standard conversational models, anywhere from 1024 to 4096 tokens. For example, with a 1024-token limit, a model that spends 900 tokens thinking has only 124 tokens left for the visible reply.

## Configuration

Reasoning blocks appear in the chat as collapsible message sections. They can be added manually, automatically by the backend, or through response parsing (see below).

By default, reasoning blocks are collapsed to save space. Click a block to expand and view its contents. You can set blocks to expand automatically by enabling Auto-Expand in the reasoning settings.

When a reasoning block is expanded, you can copy or edit its contents using the Copy and Edit buttons.

Some models support reasoning but will not send their thoughts back. For those models, you can still display a reasoning block (showing the reasoning time) by toggling the Show Hidden setting.

## Adding Reasoning

### Manually

Add a reasoning block to any message through the Message Edit menu. Click the reasoning button while editing to add a reasoning section. Third-party extensions can also add reasoning by writing to the `extra.reasoning` field of the message object before adding it to the chat.
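
A minimal sketch of the extension path, assuming SillyTavern's global `SillyTavern.getContext()` extension API; only `extra.reasoning` is documented above, so the surrounding message fields and the `addOneMessage`/`saveChat` helpers should be treated as assumptions:

```typescript
// Hedged sketch: a third-party extension adding a message that carries
// reasoning. Everything except `extra.reasoning` is an assumption
// about the extension API surface.
declare const SillyTavern: any;

async function addMessageWithReasoning(): Promise<void> {
    const context = SillyTavern.getContext();

    const message = {
        name: context.name2, // speaking character's name (assumed field)
        is_user: false,
        mes: 'Final visible reply.',
        extra: {
            // Contents of the collapsible reasoning block.
            reasoning: 'Step-by-step thinking that will show in the block.',
        },
    };

    context.chat.push(message);     // append to the chat array
    context.addOneMessage(message); // render it (assumed helper)
    await context.saveChat();       // persist the chat (assumed helper)
}
```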

### With a Command

Use the /reasoning-set STscript command to add reasoning to a message. The command takes an `at` named argument (message ID, defaults to the last message) and the reasoning text as its unnamed argument.

```stscript
/reasoning-set at=0 This is the reasoning for the first message.
```

### By Backend

If your chosen LLM backend and model support reasoning output, enabling "Request model reasoning" in the AI Response Configuration panel will add a reasoning block containing the model's thinking process.

Supported sources:

  • Claude
  • DeepSeek
  • Google AI Studio
  • Google Vertex AI
  • OpenRouter
  • xAI (Grok)
  • AI/ML API

"Request model reasoning" does not determine whether a model does reasoning. Claude and Google (2.5 Flash) allow thinking mode to be toggled; see Reasoning Effort.

### By Parsing

Enable "Auto-Parse" in the  Advanced Formatting panel to automatically parse reasoning from the model's output.

The response must contain a reasoning section wrapped in configured Prefix and Suffix sequences. The sequences provided by default correspond to the DeepSeek R1 reasoning format.

Example with prefix `<think>` and suffix `</think>`:

```
<think>
This is the reasoning.
</think>

This is the main content.
```
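
For intuition, a minimal sketch of how prefix/suffix extraction works (illustrative only, not SillyTavern's actual parser; `parseReasoning` is a hypothetical name):

```typescript
// Split a raw model response into a reasoning part and the main content,
// using configurable prefix/suffix sequences (DeepSeek R1 defaults).
function parseReasoning(
    response: string,
    prefix = '<think>',
    suffix = '</think>',
): { reasoning: string; content: string } {
    const start = response.indexOf(prefix);
    const end = response.indexOf(suffix, start + prefix.length);

    // No complete reasoning block: treat the whole response as content.
    if (start === -1 || end === -1) {
        return { reasoning: '', content: response };
    }

    const reasoning = response.slice(start + prefix.length, end).trim();
    const content = (response.slice(0, start) + response.slice(end + suffix.length)).trim();
    return { reasoning, content };
}

// Example: the DeepSeek R1 format shown above.
const raw = '<think>\nThis is the reasoning.\n</think>\n\nThis is the main content.';
console.log(parseReasoning(raw));
// => { reasoning: 'This is the reasoning.', content: 'This is the main content.' }
```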

## Prompting with Reasoning

By default, recognized reasoning block contents are not sent back to the model. To include reasoning in prompts, enable "Add to Prompts" in the Advanced Formatting panel. Reasoning content will be wrapped in configured Prefix and Suffix sequences and separated by a Separator from the main context. The Max Additions numeric setting controls how many reasoning blocks can be included, counting from the end of the prompt.
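
A hedged sketch of how these settings could combine when building the prompt (illustrative only; the option names mirror the UI, and `formatWithReasoning` is a hypothetical helper):

```typescript
// Re-insert stored reasoning into message text for the outgoing prompt,
// honoring Prefix, Suffix, Separator, and Max Additions.
interface ChatMessage { mes: string; extra?: { reasoning?: string } }

function formatWithReasoning(
    messages: ChatMessage[],
    { prefix = '<think>', suffix = '</think>', separator = '\n\n', maxAdditions = 1 } = {},
): string[] {
    // Only the last `maxAdditions` reasoning blocks are included,
    // counting from the end of the prompt.
    let budget = maxAdditions;
    return messages
        .slice()
        .reverse()
        .map((m) => {
            const reasoning = m.extra?.reasoning;
            if (reasoning && budget > 0) {
                budget--;
                return prefix + reasoning + suffix + separator + m.mes;
            }
            return m.mes;
        })
        .reverse();
}
```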

### Continuing from Reasoning

A special case where reasoning can be sent back to the model without "Add to Prompts" enabled is when a generation is continued (e.g. by pressing "Continue" in the Options menu) and the message being continued contains only reasoning without any actual content. This gives the model an opportunity to finish the incomplete reasoning and start generating the main content. The prompt will be sent as follows:

```
<think>
Incomplete reasoning...
```

## Regex Scripts

Regular expression scripts from the Regex extension can be applied to the contents of reasoning blocks. Check "Reasoning" in the "Affects" section of the script editor to target reasoning blocks specifically.

Different ephemerality options affect reasoning blocks in the following ways:

  1. No ephemerality: the reasoning content is permanently changed.
  2. Run on edit: the regex script is re-evaluated when the reasoning block is edited.
  3. Alter chat display: the regex is applied to the reasoning block's display text, not to the underlying content (see the sketch after this list).
  4. Alter outgoing prompts: the regex is applied to reasoning blocks only when they are sent to the model.
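
As a concrete illustration of the "Alter chat display" case, here is a hedged sketch of the transformation such a script might perform on reasoning text (the Regex extension is configured through its UI, not code; this only shows the equivalent effect):

```typescript
// Illustrative only: strip asterisk-wrapped stage directions from the
// displayed reasoning text while leaving the stored content untouched.
const storedReasoning = 'Hmm. *taps chin* The answer must be 391.';

const displayReasoning = storedReasoning
    .replace(/\*[^*]*\*/g, '') // remove *stage directions*
    .replace(/ {2,}/g, ' ')    // collapse doubled spaces left behind
    .trim();

console.log(displayReasoning); // "Hmm. The answer must be 391."
```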

## Reasoning Effort

Reasoning Effort is a Chat Completion setting in the AI Response Configuration panel that influences how many tokens the model may spend on reasoning. The effect of each option depends on the connected source. For the sources below, Auto simply means the relevant parameter is not included in the request.

| Option | Claude (≤ 21333 if no streaming) | OpenAI (keyword) | OpenRouter (keyword) | xAI (Grok) (keyword) | Perplexity (keyword) |
|---|---|---|---|---|---|
| Models | Opus 4, Sonnet 4/3.7 | o4-mini, o3*, o1* | applicable models | grok-3-mini | sonar-deep-research |
| Auto | not specified, no thinking | not specified | not specified, effect depends | not specified | not specified |
| Minimum | budget: 1024 tokens | "low" | "low", or 20% of max response | "low" | "low" |
| Low | 15% of max response, min 1024 | "low" | "low", or 20% of max response | "low" | "low" |
| Medium | 25% of max response, min 1024 | "medium" | "medium", or 50% of max response | "low" | "medium" |
| High | 50% of max response, min 1024 | "high" | "high", or 80% of max response | "high" | "high" |
| Maximum | 95% of max response, min 1024 | "high" | "high", or 80% of max response | "high" | "high" |
  • For Claude, the budget is capped at 21333 tokens if streaming is disabled. If the calculated budget would be less than 1024, the max response is changed to 2048.
  • For OpenRouter, Perplexity, and the AI/ML API, only an OpenAI-style keyword is sent.
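
The Claude column reduces to simple arithmetic; here is a hedged sketch mirroring the documented percentages (not SillyTavern's actual code; `claudeThinkingBudget` is a hypothetical name):

```typescript
// Illustrative arithmetic for the Claude column above.
type Effort = 'min' | 'low' | 'medium' | 'high' | 'max';

function claudeThinkingBudget(maxResponse: number, effort: Effort, streaming: boolean): number {
    if (effort === 'min') return 1024; // fixed minimum budget

    const ratio = { low: 0.15, medium: 0.25, high: 0.5, max: 0.95 }[effort];
    let budget = Math.max(Math.floor(maxResponse * ratio), 1024); // min 1024

    if (!streaming) budget = Math.min(budget, 21333); // cap without streaming
    return budget;
}

console.log(claudeThinkingBudget(4096, 'medium', true)); // 1024 (25% of 4096)
```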

For Google AI Studio and Vertex AI, the thinking budgets are as follows:

| Model | Auto (dynamic thinking) | Minimum | Low | Medium | High | Maximum |
|---|---|---|---|---|---|---|
| 2.5 Pro | `thinkingBudget = -1` | 128 | 15% of max response, min 128 | 25% of max | 50% of max | lower of max or 32768 |
| 2.5 Flash | `thinkingBudget = -1` | 0, no thinking | 15% of max response | 25% of max | 50% of max | lower of max or 24576 |
| 2.5 Flash Lite | `thinkingBudget = -1` | 0, no thinking | 15% of max response, min 512 | 25% of max | 50% of max | lower of max or 24576 |
  • For Gemini 2.5 Pro and 2.5 Flash/Lite, the budget is capped at 32768 or 24576 tokens respectively, regardless of the streaming setting.
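
For reference, here is how a thinking budget appears in a raw Google AI Studio request body; the field names follow the public Gemini API, and SillyTavern constructs this request for you:

```typescript
// Sketch of a Gemini generateContent request body with thinking enabled.
// `thinkingBudget: -1` corresponds to the "Auto" (dynamic thinking) row.
const requestBody = {
    contents: [{ role: 'user', parts: [{ text: 'What is 17 times 23?' }] }],
    generationConfig: {
        maxOutputTokens: 4096,
        thinkingConfig: {
            thinkingBudget: -1,    // -1 = dynamic; 0 disables thinking on Flash models
            includeThoughts: true, // ask for thought summaries in the response
        },
    },
};
```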