Image Captioning allows SillyTavern to automatically generate text descriptions for images used in chats.
Use Image Captioning when you want your AI character to "see" and respond to visual content in your conversations.
- Create captions for images you upload or paste into messages
- Add context to existing images in the chat history
- Use various sources for generation, including local models, cloud APIs, and crowdsourced networks
There are options that require no setup, no money, and no GPU. There are also options that require some or all of those things. Choose the one that fits your needs and resources.
The image captioning extension is built into SillyTavern and does not need to be installed separately.
#Quick start
- Set up:
- Open the Image Captioning panel in the Extensions panel
- Choose a captioning source (most likely "Local" or "Multimodal")
- For "Multimodal" ensure you've set up the connection in the API Connections tab
- Generate a caption:
- Choose "Generate Caption" from the Extensions popup menu
- Select an image file when prompted
- Wait for the caption to be generated
- Review and send:
- The captioned image will be inserted into your message
- See the caption using the image tooltip
- Click Send to see what your character thinks of the image!
#Panel controls
#Source Selection
Choose the source for image captioning. Supported options: Local, Extras, Horde, and Multimodal (each is described below).
#Caption Prompt
- Caption Prompt: Enter a custom prompt for captioning. The default prompt is "What's in this image?"
- Ask every time: Toggle to request a custom prompt for each image caption
#Message Template
- Message Template: Customize the caption message template. Use the `{{caption}}` macro to insert the generated caption. The default template is `[{{user}} sends {{char}} a picture that contains: {{caption}}]`
- Automatically caption images: Toggle to enable automatic captioning of images pasted or attached to messages
- Edit captions before saving: Toggle to allow editing captions before they are saved
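For example, a purely illustrative custom template (any wording works, as long as it includes the `{{caption}}` macro):

```
[{{char}} receives an image from {{user}}. The image contains: {{caption}}]
```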
All the ways to caption images in SillyTavern:
- Choose "Generate Caption" from the Extensions popup menu and select an image file when prompted
- Click the Caption icon at the top of an image already in a message
- Paste an image directly into the chat input with auto-captioning enabled
- Attach an image file to a message using the Embed File or Image button in the message actions menu
- Send a message with an embedded image
- Use the `/caption` slash command
The auto-captioning feature allows you to automatically generate captions for images as they are added to the chat, without manually triggering the captioning process each time.
To enable, select the "Automatically caption images" checkbox in the Image Captioning panel. You can also choose to edit captions before they are saved by checking the "Edit captions before saving" box.
Once enabled, auto-captioning will trigger in the following scenarios:
- When an image is pasted directly into the chat input.
- When an image file is attached to a message.
- When a message with an embedded image is sent.
The system will use your selected captioning source (Local, Extras, Horde, or Multimodal) and the configured settings to generate a caption for the image.
If you've enabled the "Edit captions before saving" option:
- After an image is added, a popup will appear with the generated caption.
- You can review and edit the caption as needed.
- Click "OK" to apply the caption, or "Cancel" to discard the caption without saving.
The generated (and optionally edited) caption will be automatically inserted into the prompt using the Message Template you've configured. By default, it will be sent in this format:
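```
[{{user}} sends {{char}} a picture that contains: {{caption}}]
```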
The extension provides a `/caption` slash command to use in the chatbox or in scripts.
#Usage
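A sketch of the general form (assuming the usual STscript convention that named arguments come before the unnamed prompt):

```
/caption [quiet=true|false]? [mesId=number]? [prompt]
```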
- `prompt` (optional): A custom prompt for the captioning model. Only supported by multimodal sources.
- `quiet=true|false`: If set to true, suppresses sending a captioned message to the chat. Default is false.
- `mesId=number`: Specifies a message ID to caption an image from an existing message instead of uploading a new one.

If no `mesId` is provided, the command will prompt you to upload an image. When `quiet` is false (default), a new message with the captioned image will be sent to the chat. The generated caption can be used as input for other commands.
#Examples
Caption a new image with the default settings:
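```
/caption
```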
Caption a new image with a custom prompt:
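```
/caption Describe this image in as much detail as possible
```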
Caption an image from message #5 without sending a new message:
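```
/caption mesId=5 quiet=true
```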
Caption an image from message #10 with a custom prompt then generate a new image based on the caption:
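This assumes the Image Generation extension is enabled (it provides the `/imagine` command); the `{{pipe}}` macro carries the caption from the previous command:

```
/caption mesId=10 Describe this image in detail | /imagine {{pipe}}
```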
#Local source
You can change the model in `config.yaml`. The key is called `extras.captioningModel` because reasons. Enter the Hugging Face model ID you want to use. The default is `Xenova/vit-gpt2-image-captioning`.
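For example, a sketch of the relevant `config.yaml` entry (assuming the key lives under an `extras` section, per its name):

```yaml
extras:
  captioningModel: Xenova/vit-gpt2-image-captioning
```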
You can use any model that supports image captioning (`VisionEncoderDecoderModel` or "image-to-text" pipeline). The model needs to be compatible with the transformers.js library; that is, it needs ONNX weights. Look for models with the `ONNX` and `image-to-text` tags, or that have a folder called `onnx` full of `.onnx` files.
#Multimodal source
#General configuration
- Model: Choose the model for image captioning. Options vary based on the selected API.
- Allow reverse proxy: Toggle to allow using a reverse proxy if defined and valid (OpenAI, Anthropic, Google, Mistral)
API keys and endpoint URLs for captioning sources are managed in the API Connections panel. Set the connection up in API Connections first, then select it as your caption source in Captioning.
One last time: configure the API key/address/port in API Connections, then use that connection in Captioning.
You can still use Claude for chats and Google AI Studio for image captioning, or whatever. Just set them both up in the 'API Connections' tab first. Then flip your Chat Completion source to Claude and your Captioning source to Google AI Studio.
For most local backends, you will need to set some options in the model backend rather than in SillyTavern. If your backend can only run one model at a time and doesn't support automatic switching, you are unfortunately going to have a hard time using the same backend for chat and captioning with different models.
Even if you run two instances of the backend on different ports, API Connections only allows one active configuration per backend type. But what if I told you... that you can probably connect to your backend in both Text Completion and Chat Completion modes? Now you can have two connections to the same backend type.
#Sources
To use one of these caption sources, select Multimodal in the Source dropdown.
- "I want the best captioning possible, and I don't mind paying for it": Anthropic
- "I don't want to pay anything or run anything": Google AI Studio free tier
- "I want to caption images locally and have it just work": Ollama
- "I want to keep the dream of local AI alive": KoboldCpp
- "I want to complain when it doesn't work":
Extras
#KoboldCpp
For general information on installing and using KoboldCpp, see the KoboldCpp documentation.
To use KoboldCpp for multimodal captioning:
- Get a multimodal-capable model, trained to process text and image prompts at the same time.
- Also get the multimodal projections for the model. These weights allow the model to understand how the text and image parts of the input relate to each other.
- Load the model and projections in the KoboldCpp launch GUI or command-line interface.
The original and classic local multimodal model is LLaVA. GGUF-format files for the model and projections are available from Mozilla/llava-v1.5-7b-llamafile. To load them from the command line, set the model and projections with the `--model` and `--mmproj` flags. For example:
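A sketch of the launch command; the exact binary name and file names depend on your platform and the quantization you download (these follow the Mozilla/llava-v1.5-7b-llamafile naming):

```
koboldcpp --model llava-v1.5-7b-Q4_K.gguf --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf
```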
Some LLaVA finetunes you can try: xtuner/llava-llama-3-8b-v1_1-gguf, xtuner/llava-phi-3-mini-gguf.
You can use multimodal projections for the base model that your particular finetune was built from. Projections for some common base models are available from koboldcpp/mmproj.