# Image Captioning

Image Captioning allows SillyTavern to automatically generate text descriptions for images used in chats.

Use Image Captioning when you want your AI character to "see" and respond to visual content in your conversations.

  • Create captions for images you upload or paste into messages
  • Add context to existing images in the chat history
  • Use various sources for generation, including local models, cloud APIs, and crowdsourced networks

There are options that require no setup, no money, and no GPU. There are also options that require some or all of those things. Choose the one that fits your needs and resources.

The image captioning extension is built into SillyTavern and does not need to be installed separately.

# Quick start

  1. Set up:
    • Open the Image Captioning panel in the Extensions panel
    • Choose a captioning source (most likely "Local" or "Multimodal")
    • For "Multimodal" ensure you've set up the connection in the API Connections tab
  2. Generate a caption:
    • Choose "Generate Caption" from the Extensions popup menu
    • Select an image file when prompted
    • Wait for the caption to be generated
  3. Review and send:
    • The captioned image will be inserted into your message
    • Hover over the image to see the caption in a tooltip
    • Click Send to see what your character thinks of the image!

# Panel controls

# Source Selection

Choose the source for image captioning. Supported options:

| Source | Description |
| --- | --- |
| Multimodal | Cloud: OpenAI, Anthropic, Google, MistralAI, and others. Local: Ollama, llama.cpp, KoboldCpp, Text Generation WebUI, and vLLM. Supports custom prompts, so you can ask your images questions. |
| Local | Uses transformers.js running locally inside your SillyTavern server. Zero setup! |
| Horde | Uses the AI Horde network, a crowdsourced distributed network of image generation models. Nothing to download, configure, or pay for. Variable response times. |
| Extras | The Extras project was discontinued in April 2024 and is not maintained or supported. |

# Caption Configuration

  • Caption Prompt: Enter a custom prompt for captioning. The default prompt is "What's in this image?"
  • Ask every time: Toggle to request a custom prompt for each image caption

# Message Template

  • Message Template: Customize the caption message template. Use {{caption}} macro to insert the generated caption. The default template is [{{user}} sends {{char}} a picture that contains: {{caption}}]
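For example, a custom template built from the same macros might look like this (an illustration, not a built-in preset):

```
[{{user}} shows {{char}} an image. It contains: {{caption}}]
```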

# Auto-captioning

  • Automatically caption images: Toggle to enable automatic captioning of images pasted or attached to messages
  • Edit captions before saving: Toggle to allow editing captions before they are saved

# Captioning images

All the ways to caption images in SillyTavern:

  • Choose "Generate Caption" from the Extensions popup menu and select an image file when prompted
  • Click the Caption icon at the top of an image already in a message
  • Paste an image directly into the chat input with auto-captioning enabled
  • Attach an image file to a message using the Embed File or Image button in the message's actions menu
  • Send a message with an embedded image
  • Use the /caption slash command

# Auto-Captioning

The auto-captioning feature allows you to automatically generate captions for images as they are added to the chat, without manually triggering the captioning process each time.

To enable, select the "Automatically caption images" checkbox in the Image Captioning panel. You can also choose to edit captions before they are saved by checking the "Edit captions before saving" box.

Once enabled, auto-captioning will trigger in the following scenarios:

  • When an image is pasted directly into the chat input.
  • When an image file is attached to a message.
  • When a message with an embedded image is sent.

The system will use your selected captioning source (Local, Extras, Horde, or Multimodal) and the configured settings to generate a caption for the image.

# Editing captions before saving (Refine Mode)

If you've enabled the "Edit captions before saving" option:

  1. After an image is added, a popup will appear with the generated caption.
  2. You can review and edit the caption as needed.
  3. Click "OK" to apply the caption, or "Cancel" to discard the caption without saving.

# Caption sending

The generated (and optionally edited) caption will be automatically inserted into the prompt using the Message Template you've configured. By default, it will be sent in this format:

```
[BaronVonUser sends Seraphina a picture that contains: ...]
```

# Slash Command: /caption

The extension provides a /caption slash command to use in the chatbox or in scripts.

# Usage

```
/caption [quiet=true|false]? [mesId=number]? [prompt]
```

  • prompt (optional): A custom prompt for the captioning model. Only supported by multimodal sources.
  • quiet=true|false: If set to true, suppresses sending a captioned message to the chat. Default is false.
  • mesId=number: Specifies a message ID to caption an image from an existing message instead of uploading a new one.

If no mesId is provided, the command will prompt you to upload an image. When quiet is false (default), a new message with the captioned image will be sent to the chat. The generated caption can be used as input for other commands.

# Examples

Caption a new image with the default settings:

```
/caption
```

Caption a new image with a custom prompt:

```
/caption Describe the main colours and shapes in this image
```

Caption an image from message #5 without sending a new message:

```
/caption mesId=5 quiet=true
```

Caption an image from message #10 with a custom prompt then generate a new image based on the caption:

```
/caption mesId=10 Describe this image using comma-separated keywords | /imagine
```

# Local source

You can change the model in config.yaml. The key is called extras.captioningModel because reasons. Enter the Hugging Face model ID you want to use. The default is Xenova/vit-gpt2-image-captioning.
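For example, assuming the nesting implied by the dotted key name, the setting in config.yaml would look something like this:

```yaml
# config.yaml — nesting assumed from the key name extras.captioningModel
extras:
  captioningModel: Xenova/vit-gpt2-image-captioning
```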

You can use any model that supports image captioning (VisionEncoderDecoderModel or an "image-to-text" pipeline). The model needs to be compatible with the transformers.js library; that is, it needs ONNX weights. Look for models with the ONNX and image-to-text tags, or that have a folder called onnx full of .onnx files.
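A compatible repository typically looks something like this (abridged; the exact file set varies by model):

```
Xenova/vit-gpt2-image-captioning/
├── config.json
├── tokenizer.json
└── onnx/
    ├── encoder_model.onnx
    └── decoder_model_merged.onnx
```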

# Multimodal source

# General configuration

  • Model: Choose the model for image captioning. Options vary based on the selected API.
  • Allow reverse proxy: Toggle to allow using a reverse proxy if defined and valid (OpenAI, Anthropic, Google, Mistral)

API keys and endpoint URLs for captioning sources are managed in the API Connections panel. Set the connection up in API Connections first, then select it as your captions source in Captioning.

For most local backends, you will need to set some options in the model backend rather than in SillyTavern. If your backend can only run one model at a time and doesn't support automatic switching, you are unfortunately going to have a hard time using the same backend for chat and captioning with different models.

Even if you run two instances of the backend on different ports, API Connections only allows one active configuration per backend type. The workaround: you can probably connect to your backend in both Text Completion and Chat Completion modes, which gives you two connections to the same backend type.
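As a sketch, with hypothetical model files and ports, that could mean running two KoboldCpp instances side by side:

```sh
# Hypothetical model files and ports; --port sets the listen port
./koboldcpp --model="models/chat-model.gguf" --port 5001
./koboldcpp --model="models/llava-v1.5-7b-Q4_K.gguf" \
  --mmproj="models/llava-v1.5-7b-mmproj-Q4_0.gguf" --port 5002
```

You would then connect one port in Text Completion mode for chat and the other in Chat Completion mode for captioning.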

# Sources

To use one of these caption sources, select Multimodal in the Source dropdown.

  • "I want the best captioning possible, and I don't mind paying for it": Anthropic
  • "I don't want to pay anything or run anything": Google AI Studio free tier
  • "I want to caption images locally and have it just work": Ollama
  • "I want to keep the dream of local AI alive": KoboldCpp
  • "I want to complain when it doesn't work": Extras

| API Provider | Description |
| --- | --- |
| 01.AI (Yi) | Cloud, paid, yi-vision |
| Anthropic | Cloud, paid, Claude models with vision capabilities: claude-3-5-sonnet/haiku, claude-3-opus/sonnet |
| Custom (OpenAI-compatible) | For custom OpenAI-compatible APIs; uses the currently configured model in the API Connections tab |
| Google AI Studio | Cloud, free tier then paid, Gemini Flash/Pro |
| Groq | Cloud, llama-3.2-vision in 11B/90B, LLaVA |
| KoboldCpp | Local, must configure model in KoboldCpp |
| llama.cpp | Local, must configure model in llama.cpp |
| MistralAI | Cloud, paid, pixtral-large, pixtral-12B |
| Ollama | Local; can switch between available models and download additional vision models within Captioning after configuring in API Connections |
| OpenAI | Cloud, paid, GPT-4 Vision, 4-turbo, 4o, 4o-mini |
| OpenRouter | Cloud, paid (maybe free options), many models; pick from what's available within Captioning after configuring in API Connections |
| Text Generation WebUI (oobabooga) | Local, must configure model in ooba |
| vLLM | Local |

# KoboldCpp

For general information on installing and using KoboldCpp, see the KoboldCpp documentation.

To use KoboldCpp for multimodal captioning:

  • Get a multimodal-capable model, trained to process text and image prompts at the same time.
  • Also get the multimodal projections for the model. These weights allow the model to understand how the text and image parts of the input relate to each other.
  • Load the model and projections in the KoboldCpp launch GUI or on the command line.

The original and classic local multimodal model is LLaVA. GGUF-format files for the model and projections are available from Mozilla/llava-v1.5-7b-llamafile. To load them from the command line, set the model and projections with the --model and --mmproj flags. For example:

```sh
./koboldcpp \
--model="models/llava-v1.5-7b-Q4_K.gguf" \
--mmproj="models/llava-v1.5-7b-mmproj-Q4_0.gguf" \
... other flags ...
```

Some LLaVA finetunes you can try: xtuner/llava-llama-3-8b-v1_1-gguf, xtuner/llava-phi-3-mini-gguf.

You can use multimodal projections for the base model that your particular finetune was built from. Projections for some common base models are available from koboldcpp/mmproj.
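For example, with hypothetical file names, pairing an xtuner Llama 3 finetune with projections for its base model might look like:

```sh
# Hypothetical file names; --model and --mmproj as shown above
./koboldcpp \
--model="models/llava-llama-3-8b-v1_1.gguf" \
--mmproj="models/llama3-8b-mmproj.gguf"
```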