
Native models used by AnythingLLM

This folder is a local cache and storage location for native models that can run on a CPU.

Currently, AnythingLLM uses this folder for the following parts of the application:

Embedding

When your embedding engine preference is native, AnythingLLM uses the ONNX all-MiniLM-L6-v2 model built by Xenova on HuggingFace.co. This model is a quantized WASM version of the popular all-MiniLM-L6-v2, which produces a 384-dimension vector.

If you are using the native embedding engine, your vector database should be configured to accept 384-dimension vectors if that parameter is directly editable (Pinecone only).
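
As a minimal sketch of the dimension requirement above, the check below rejects any vector whose length differs from the 384 dimensions produced by the native embedder. The names `NATIVE_EMBEDDING_DIM` and `assert_dimension` are hypothetical and chosen for illustration; they are not part of AnythingLLM.

```python
from typing import Sequence

# 384 is the output size of all-MiniLM-L6-v2, as stated above.
NATIVE_EMBEDDING_DIM = 384


def assert_dimension(vector: Sequence[float], expected: int = NATIVE_EMBEDDING_DIM) -> None:
    """Raise ValueError if a vector does not match the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"vector has {len(vector)} dimensions, but the index expects {expected}"
        )


# A correctly sized vector passes silently.
assert_dimension([0.0] * 384)
```

A check like this is most useful when switching embedding engines, since vectors of a different size cannot be stored in an index created for 384 dimensions.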

Audio/Video transcription

AnythingLLM allows you to upload various audio and video formats as source documents. Whatever the format, the audio track is transcribed by a locally running ONNX model, whisper-small, built by Xenova on HuggingFace.co. This model is a smaller version of the OpenAI Whisper model. Because it runs locally on CPU, larger files take longer to transcribe.

Once transcribed, you can embed these transcriptions into your workspace like you would any other file!

Other external model and transcription providers are also available.

Text generation (LLM selection)

Important

Use of a locally running LLM is experimental and may behave unexpectedly, crash, or not function at all. For production use of a local LLM, we suggest a purpose-built inference server such as LocalAI or LMStudio.

Tip

We recommend using at least a 4-bit or 5-bit quantized model for your LLM. Models quantized to fewer bits tend to output unreadable garbage.

If you would like to use a local Llama-compatible LLM for chatting, you can select any model from this HuggingFace search filter.

Requirements

  • Model must be in the latest GGUF format
  • Model should be compatible with latest llama.cpp
  • You should have enough RAM to run the model. The requirement depends on model size.
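
The first and last requirements above can be roughly pre-checked before loading a model. The sketch below reads the file header, which in a GGUF file starts with the four bytes `GGUF` followed by a little-endian 32-bit version number, and reports the file size. Treating the file size as a rough lower bound on the memory needed is an assumption, not a rule from AnythingLLM, and the function names are hypothetical.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path: Path) -> int:
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with path.open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path.name} is not a GGUF file")
    # The version is a little-endian unsigned 32-bit integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def file_size_gib(path: Path) -> float:
    """File size in GiB; assumed here to be a rough lower bound on RAM needed."""
    return path.stat().st_size / 1024**3
```

This only confirms the container format. Whether a given file works with the latest llama.cpp still has to be verified by loading it.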

Where do I put my GGUF model?

Important

If you are running in Docker, run the container with storage mounted to a location on the host machine so you can update the storage files directly without having to re-download or rebuild your Docker container. See the suggested Docker config.

Note

/server/storage/models/downloaded is the default location for your model files. Your storage directory may differ if you changed the STORAGE_DIR environment variable.

Place every local model you want available for LLM selection in the server/storage/models/downloaded folder. Only .gguf files can be selected from the UI.
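
To see which files would qualify, a short script can list the .gguf files in that folder. This is a sketch, not AnythingLLM's own lookup code: it assumes STORAGE_DIR points at the storage root, so that models live in `<STORAGE_DIR>/models/downloaded`, and it falls back to the default path named above. The function names `models_dir` and `selectable_models` are hypothetical.

```python
import os
from pathlib import Path


def models_dir() -> Path:
    """Resolve the downloaded-models folder, honoring STORAGE_DIR if set."""
    # Assumption: STORAGE_DIR is the storage root; otherwise use the default path.
    storage = Path(os.environ.get("STORAGE_DIR", "server/storage"))
    return storage / "models" / "downloaded"


def selectable_models(folder: Path) -> list[str]:
    """Return the names of .gguf files in the folder, sorted alphabetically."""
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix == ".gguf"
    )
```

Calling `selectable_models(models_dir())` from the repository root returns an empty list when the folder is missing or holds no .gguf files, which is a quick way to confirm that a model landed in the right place.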