anything-llm/server/storage/models/README.md
Timothy Carambat 655ebd9479
[Feature] AnythingLLM use locally hosted Llama.cpp and GGUF files for inferencing (#413)
* Implement use of native embedder (all-Mini-L6-v2)
stop showing prisma queries during dev

* Add native embedder as an available embedder selection

* wrap model loader in try/catch

* print progress on download

* add built-in LLM support (expiermental)

* Update to progress output for embedder

* move embedder selection options to component

* saftey checks for modelfile

* update ref

* Hide selection when on hosted subdomain

* update documentation
hide localLlama when on hosted

* saftey checks for storage of models

* update dockerfile to pre-build Llama.cpp bindings

* update lockfile

* add langchain doc comment

* remove extraneous --no-metal option

* Show data handling for private LLM

* persist model in memory for N+1 chats

* update import
update dev comment on token model size

* update primary README

* chore: more readme updates and remove screenshots - too much to maintain, just use the app!

* remove screeshot link
2023-12-07 14:48:27 -08:00

2.2 KiB

Native models used by AnythingLLM

This folder is specifically created as a local cache and storage folder that is used for native models that can run on a CPU.

Currently, AnythingLLM uses this folder for the following parts of the application.

Embedding

When your embedding engine preference is native we will use the ONNX all-MiniLM-L6-v2 model built by Xenova on HuggingFace.co. This model is a quantized and WASM version of the popular all-MiniLM-L6-v2 which produces a 384-dimension vector.

If you are using the native embedding engine your vector database should be configured to accept 384-dimension models if that parameter is directly editable (Pinecone only).

Text generation (LLM selection)

Important

Use of a locally running LLM model is experimental and may behave unexpectedly, crash, or not function at all. We suggest for production-use of a local LLM model to use a purpose-built inference server like LocalAI or LMStudio.

Tip

We recommend at least using a 4-bit or 5-bit quantized model for your LLM. Lower quantization models tend to just output unreadable garbage.

If you would like to use a local Llama compatible LLM model for chatting you can select any model from this HuggingFace search filter

Requirements

  • Model must be in the latest GGUF format
  • Model should be compatible with latest llama.cpp
  • You should have the proper RAM to run such a model. Requirement depends on model size.

Where do I put my GGUF model?

Important

If running in Docker you should be running the container to a mounted storage location on the host machine so you can update the storage files directly without having to re-download or re-build your docker container. See suggested Docker config

All local models you want to have available for LLM selection should be placed in the storage/models/downloaded folder. Only .gguf files will be allowed to be selected from the UI.