# Native models used by AnythingLLM

This folder is a local cache and storage location for native models that can run on a CPU.

Currently, AnythingLLM uses this folder for the following parts of the application:

## Embedding
When your embedding engine preference is `native`, AnythingLLM uses the ONNX **all-MiniLM-L6-v2** model built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/all-MiniLM-L6-v2). This model is a quantized, WASM-compatible version of the popular [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which produces a 384-dimension vector.

If you are using the `native` embedding engine, your vector database should be configured to accept 384-dimension vectors if that parameter is directly editable (Pinecone only).
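
As a minimal sketch of why the dimension matters, the check below rejects any vector whose length differs from 384 before it is sent to a vector database. The function name `check_dimensions` and the constant `EXPECTED_DIM` are hypothetical and are not part of AnythingLLM.

```python
# Hypothetical helper, not part of AnythingLLM.
EXPECTED_DIM = 384  # vector size produced by all-MiniLM-L6-v2


def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the number of vectors, or raise if any has the wrong length."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )
    return len(vectors)
```

A database index created with a different dimension would typically refuse such vectors, so a check like this surfaces the mismatch earlier.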

## Text generation (LLM selection)
> [!IMPORTANT]
> Use of a locally running LLM is **experimental** and may behave unexpectedly, crash, or not function at all.
> For production use of a local LLM, we suggest a purpose-built inference server like [LocalAI](https://localai.io) or [LMStudio](https://lmstudio.ai).

> [!TIP]
> We recommend using at _least_ a 4-bit or 5-bit quantized model for your LLM. Models quantized below that tend to
> output unreadable garbage.

If you would like to use a local Llama-compatible LLM for chatting, you can select any model from this [HuggingFace search filter](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&other=text-generation-inference&sort=trending).

**Requirements**
- Model must be in the latest `GGUF` format.
- Model should be compatible with the latest `llama.cpp`.
- You should have enough RAM to run the model. The requirement depends on model size.
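
To make the RAM requirement more concrete, here is a rough back-of-the-envelope sketch. It assumes the weights take about `parameters * bits_per_weight / 8` bytes and adds a 20% allowance for runtime overhead. Both the function names and the 1.2 overhead factor are assumptions for illustration, not values taken from AnythingLLM or `llama.cpp`, and real GGUF quantizations often use slightly more bits per weight than their nominal label.

```python
# Hypothetical estimate, not an AnythingLLM or llama.cpp API.
def estimate_weight_bytes(n_params, bits_per_weight):
    """Approximate size of the model weights in bytes."""
    return int(n_params * bits_per_weight / 8)


def fits_in_ram(n_params, bits_per_weight, available_bytes, overhead=1.2):
    """Rough check: weights plus an assumed overhead factor must fit in RAM."""
    needed = estimate_weight_bytes(n_params, bits_per_weight) * overhead
    return needed <= available_bytes


if __name__ == "__main__":
    seven_b = 7_000_000_000
    # Weights of a 7B-parameter model at a nominal 4 bits per weight.
    print(estimate_weight_bytes(seven_b, 4))
    # Would that model plausibly fit on a machine with 8 GiB of RAM?
    print(fits_in_ram(seven_b, 4, 8 * 1024**3))
```

Treat the result as a lower bound: context length and other runtime buffers add to the total.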

### Where do I put my GGUF model?
> [!IMPORTANT]
> If running in Docker, you should run the container with its storage mounted to a location on the host machine so you
> can update the storage files directly without having to re-download or rebuild your Docker container. [See suggested Docker config](../../../README.md#recommended-usage-with-docker-easy)

> [!NOTE]
> `/server/storage/models/downloaded` is the default location for your model files.
> Your storage directory may differ if you changed the `STORAGE_DIR` environment variable.

Place all local models you want available for LLM selection in the `server/storage/models/downloaded` folder. Only `.gguf` files can be selected from the UI.
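
To check which files in that folder would qualify, a small script along these lines may help. It is a sketch under stated assumptions: it assumes models live in `models/downloaded` beneath `STORAGE_DIR` (falling back to `server/storage`), and it additionally checks that each file starts with the 4-byte `GGUF` magic. The header check and all function names are illustrative additions, not a description of what the AnythingLLM UI actually does.

```python
import os
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def models_dir(storage_dir=None):
    """Resolve the downloaded-models folder (path layout is an assumption)."""
    base = storage_dir or os.environ.get("STORAGE_DIR") or "server/storage"
    return Path(base) / "models" / "downloaded"


def has_gguf_header(path):
    """Return True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(4) == GGUF_MAGIC


def list_selectable_models(storage_dir=None):
    """List `.gguf` files in the folder that also carry a GGUF header."""
    folder = models_dir(storage_dir)
    if not folder.is_dir():
        return []
    return sorted(
        entry.name
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix == ".gguf" and has_gguf_header(entry)
    )


if __name__ == "__main__":
    for name in list_selectable_models():
        print(name)
```

A file with a `.gguf` extension but a different header is likely in an older or unrelated format and is skipped by this sketch.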