anything-llm/server/storage/models/README.md

# Native models used by AnythingLLM

This folder is specifically created as a local cache and storage folder that is used for native models that can run on a CPU.

Currently, AnythingLLM uses this folder for the following parts of the application.

## Embedding
When your embedding engine preference is `native` we will use the ONNX **all-MiniLM-L6-v2** model built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/all-MiniLM-L6-v2). This model is a quantized and WASM version of the popular [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which produces a 384-dimension vector.

If you are using the `native` embedding engine your vector database should be configured to accept 384-dimension models if that parameter is directly editable (Pinecone only).

## Audio/Video transcription
AnythingLLM allows you to upload various audio and video formats as source documents. In all cases the audio tracks will be transcribed by a locally running ONNX model **whisper-small** built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/whisper-small). The model is a smaller version of the OpenAI Whisper model. Given the model runs locally on CPU, larger files will result in longer transcription times.

Once transcribed you can embed these transcriptions into your workspace like you would any other file! 

**Other external model/transcription providers are also live.**
- [OpenAI Whisper via API key.](https://openai.com/research/whisper)

## Text generation (LLM selection)
> [!IMPORTANT]
> Use of a locally running LLM model is **experimental** and may behave unexpectedly, crash, or not function at all.
> We suggest for production-use of a local LLM model to use a purpose-built inference server like [LocalAI](https://localai.io) or [LMStudio](https://lmstudio.ai).

> [!TIP]
> We recommend at _least_ using a 4-bit or 5-bit quantized model for your LLM. Lower quantization models tend to
> just output unreadable garbage.

If you would like to use a local Llama compatible LLM model for chatting you can select any model from this [HuggingFace search filter](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&other=text-generation-inference&sort=trending)

**Requirements**
- Model must be in the latest `GGUF` format
- Model should be compatible with latest `llama.cpp`
- You should have the proper RAM to run such a model. Requirement depends on model size.

### Where do I put my GGUF model?
> [!IMPORTANT]
> If running in Docker you should be running the container to a mounted storage location on the host machine so you
> can update the storage files directly without having to re-download or re-build your docker container. [See suggested Docker config](../../../README.md#recommended-usage-with-docker-easy)

> [!NOTE]
> `/server/storage/models/downloaded` is the default location that your model files should be at. 
> Your storage directory may differ if you changed the STORAGE_DIR environment variable.

All local models you want to have available for LLM selection should be placed in the `server/storage/models/downloaded` folder. Only `.gguf` files will be allowed to be selected from the UI.
[Feature] AnythingLLM use locally hosted Llama.cpp and GGUF files for inferencing (#413) * Implement use of native embedder (all-Mini-L6-v2) stop showing prisma queries during dev * Add native embedder as an available embedder selection * wrap model loader in try/catch * print progress on download * add built-in LLM support (expiermental) * Update to progress output for embedder * move embedder selection options to component * saftey checks for modelfile * update ref * Hide selection when on hosted subdomain * update documentation hide localLlama when on hosted * saftey checks for storage of models * update dockerfile to pre-build Llama.cpp bindings * update lockfile * add langchain doc comment * remove extraneous --no-metal option * Show data handling for private LLM * persist model in memory for N+1 chats * update import update dev comment on token model size * update primary README * chore: more readme updates and remove screenshots - too much to maintain, just use the app! * remove screeshot link 2023-12-07 23:48:27 +01:00			`# Native models used by AnythingLLM`
Add built-in embedding engine into AnythingLLM (#411) * Implement use of native embedder (all-Mini-L6-v2) stop showing prisma queries during dev * Add native embedder as an available embedder selection * wrap model loader in try/catch * print progress on download * Update to progress output for embedder * move embedder selection options to component * forgot import * add Data privacy alert updates for local embedder 2023-12-06 19:36:22 +01:00
			`This folder is specifically created as a local cache and storage folder that is used for native models that can run on a CPU.`

			`Currently, AnythingLLM uses this folder for the following parts of the application.`

[Feature] AnythingLLM use locally hosted Llama.cpp and GGUF files for inferencing (#413) * Implement use of native embedder (all-Mini-L6-v2) stop showing prisma queries during dev * Add native embedder as an available embedder selection * wrap model loader in try/catch * print progress on download * add built-in LLM support (expiermental) * Update to progress output for embedder * move embedder selection options to component * saftey checks for modelfile * update ref * Hide selection when on hosted subdomain * update documentation hide localLlama when on hosted * saftey checks for storage of models * update dockerfile to pre-build Llama.cpp bindings * update lockfile * add langchain doc comment * remove extraneous --no-metal option * Show data handling for private LLM * persist model in memory for N+1 chats * update import update dev comment on token model size * update primary README * chore: more readme updates and remove screenshots - too much to maintain, just use the app! * remove screeshot link 2023-12-07 23:48:27 +01:00			`## Embedding`
Add built-in embedding engine into AnythingLLM (#411) * Implement use of native embedder (all-Mini-L6-v2) stop showing prisma queries during dev * Add native embedder as an available embedder selection * wrap model loader in try/catch * print progress on download * Update to progress output for embedder * move embedder selection options to component * forgot import * add Data privacy alert updates for local embedder 2023-12-06 19:36:22 +01:00			When your embedding engine preference is `native` we will use the ONNX all-MiniLM-L6-v2 model built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/all-MiniLM-L6-v2). This model is a quantized and WASM version of the popular [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which produces a 384-dimension vector.

			If you are using the `native` embedding engine your vector database should be configured to accept 384-dimension models if that parameter is directly editable (Pinecone only).

feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449) * feat: Embed on-instance Whisper model for audio/mp4 transcribing resolves #329 * additional logging * add placeholder for tmp folder in collector storage Add cleanup of hotdir and tmp on collector boot to prevent hanging files split loading of model and file conversion into concurrency * update README * update model size * update supported filetypes 2023-12-15 20:20:13 +01:00			`## Audio/Video transcription`
			`AnythingLLM allows you to upload various audio and video formats as source documents. In all cases the audio tracks will be transcribed by a locally running ONNX model whisper-small built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/whisper-small). The model is a smaller version of the OpenAI Whisper model. Given the model runs locally on CPU, larger files will result in longer transcription times.`

			`Once transcribed you can embed these transcriptions into your workspace like you would any other file!`

Support external transcription providers (#909) * Support External Transcription providers * patch files * update docs * fix return data 2024-03-14 23:43:26 +01:00			`Other external model/transcription providers are also live.`
			`- [OpenAI Whisper via API key.](https://openai.com/research/whisper)`

[Feature] AnythingLLM use locally hosted Llama.cpp and GGUF files for inferencing (#413) * Implement use of native embedder (all-Mini-L6-v2) stop showing prisma queries during dev * Add native embedder as an available embedder selection * wrap model loader in try/catch * print progress on download * add built-in LLM support (expiermental) * Update to progress output for embedder * move embedder selection options to component * saftey checks for modelfile * update ref * Hide selection when on hosted subdomain * update documentation hide localLlama when on hosted * saftey checks for storage of models * update dockerfile to pre-build Llama.cpp bindings * update lockfile * add langchain doc comment * remove extraneous --no-metal option * Show data handling for private LLM * persist model in memory for N+1 chats * update import update dev comment on token model size * update primary README * chore: more readme updates and remove screenshots - too much to maintain, just use the app! * remove screeshot link 2023-12-07 23:48:27 +01:00			`## Text generation (LLM selection)`
			`> [!IMPORTANT]`
			`> Use of a locally running LLM model is experimental and may behave unexpectedly, crash, or not function at all.`
			`> We suggest for production-use of a local LLM model to use a purpose-built inference server like [LocalAI](https://localai.io) or [LMStudio](https://lmstudio.ai).`

			`> [!TIP]`
			`> We recommend at _least_ using a 4-bit or 5-bit quantized model for your LLM. Lower quantization models tend to`
			`> just output unreadable garbage.`

			`If you would like to use a local Llama compatible LLM model for chatting you can select any model from this [HuggingFace search filter](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&other=text-generation-inference&sort=trending)`

			`Requirements`
			- Model must be in the latest `GGUF` format
			- Model should be compatible with latest `llama.cpp`
			`- You should have the proper RAM to run such a model. Requirement depends on model size.`

			`### Where do I put my GGUF model?`
			`> [!IMPORTANT]`
			`> If running in Docker you should be running the container to a mounted storage location on the host machine so you`
			`> can update the storage files directly without having to re-download or re-build your docker container. [See suggested Docker config](../../../README.md#recommended-usage-with-docker-easy)`

docs: placeholder for model downloads folder (#446) 2023-12-14 19:31:14 +01:00			`> [!NOTE]`
			> `/server/storage/models/downloaded` is the default location that your model files should be at.
			`> Your storage directory may differ if you changed the STORAGE_DIR environment variable.`

			All local models you want to have available for LLM selection should be placed in the `server/storage/models/downloaded` folder. Only `.gguf` files will be allowed to be selected from the UI.