[Feature] AnythingLLM can use locally hosted Llama.cpp and GGUF files for inferencing (#413)

* Implement use of native embedder (all-MiniLM-L6-v2)
stop showing prisma queries during dev

* Add native embedder as an available embedder selection

* wrap model loader in try/catch

* print progress on download

* add built-in LLM support (experimental)

* Update to progress output for embedder

* move embedder selection options to component

* safety checks for modelfile

* update ref

* Hide selection when on hosted subdomain

* update documentation
hide localLlama when on hosted

* safety checks for storage of models

* update dockerfile to pre-build Llama.cpp bindings

* update lockfile

* add langchain doc comment

* remove extraneous --no-metal option

* Show data handling for private LLM

* persist model in memory for N+1 chats

* update import
update dev comment on token model size

* update primary README

* chore: more readme updates and remove screenshots - too much to maintain, just use the app!

* remove screenshot link
Timothy Carambat, 2023-12-07 14:48:27 -08:00 (committed by GitHub)
commit 655ebd9479 (parent fecfb0fafc)
22 changed files with 1304 additions and 99 deletions

View File

@@ -3,7 +3,7 @@
</p>
<p align="center">
<b>AnythingLLM: A document chatbot to chat with <i>anything!</i></b>. <br />
<b>AnythingLLM: A private ChatGPT to chat with <i>anything!</i></b>. <br />
An efficient, customizable, and open-source enterprise-ready document chatbot solution.
</p>
@@ -22,10 +22,9 @@
</a>
</p>
A full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use.
A full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use and also supports multi-user management and permissions.
![Chatting](/images/screenshots/chatting.gif)
[view more screenshots](/images/screenshots/SCREENSHOTS.md)
### Watch the demo!
@@ -33,17 +32,16 @@ A full-stack application that enables you to turn any document, resource, or pie
### Product Overview
AnythingLLM aims to be a full-stack application where you can use commercial off-the-shelf LLMs or popular open source LLMs and vectorDB solutions.
Anything LLM is a full-stack product that you can run locally as well as host remotely and be able to chat intelligently with any documents you provide it.
AnythingLLM is a full-stack application where you can use commercial off-the-shelf LLMs or popular open-source LLMs and vectorDB solutions to build a private ChatGPT with no compromises that you can run locally or host remotely, letting you chat intelligently with any documents you provide it.
AnythingLLM divides your documents into objects called `workspaces`. A Workspace functions a lot like a thread, but with the addition of containerization of your documents. Workspaces can share documents, but they do not talk to each other so you can keep your context for each workspace clean.
Some cool features of AnythingLLM
- **Multi-user instance support and permissioning**
- Atomically manage documents in your vector database from a simple UI
- Multiple document type support (PDF, TXT, DOCX, etc)
- Manage documents in your vector database from a simple UI
- Two chat modes `conversation` and `query`. Conversation retains previous questions and amendments. Query is simple QA against your documents
- Each chat response contains a citation that is linked to the original document source
- In-chat citations linked to the original document source and text
- Simple technology stack for fast iteration
- 100% Cloud deployment ready.
- "Bring your own LLM" model.
@@ -52,6 +50,7 @@ Some cool features of AnythingLLM
### Supported LLMs, Embedders, and Vector Databases
**Supported LLMs:**
- [Any open-source llama.cpp compatible model](/server/storage/models/README.md#text-generation-llm-selection)
- [OpenAI](https://openai.com)
- [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service)
- [Anthropic ClaudeV2](https://www.anthropic.com/)
@@ -80,13 +79,18 @@ This monorepo consists of three main sections:
- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
- `docker`: Docker instructions and build process + information for building from source.
### Requirements
### Minimum Requirements
> [!TIP]
> Running AnythingLLM on AWS/GCP/Azure?
> You should aim for at least 2GB of RAM. Disk storage is proportional to how much data
> you will be storing (documents, vectors, models, etc). Minimum 10GB recommended.
- `yarn` and `node` on your machine
- `python` 3.9+ for running scripts in `collector/`.
- access to an LLM running locally or remotely.
- (optional) a vector database like Pinecone, qDrant, Weaviate, or Chroma*.
*AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb).
*AnythingLLM by default embeds text privately on-instance. [Learn More](/server/storage/models/README.md)
## Recommended usage with Docker (easy!)
@@ -107,8 +111,8 @@ docker run -d -p 3001:3001 \
mintplexlabs/anythingllm:master
```
Go to `http://localhost:3001` and you are now using AnythingLLM! All your data and progress will persist between
container rebuilds or pulls from Docker Hub.
Open [http://localhost:3001](http://localhost:3001) and you are now using AnythingLLM!
All your data and progress will now persist between container rebuilds or pulls from Docker Hub.
[Learn more about running AnythingLLM with Docker](./docker/HOW_TO_USE_DOCKER.md)

View File

@@ -13,7 +13,7 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 \
libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 \
libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release \
xdg-utils && \
xdg-utils git build-essential && \
mkdir -p /etc/apt/keyrings && \
curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_18.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
@@ -60,6 +60,11 @@ RUN cd ./server/ && yarn install --production && yarn cache clean && \
rm /app/server/node_modules/vectordb/x86_64-apple-darwin.node && \
rm /app/server/node_modules/vectordb/aarch64-apple-darwin.node
# Compile Llama.cpp bindings for node-llama-cpp for this operating system.
USER root
RUN cd ./server && npx --no node-llama-cpp download
USER anythingllm
# Build the frontend
FROM frontend-deps as build-stage
COPY ./frontend/ ./frontend/

View File

@@ -0,0 +1,84 @@
import { useEffect, useState } from "react";
import { Flask } from "@phosphor-icons/react";
import System from "@/models/system";
export default function NativeLLMOptions({ settings }) {
return (
<div className="w-full flex flex-col gap-y-4">
<div className="flex flex-col md:flex-row md:items-center gap-x-2 text-white mb-4 bg-orange-800/30 w-fit rounded-lg px-4 py-2">
<div className="gap-x-2 flex items-center">
<Flask size={18} />
<p className="text-sm md:text-base">
Using a locally hosted LLM is experimental. Use with caution.
</p>
</div>
</div>
<div className="w-full flex items-center gap-4">
<NativeModelSelection settings={settings} />
</div>
</div>
);
}
function NativeModelSelection({ settings }) {
const [customModels, setCustomModels] = useState([]);
const [loading, setLoading] = useState(true);
useEffect(() => {
async function findCustomModels() {
setLoading(true);
const { models } = await System.customModels("native-llm", null, null);
setCustomModels(models || []);
setLoading(false);
}
findCustomModels();
}, []);
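// While models are still loading (or none have been downloaded yet), render a disabled placeholder select.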
if (loading || customModels.length == 0) {
return (
<div className="flex flex-col w-60">
<label className="text-white text-sm font-semibold block mb-4">
Model Selection
</label>
<select
name="NativeLLMModelPref"
disabled={true}
className="bg-zinc-900 border border-gray-500 text-white text-sm rounded-lg block w-full p-2.5"
>
<option disabled={true} selected={true}>
-- waiting for models --
</option>
</select>
</div>
);
}
return (
<div className="flex flex-col w-60">
<label className="text-white text-sm font-semibold block mb-4">
Model Selection
</label>
<select
name="NativeLLMModelPref"
required={true}
className="bg-zinc-900 border border-gray-500 text-white text-sm rounded-lg block w-full p-2.5"
>
{customModels.length > 0 && (
<optgroup label="Your loaded models">
{customModels.map((model) => {
return (
<option
key={model.id}
value={model.id}
selected={settings.NativeLLMModelPref === model.id}
>
{model.id}
</option>
);
})}
</optgroup>
)}
</select>
</div>
);
}

View File

@@ -21,9 +21,8 @@ function useIsAuthenticated() {
const {
MultiUserMode,
RequiresAuth,
OpenAiKey = false,
AnthropicApiKey = false,
AzureOpenAiKey = false,
LLMProvider = null,
VectorDB = null,
} = await System.keys();
setMultiUserMode(MultiUserMode);
@@ -32,9 +31,8 @@ function useIsAuthenticated() {
if (
!MultiUserMode &&
!RequiresAuth && // Not in Multi-user AND no password set.
!OpenAiKey &&
!AnthropicApiKey &&
!AzureOpenAiKey // AND no LLM API Key set at all.
!LLMProvider &&
!VectorDB
) {
setShouldRedirectToOnboarding(true);
setIsAuthed(true);

View File

@@ -3,6 +3,7 @@ import Sidebar, { SidebarMobileHeader } from "@/components/SettingsSidebar";
import { isMobile } from "react-device-detect";
import System from "@/models/system";
import showToast from "@/utils/toast";
import AnythingLLMIcon from "@/media/logo/anything-llm-icon.png";
import OpenAiLogo from "@/media/llmprovider/openai.png";
import AzureOpenAiLogo from "@/media/llmprovider/azure.png";
import AnthropicLogo from "@/media/llmprovider/anthropic.png";
@@ -15,6 +16,7 @@ import AzureAiOptions from "@/components/LLMSelection/AzureAiOptions";
import AnthropicAiOptions from "@/components/LLMSelection/AnthropicAiOptions";
import LMStudioOptions from "@/components/LLMSelection/LMStudioOptions";
import LocalAiOptions from "@/components/LLMSelection/LocalAiOptions";
import NativeLLMOptions from "@/components/LLMSelection/NativeLLMOptions";
export default function GeneralLLMPreference() {
const [saving, setSaving] = useState(false);
@@ -150,6 +152,16 @@ export default function GeneralLLMPreference() {
image={LocalAiLogo}
onClick={updateLLMChoice}
/>
{!window.location.hostname.includes("useanything.com") && (
<LLMProviderOption
name="Custom Llama Model"
value="native"
description="Use a downloaded custom Llama model for chatting on this AnythingLLM instance."
checked={llmChoice === "native"}
image={AnythingLLMIcon}
onClick={updateLLMChoice}
/>
)}
</div>
<div className="mt-10 flex flex-wrap gap-4 max-w-[800px]">
{llmChoice === "openai" && (
@@ -167,6 +179,9 @@ export default function GeneralLLMPreference() {
{llmChoice === "localai" && (
<LocalAiOptions settings={settings} showAlert={true} />
)}
{llmChoice === "native" && (
<NativeLLMOptions settings={settings} />
)}
</div>
</div>
</form>

View File

@@ -52,6 +52,13 @@ const LLM_SELECTION_PRIVACY = {
],
logo: LocalAiLogo,
},
native: {
name: "Custom Llama Model",
description: [
"Your model and chats are only accessible on this AnythingLLM instance",
],
logo: AnythingLLMIcon,
},
};
const VECTOR_DB_PRIVACY = {

View File

@@ -1,4 +1,5 @@
import React, { memo, useEffect, useState } from "react";
import AnythingLLMIcon from "@/media/logo/anything-llm-icon.png";
import OpenAiLogo from "@/media/llmprovider/openai.png";
import AzureOpenAiLogo from "@/media/llmprovider/azure.png";
import AnthropicLogo from "@/media/llmprovider/anthropic.png";
@@ -12,6 +13,7 @@ import AzureAiOptions from "@/components/LLMSelection/AzureAiOptions";
import AnthropicAiOptions from "@/components/LLMSelection/AnthropicAiOptions";
import LMStudioOptions from "@/components/LLMSelection/LMStudioOptions";
import LocalAiOptions from "@/components/LLMSelection/LocalAiOptions";
import NativeLLMOptions from "@/components/LLMSelection/NativeLLMOptions";
function LLMSelection({ nextStep, prevStep, currentStep }) {
const [llmChoice, setLLMChoice] = useState("openai");
@@ -110,6 +112,14 @@ function LLMSelection({ nextStep, prevStep, currentStep }) {
image={LocalAiLogo}
onClick={updateLLMChoice}
/>
<LLMProviderOption
name="Custom Llama Model"
value="native"
description="Use a downloaded custom Llama model for chatting on this AnythingLLM instance."
checked={llmChoice === "native"}
image={AnythingLLMIcon}
onClick={updateLLMChoice}
/>
</div>
<div className="mt-4 flex flex-wrap gap-4 max-w-[752px]">
{llmChoice === "openai" && <OpenAiOptions settings={settings} />}
@@ -121,6 +131,7 @@ function LLMSelection({ nextStep, prevStep, currentStep }) {
<LMStudioOptions settings={settings} />
)}
{llmChoice === "localai" && <LocalAiOptions settings={settings} />}
{llmChoice === "native" && <NativeLLMOptions settings={settings} />}
</div>
</div>
<div className="flex w-full justify-between items-center px-6 py-4 space-x-2 border-t rounded-b border-gray-500/50">

View File

@@ -1,18 +0,0 @@
# AnythingLLM Screenshots
### Homescreen
![Homescreen](./home.png)
### Document Manager
`Cached` means the current version of the document has been embedded before and will not cost money to convert into a vector!
![Document Manager](./document.png)
### Document Uploading & Embedding
![Uploading Document](./uploading_doc.gif)
### Chatting
![Chatting](./chatting.gif)
### Settings & Configs
![LLM Selection](./llm_selection.png)
![Vector Database Selection](./vector_databases.png)

(Five binary screenshot files deleted; contents not shown. Sizes: 527 KiB, 595 KiB, 519 KiB, 3.4 MiB, and 582 KiB.)

View File

@@ -14,8 +14,8 @@ const SystemSettings = {
"telemetry_id",
],
currentSettings: async function () {
const llmProvider = process.env.LLM_PROVIDER || "openai";
const vectorDB = process.env.VECTOR_DB || "lancedb";
const llmProvider = process.env.LLM_PROVIDER;
const vectorDB = process.env.VECTOR_DB;
return {
RequiresAuth: !!process.env.AUTH_TOKEN,
AuthToken: !!process.env.AUTH_TOKEN,
@@ -111,6 +111,11 @@
AzureOpenAiEmbeddingModelPref: process.env.EMBEDDING_MODEL_PREF,
}
: {}),
...(llmProvider === "native"
? {
NativeLLMModelPref: process.env.NATIVE_LLM_MODEL_PREF,
}
: {}),
};
},

View File

@@ -41,10 +41,11 @@
"joi-password-complexity": "^5.2.0",
"js-tiktoken": "^1.0.7",
"jsonwebtoken": "^8.5.1",
"langchain": "^0.0.90",
"langchain": "0.0.201",
"mime": "^3.0.0",
"moment": "^2.29.4",
"multer": "^1.4.5-lts.1",
"node-llama-cpp": "^2.8.0",
"openai": "^3.2.1",
"pinecone-client": "^1.1.0",
"posthog-node": "^3.1.1",
@@ -64,4 +65,4 @@
"nodemon": "^2.0.22",
"prettier": "^2.4.1"
}
}
}

View File

@@ -1,13 +1,33 @@
## Native models used by AnythingLLM
# Native models used by AnythingLLM
This folder serves as a local cache and storage location for native models that can run on a CPU.
Currently, AnythingLLM uses this folder for the following parts of the application.
### Embedding
## Embedding
When your embedding engine preference is `native` we will use the ONNX **all-MiniLM-L6-v2** model built by [Xenova on HuggingFace.co](https://huggingface.co/Xenova/all-MiniLM-L6-v2). This model is a quantized and WASM version of the popular [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which produces a 384-dimension vector.
If you are using the `native` embedding engine your vector database should be configured to accept 384-dimension vectors if that parameter is directly editable (Pinecone only).
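For reference, below is a minimal sketch of invoking this model from Node with the `@xenova/transformers` package. It only illustrates the embedding flow; the dynamic import and option names follow the public Transformers.js API and are not the exact AnythingLLM embedder code.

```js
// Illustrative sketch only, not the exact AnythingLLM embedder implementation.
async function embedTextInput(text) {
  // @xenova/transformers is published as an ES module, so load it dynamically from CommonJS.
  const { pipeline } = await import("@xenova/transformers");
  const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  // Mean-pool the token embeddings and normalize to produce a single 384-dimension vector.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data); // 384 floats
}
```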
### Text generation (LLM selection)
_in progress_
## Text generation (LLM selection)
> [!IMPORTANT]
> Running a local LLM model is **experimental** and may behave unexpectedly, crash, or not function at all.
> For production use of a local LLM we suggest a purpose-built inference server like [LocalAI](https://localai.io) or [LMStudio](https://lmstudio.ai).
> [!TIP]
> We recommend at _least_ using a 4-bit or 5-bit quantized model for your LLM. Lower quantization models tend to
> just output unreadable garbage.
If you would like to use a local Llama-compatible LLM for chatting, you can select any model from this [HuggingFace search filter](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&other=text-generation-inference&sort=trending).
**Requirements**
- Model must be in the latest `GGUF` format
- Model should be compatible with the latest `llama.cpp`
- You should have enough RAM to run such a model; the requirement depends on model size
### Where do I put my GGUF model?
> [!IMPORTANT]
> If running in Docker you should run the container with a mounted storage location on the host machine so you
> can update the model files directly without having to re-download or re-build your Docker container. [See suggested Docker config](../../../README.md#recommended-usage-with-docker-easy)
All local models you want to have available for LLM selection should be placed in the `storage/models/downloaded` folder. Only `.gguf` files can be selected from the UI.
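For a rough picture of how files placed in this folder surface in the model selector, the server helper added in this commit simply scans the folder for `.gguf` files. A simplified sketch (the function name here is illustrative):

```js
// Simplified sketch of how downloaded GGUF models are discovered for the selection UI.
const fs = require("fs");
const path = require("path");

function listDownloadedModels(storageDir = path.resolve(__dirname, "downloaded")) {
  if (!fs.existsSync(storageDir)) return [];
  return fs
    .readdirSync(storageDir)
    .filter((file) => file.toLowerCase().includes(".gguf")) // only GGUF files are offered
    .map((file) => ({ id: file, name: file }));
}
```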

View File

@@ -0,0 +1,196 @@
const os = require("os");
const fs = require("fs");
const path = require("path");
const { NativeEmbedder } = require("../../EmbeddingEngines/native");
const { HumanMessage, SystemMessage, AIMessage } = require("langchain/schema");
const { chatPrompt } = require("../../chats");
// Docs: https://api.js.langchain.com/classes/chat_models_llama_cpp.ChatLlamaCpp.html
const ChatLlamaCpp = (...args) =>
import("langchain/chat_models/llama_cpp").then(
({ ChatLlamaCpp }) => new ChatLlamaCpp(...args)
);
class NativeLLM {
constructor(embedder = null) {
if (!process.env.NATIVE_LLM_MODEL_PREF)
throw new Error("No local Llama model was set.");
this.model = process.env.NATIVE_LLM_MODEL_PREF || null;
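// Reserve the context window as roughly 15% history, 15% system prompt, and 70% user prompt
// (about 614 / 614 / 2,867 tokens with the default 4096-token window).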
this.limits = {
history: this.promptWindowLimit() * 0.15,
system: this.promptWindowLimit() * 0.15,
user: this.promptWindowLimit() * 0.7,
};
this.embedder = embedder || new NativeEmbedder();
this.cacheDir = path.resolve(
process.env.STORAGE_DIR
? path.resolve(process.env.STORAGE_DIR, "models", "downloaded")
: path.resolve(__dirname, `../../../storage/models/downloaded`)
);
// Tell node-llama-cpp whether to build with Metal support if the bindings need to be
// rebuilt at runtime (only enabled when running on Apple Silicon).
process.env.NODE_LLAMA_CPP_METAL = os
.cpus()
.some((cpu) => cpu.model.includes("Apple"));
// Create the models directory if it does not already exist (e.g. in existing installations)
if (!fs.existsSync(this.cacheDir)) fs.mkdirSync(this.cacheDir);
}
async #initializeLlamaModel(temperature = 0.7) {
const modelPath = path.join(this.cacheDir, this.model);
if (!fs.existsSync(modelPath))
throw new Error(
`Local Llama model ${this.model} was not found in storage!`
);
global.llamaModelInstance = await ChatLlamaCpp({
modelPath,
temperature,
useMlock: true,
});
}
// If the model has been loaded once, it is in the memory now
// so we can skip re-loading it and instead go straight to inference.
// Note: this means the temperature setting will not update when hopping between workspaces with different temps.
async llamaClient({ temperature = 0.7 }) {
if (global.llamaModelInstance) return global.llamaModelInstance;
await this.#initializeLlamaModel(temperature);
return global.llamaModelInstance;
}
streamingEnabled() {
return "streamChat" in this && "streamGetChatCompletion" in this;
}
// Ensure the user set a value for the token limit
// and if undefined - assume 4096 window.
// DEV: Currently this ENV is not configurable.
promptWindowLimit() {
const limit = process.env.NATIVE_LLM_MODEL_TOKEN_LIMIT || 4096;
if (!limit || isNaN(Number(limit)))
throw new Error("No NativeAI token context limit was set.");
return Number(limit);
}
constructPrompt({
systemPrompt = "",
contextTexts = [],
chatHistory = [],
userPrompt = "",
}) {
const prompt = {
role: "system",
content: `${systemPrompt}
Context:
${contextTexts
.map((text, i) => {
return `[CONTEXT ${i}]:\n${text}\n[END CONTEXT ${i}]\n\n`;
})
.join("")}`,
};
return [prompt, ...chatHistory, { role: "user", content: userPrompt }];
}
async isSafe(_input = "") {
// Not implemented so must be stubbed
return { safe: true, reasons: [] };
}
async sendChat(chatHistory = [], prompt, workspace = {}, rawHistory = []) {
try {
const messages = await this.compressMessages(
{
systemPrompt: chatPrompt(workspace),
userPrompt: prompt,
chatHistory,
},
rawHistory
);
const model = await this.llamaClient({
temperature: Number(workspace?.openAiTemp ?? 0.7),
});
const response = await model.call(messages);
return response.content;
} catch (error) {
throw new Error(
`NativeLLM::createChatCompletion failed with: ${error.message}`
);
}
}
async streamChat(chatHistory = [], prompt, workspace = {}, rawHistory = []) {
const model = await this.llamaClient({
temperature: Number(workspace?.openAiTemp ?? 0.7),
});
const messages = await this.compressMessages(
{
systemPrompt: chatPrompt(workspace),
userPrompt: prompt,
chatHistory,
},
rawHistory
);
const responseStream = await model.stream(messages);
return responseStream;
}
async getChatCompletion(messages = null, { temperature = 0.7 }) {
const model = await this.llamaClient({ temperature });
const response = await model.call(messages);
return response.content;
}
async streamGetChatCompletion(messages = null, { temperature = 0.7 }) {
const model = await this.llamaClient({ temperature });
const responseStream = await model.stream(messages);
return responseStream;
}
// Simple wrapper for dynamic embedder & normalize interface for all LLM implementations
async embedTextInput(textInput) {
return await this.embedder.embedTextInput(textInput);
}
async embedChunks(textChunks = []) {
return await this.embedder.embedChunks(textChunks);
}
async compressMessages(promptArgs = {}, rawHistory = []) {
const { messageArrayCompressor } = require("../../helpers/chat");
const messageArray = this.constructPrompt(promptArgs);
const compressedMessages = await messageArrayCompressor(
this,
messageArray,
rawHistory
);
return this.convertToLangchainPrototypes(compressedMessages);
}
convertToLangchainPrototypes(chats = []) {
const langchainChats = [];
for (const chat of chats) {
switch (chat.role) {
case "system":
langchainChats.push(new SystemMessage({ content: chat.content }));
break;
case "user":
langchainChats.push(new HumanMessage({ content: chat.content }));
break;
case "assistant":
langchainChats.push(new AIMessage({ content: chat.content }));
break;
default:
break;
}
}
return langchainChats;
}
}
module.exports = {
NativeLLM,
};

View File

@@ -201,6 +201,36 @@ async function streamEmptyEmbeddingChat({
function handleStreamResponses(response, stream, responseProps) {
const { uuid = uuidv4(), sources = [] } = responseProps;
// If stream is not a regular OpenAI Stream (like if using native model)
// we can just iterate the stream content instead.
if (!stream.hasOwnProperty("data")) {
return new Promise(async (resolve) => {
let fullText = "";
for await (const chunk of stream) {
fullText += chunk.content;
writeResponseChunk(response, {
uuid,
sources: [],
type: "textResponseChunk",
textResponse: chunk.content,
close: false,
error: false,
});
}
writeResponseChunk(response, {
uuid,
sources,
type: "textResponseChunk",
textResponse: "",
close: true,
error: false,
});
resolve(fullText);
});
}
return new Promise((resolve) => {
let fullText = "";
let chunk = "";

View File

@@ -1,4 +1,4 @@
const SUPPORT_CUSTOM_MODELS = ["openai", "localai"];
const SUPPORT_CUSTOM_MODELS = ["openai", "localai", "native-llm"];
async function getCustomModels(provider = "", apiKey = null, basePath = null) {
if (!SUPPORT_CUSTOM_MODELS.includes(provider))
@@ -9,6 +9,8 @@ async function getCustomModels(provider = "", apiKey = null, basePath = null) {
return await openAiModels(apiKey);
case "localai":
return await localAIModels(basePath);
case "native-llm":
return nativeLLMModels();
default:
return { models: [], error: "Invalid provider for custom models" };
}
@@ -53,6 +55,26 @@ async function localAIModels(basePath = null, apiKey = null) {
return { models, error: null };
}
function nativeLLMModels() {
const fs = require("fs");
const path = require("path");
const storageDir = path.resolve(
process.env.STORAGE_DIR
? path.resolve(process.env.STORAGE_DIR, "models", "downloaded")
: path.resolve(__dirname, `../../storage/models/downloaded`)
);
if (!fs.existsSync(storageDir))
return { models: [], error: "No model/downloaded storage folder found." };
const files = fs
.readdirSync(storageDir)
.filter((file) => file.toLowerCase().includes(".gguf"))
.map((file) => {
return { id: file, name: file };
});
return { models: files, error: null };
}
module.exports = {
getCustomModels,
};

View File

@@ -40,6 +40,9 @@ function getLLMProvider() {
case "localai":
const { LocalAiLLM } = require("../AiProviders/localAi");
return new LocalAiLLM(embedder);
case "native":
const { NativeLLM } = require("../AiProviders/native");
return new NativeLLM(embedder);
default:
throw new Error("ENV: No LLM_PROVIDER value found in environment!");
}

View File

@@ -72,6 +72,12 @@ const KEY_MAPPING = {
checks: [],
},
// Native LLM Settings
NativeLLMModelPref: {
envKey: "NATIVE_LLM_MODEL_PREF",
checks: [isDownloadedModel],
},
EmbeddingEngine: {
envKey: "EMBEDDING_ENGINE",
checks: [supportedEmbeddingModel],
@@ -190,9 +196,14 @@ function validLLMExternalBasePath(input = "") {
}
function supportedLLM(input = "") {
return ["openai", "azure", "anthropic", "lmstudio", "localai"].includes(
input
);
return [
"openai",
"azure",
"anthropic",
"lmstudio",
"localai",
"native",
].includes(input);
}
function validAnthropicModel(input = "") {
@@ -245,6 +256,22 @@ function requiresForceMode(_, forceModeEnabled = false) {
return forceModeEnabled === true ? null : "Cannot set this setting.";
}
function isDownloadedModel(input = "") {
const fs = require("fs");
const path = require("path");
const storageDir = path.resolve(
process.env.STORAGE_DIR
? path.resolve(process.env.STORAGE_DIR, "models", "downloaded")
: path.resolve(__dirname, `../../storage/models/downloaded`)
);
if (!fs.existsSync(storageDir)) return false;
const files = fs
.readdirSync(storageDir)
.filter((file) => file.includes(".gguf"));
return files.includes(input);
}
// This will force-update .env variables which, for whatever reason, were not able to be parsed or
// read from an ENV file. As this seems to be a complicating step for many, allowing people to write
// to the process env will at least alleviate that issue. It does not perform comprehensive validity checks or sanity checks

File diff suppressed because it is too large Load Diff