Document Processor v2 (#442)

* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip: update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
Timothy Carambat 2023-12-14 15:14:56 -08:00 committed by GitHub
parent 5f6a013139
commit 719521c307
69 changed files with 3682 additions and 1925 deletions


@ -74,10 +74,10 @@ Some cool features of AnythingLLM
### Technical Overview
This monorepo consists of three main sections:
- `collector`: Python tools that enable you to quickly convert online resources or local documents into LLM useable format.
- `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use.
- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
- `server`: A NodeJS express server to handle all the interactions and do all the vectorDB management and LLM interactions.
- `docker`: Docker instructions and build process + information for building from source.
- `collector`: NodeJS express server that processes and parses documents from the UI (see the request sketch below).
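For illustration, here is a minimal sketch of how a caller could hand a file to the new collector service. It is not code from this commit; it assumes the defaults from `collector/index.js` (port 8888) and that the file already exists in the collector's hot directory.

```
// Illustrative sketch only (not part of this commit): handing a previously
// uploaded file to the collector service added in this PR.
// Assumes the collector from collector/index.js is running locally on port 8888
// and that "my-notes.txt" already exists in the collector's hot directory.
async function requestDocumentProcessing(filename) {
  const response = await fetch("http://localhost:8888/process", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ filename }),
  });
  // The endpoint always answers HTTP 200 and reports failures via `success`/`reason`.
  const { success, reason } = await response.json();
  if (!success) throw new Error(`Processing failed: ${reason}`);
  return true;
}

requestDocumentProcessing("my-notes.txt").catch(console.error);
```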
### Minimum Requirements
> [!TIP]
@ -86,7 +86,6 @@ This monorepo consists of three main sections:
> you will be storing (documents, vectors, models, etc). Minimum 10GB recommended.
- `yarn` and `node` on your machine
- `python` 3.9+ for running scripts in `collector/`.
- access to an LLM running locally or remotely.
*AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb)
@ -112,6 +111,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
@ -141,12 +141,6 @@ To boot the frontend locally (run commands from root of repo):
[Learn about vector caching](./server/storage/vector-cache/VECTOR_CACHE.md)
## Standalone scripts
This repo contains standalone scripts you can run to collect data from a Youtube Channel, Medium articles, local text files, word documents, and the list goes on. This is where you will use the `collector/` part of the repo.
[Go set up and run collector scripts](./collector/README.md)
## Contributing
- create issue
- create PR with branch name format of `<issue number>-<short name>`


@ -1,6 +1,6 @@
# How to deploy a private AnythingLLM instance on AWS
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommended that you set a password once setup is complete.
**Quick Launch (EASY)**
1. Log in to your AWS account
@ -30,12 +30,11 @@ The output of this cloudformation stack will be:
**Requirements**
- An AWS account with billing information.
- AnythingLLM (GUI + document processor) must use a t2.small minimum and a 10GiB SSD hard disk volume
## Please read this notice before submitting issues about your deployment
**Note:**
Your instance will not be available instantly. Depending on the instance size you launched with it can take varying amounts of time to fully boot up.
Your instance will not be available instantly. Depending on the instance size you launched with it can take 5-10 minutes to fully boot up.
If you want to check the instance's progress, navigate to [your deployed EC2 instances](https://us-west-1.console.aws.amazon.com/ec2/home) and connect to your instance via SSH in browser.


@ -89,7 +89,7 @@
"touch /home/ec2-user/anythingllm/.env\n",
"sudo chown ec2-user:ec2-user -R /home/ec2-user/anythingllm\n",
"docker pull mintplexlabs/anythingllm:master\n",
"docker run -d -p 3001:3001 -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
"docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
"echo \"Container ID: $(sudo docker ps --latest --quiet)\"\n",
"export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)\n",
"echo \"Health check: $ONLINE\"\n",


@ -1,8 +1,6 @@
# How to deploy a private AnythingLLM instance on DigitalOcean using Terraform
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.
The output of this Terraform configuration will be:
- 1 DigitalOcean Droplet
@ -12,8 +10,6 @@ The output of this Terraform configuration will be:
- A DigitalOcean account with billing information
- Terraform installed on your local machine
- Follow the instructions in the [official Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) for your operating system.
- `.env` file that is filled out with your settings and set up in the `docker/` folder
## How to deploy on DigitalOcean
Open your terminal and navigate to the `digitalocean/terraform` folder
@ -36,7 +32,7 @@ terraform destroy
## Please read this notice before submitting issues about your deployment
**Note:**
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 10-20 minutes to fully boot up.
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 5-10 minutes to fully boot up.
If you want to check the instance's progress, navigate to [your deployed instances](https://cloud.digitalocean.com/droplets) and connect to your instance via SSH in browser.


@ -12,7 +12,7 @@ mkdir -p /home/anythingllm
touch /home/anythingllm/.env
sudo docker pull mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
echo "Container ID: $(sudo docker ps --latest --quiet)"
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)


@ -1,8 +1,6 @@
# How to deploy a private AnythingLLM instance on GCP
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommended that you set a password once setup is complete.
The output of this cloudformation stack will be:
- 1 GCP VM
@ -11,19 +9,15 @@ The output of this cloudformation stack will be:
**Requirements**
- A GCP account with billing information.
- AnythingLLM (GUI + document processor) must use an n1-standard-1 minimum and a 10GiB SSD hard disk volume
- `.env` file that is filled out with your settings and set up in the `docker/` folder
## How to deploy on GCP
Open your terminal
1. Generate your specific cloudformation document by running `yarn generate:gcp_deployment` from the project root directory.
2. This will create a new file (`gcp_deploy_anything_llm_with_env.yaml`) in the `gcp/deployment` folder.
3. Log in to your GCP account using the following command:
1. Log in to your GCP account using the following command:
```
gcloud auth login
```
4. After successful login, Run the following command to create a deployment using the Deployment Manager CLI:
2. After successful login, run the following command to create a deployment using the Deployment Manager CLI:
```
@ -57,5 +51,4 @@ If you want to check the instances progress, navigate to [your deployed instance
Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image.
Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully.
Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully.


@ -34,7 +34,7 @@ resources:
touch /home/anythingllm/.env
sudo docker pull mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
echo "Container ID: $(sudo docker ps --latest --quiet)"
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)


@ -1,61 +0,0 @@
import fs from 'fs';
import { fileURLToPath } from 'url';
import path, { dirname } from 'path';
import { exit } from 'process';
const __dirname = dirname(fileURLToPath(import.meta.url));
const REPLACEMENT_KEY = '!SUB::USER::CONTENT!'
const envPath = path.resolve(__dirname, `../../../docker/.env`)
const envFileExists = fs.existsSync(envPath);
const chalk = {
redBright: function (text) {
return `\x1b[31m${text}\x1b[0m`
},
cyan: function (text) {
return `\x1b[36m${text}\x1b[0m`
},
greenBright: function (text) {
return `\x1b[32m${text}\x1b[0m`
},
blueBright: function (text) {
return `\x1b[34m${text}\x1b[0m`
}
}
if (!envFileExists) {
console.log(chalk.redBright('[ABORT]'), 'You do not have an .env file in your ./docker/ folder. You need to create it first.');
console.log('You can start by running', chalk.cyan('cp -n ./docker/.env.example ./docker/.env'))
exit(1);
}
// Remove comments
// Remove UID,GID,etc
// Remove empty strings
// Split into array
const settings = fs.readFileSync(envPath, "utf8")
.replace(/^#.*\n?/gm, '')
.replace(/^UID.*\n?/gm, '')
.replace(/^GID.*\n?/gm, '')
.replace(/^CLOUD_BUILD.*\n?/gm, '')
.replace(/^\s*\n/gm, "")
.split('\n')
.filter((i) => !!i);
const formattedSettings = settings.map((i, index) => index === 0 ? i + '\n' : ' ' + i).join('\n');
// Read the existing GCP Deployment Manager template
const templatePath = path.resolve(__dirname, `gcp_deploy_anything_llm.yaml`);
const templateString = fs.readFileSync(templatePath, "utf8");
// Update the metadata section with the UserData content
const updatedTemplateString = templateString.replace(REPLACEMENT_KEY, formattedSettings);
// Save the updated GCP Deployment Manager template
const output = path.resolve(__dirname, `gcp_deploy_anything_llm_with_env.yaml`);
fs.writeFileSync(output, updatedTemplateString, "utf8");
console.log(chalk.greenBright('[SUCCESS]'), 'Deploy AnythingLLM on GCP Deployment Manager using your template document.');
console.log(chalk.greenBright('File Created:'), 'gcp_deploy_anything_llm_with_env.yaml in the output directory.');
console.log(chalk.blueBright('[INFO]'), 'Refer to the GCP Deployment Manager documentation for how to use this file.');
exit();


@ -1 +0,0 @@
GOOGLE_APIS_KEY=

collector/.gitignore (vendored)

@ -1,8 +1,6 @@
outputs/*/*.json
hotdir/*
hotdir/processed/*
hotdir/failed/*
!hotdir/__HOTDIR__.md
!hotdir/processed
!hotdir/failed
yarn-error.log
!yarn.lock
outputs
scripts

collector/.nvmrc (new file)

@ -0,0 +1 @@
v18.13.0


@ -1,62 +0,0 @@
# How to collect data for vectorizing
This process should be run first. This will enable you to collect a ton of data across various sources. Currently the following services are supported:
- [x] YouTube Channels
- [x] Medium
- [x] Substack
- [x] Arbitrary Link
- [x] Gitbook
- [x] Local Files (.txt, .pdf, etc) [See full list](./hotdir/__HOTDIR__.md)
_these resources are under development or require PR_
- Twitter
![Choices](../images/choices.png)
### Requirements
- [ ] Python 3.8+
- [ ] Google Cloud Account (for YouTube channels)
- [ ] `brew install pandoc` [pandoc](https://pandoc.org/installing.html) (for .ODT document processing)
### Setup
This example will be using python3.9, but will work with 3.8+. Tested on macOS. Untested on Windows.
- install virtualenv for python3.8+ first before any other steps. `python3.9 -m pip install virtualenv`
- `cd collector` from root directory
- `python3.9 -m virtualenv v-env`
- `source v-env/bin/activate`
- `pip install -r requirements.txt`
- `cp .env.example .env`
- `python main.py` for interactive collection or `python watch.py` to process local documents.
- Select the option you want and follow the prompts - Done!
- run `deactivate` to get back to regular shell
### Outputs
All JSON file data is cached in the `output/` folder. This is to prevent redundant API calls to services which may have rate limits or quota caps. Clearing out the `output/` folder will execute the script as if there was no cache.
As files are processed you will see data being written to both the `collector/outputs` folder as well as the `server/documents` folder. Later in this process, once you boot up the server you will then bulk vectorize this content from a simple UI!
If collection fails at any point in the process it will pick up where it last bailed out so you are not reusing credits.
### Running the document processing API locally
From the `collector` directory with the `v-env` active run `flask run --host '0.0.0.0' --port 8888`.
Now uploads from the frontend will be processed as if you ran the `watch.py` script manually.
**Docker**: If you run this application via docker the API is already started for you and no additional action is needed.
### How to get a Google Cloud API Key (YouTube data collection only)
**required to fetch YouTube transcripts and data**
- Have a google account
- [Visit the GCP Cloud Console](https://console.cloud.google.com/welcome)
- Click on dropdown in top right > Create new project. Name it whatever you like
- ![GCP Project Bar](../images/gcp-project-bar.png)
- [Enable YouTube Data APIV3](https://console.cloud.google.com/apis/library/youtube.googleapis.com)
- Once enabled generate a Credential key for this API
- Paste your key after `GOOGLE_APIS_KEY=` in your `collector/.env` file.
### Using the Twitter API
**required to get data from Twitter with tweepy**
- Go to https://developer.twitter.com/en/portal/dashboard with your twitter account
- Create a new Project App
- Get your 4 keys and place them in your `collector/.env` file
* TW_CONSUMER_KEY
* TW_CONSUMER_SECRET
* TW_ACCESS_TOKEN
* TW_ACCESS_TOKEN_SECRET
populate the .env with the values


@ -1,32 +0,0 @@
import os
from flask import Flask, json, request
from scripts.watch.process_single import process_single
from scripts.watch.filetypes import ACCEPTED_MIMES
from scripts.link import process_single_link
api = Flask(__name__)
WATCH_DIRECTORY = "hotdir"
@api.route('/process', methods=['POST'])
def process_file():
content = request.json
target_filename = os.path.normpath(content.get('filename')).lstrip(os.pardir + os.sep)
print(f"Processing {target_filename}")
success, reason = process_single(WATCH_DIRECTORY, target_filename)
return json.dumps({'filename': target_filename, 'success': success, 'reason': reason})
@api.route('/process-link', methods=['POST'])
async def process_link():
content = request.json
url = content.get('link')
print(f"Processing {url}")
success, reason = await process_single_link(url)
return json.dumps({'url': url, 'success': success, 'reason': reason})
@api.route('/accepts', methods=['GET'])
def get_accepted_filetypes():
return json.dumps(ACCEPTED_MIMES)
@api.route('/', methods=['GET'])
def root():
return "<p>Use POST /process with filename key in JSON body in order to process a file. File by that name must exist in hotdir already.</p>"


@ -1,17 +1,3 @@
### What is the "Hot directory"
This is the location where you can dump all supported file types and have them automatically converted and prepared to be digested by the vectorizing service and selected from the AnythingLLM frontend.
Files dropped in here will only be processed when you are running `python watch.py` from the `collector` directory.
Once converted, the original file will be moved to the `hotdir/processed` folder so that the original document can still be linked to when it is referenced as a source document during chatting.
**Supported File types**
- `.md`
- `.txt`
- `.pdf`
__requires more development__
- `.png .jpg etc`
- `.mp3`
- `.mp4`
This is a pre-set file location that documents will be written to when uploaded by AnythingLLM. There is really no need to touch it.

collector/index.js (new file)

@ -0,0 +1,78 @@
process.env.NODE_ENV === "development"
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` })
: require("dotenv").config();
const express = require("express");
const bodyParser = require("body-parser");
const cors = require("cors");
const path = require("path");
const { ACCEPTED_MIMES } = require("./utils/constants");
const { reqBody } = require("./utils/http");
const { processSingleFile } = require("./processSingleFile");
const { processLink } = require("./processLink");
const app = express();
app.use(cors({ origin: true }));
app.use(
bodyParser.text(),
bodyParser.json(),
bodyParser.urlencoded({
extended: true,
})
);
app.post("/process", async function (request, response) {
const { filename } = reqBody(request);
try {
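// Guard against path traversal: strip leading "../" or "..\" segments so the lookup stays inside the hot directory.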
const targetFilename = path
.normalize(filename)
.replace(/^(\.\.(\/|\\|$))+/, "");
const { success, reason } = await processSingleFile(targetFilename);
response.status(200).json({ filename: targetFilename, success, reason });
} catch (e) {
console.error(e);
response.status(200).json({
filename: filename,
success: false,
reason: "A processing error occurred.",
});
}
return;
});
app.post("/process-link", async function (request, response) {
const { link } = reqBody(request);
try {
const { success, reason } = await processLink(link);
response.status(200).json({ url: link, success, reason });
} catch (e) {
console.error(e);
response.status(200).json({
url: link,
success: false,
reason: "A processing error occurred.",
});
}
return;
});
app.get("/accepts", function (_, response) {
response.status(200).json(ACCEPTED_MIMES);
});
app.all("*", function (_, response) {
response.sendStatus(200);
});
app
.listen(8888, async () => {
console.log(`Document processor app listening on port 8888`);
})
.on("error", function (_) {
process.once("SIGUSR2", function () {
process.kill(process.pid, "SIGUSR2");
});
process.on("SIGINT", function () {
process.kill(process.pid, "SIGINT");
});
});
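Below is a brief usage sketch (not part of this commit) exercising the endpoints defined above; it assumes the collector is running locally on its default port 8888 and Node 18+ for the built-in `fetch`.

```
// Usage sketch only: exercising the collector endpoints defined above.
// Assumes the collector is running locally on port 8888 (Node 18+ for global fetch).
async function demo() {
  // List the MIME types the collector accepts for uploads.
  const accepts = await fetch("http://localhost:8888/accepts").then((res) => res.json());
  console.log("Accepted MIME types:", accepts);

  // Ask the collector to scrape a web page and convert it for embedding.
  const result = await fetch("http://localhost:8888/process-link", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ link: "https://example.com" }),
  }).then((res) => res.json());
  console.log(result); // { url, success, reason }
}

demo().catch(console.error);
```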


@ -1,84 +0,0 @@
import os
from InquirerPy import inquirer
from scripts.youtube import youtube
from scripts.link import link, links, crawler
from scripts.substack import substack
from scripts.medium import medium
from scripts.gitbook import gitbook
from scripts.sitemap import sitemap
from scripts.twitter import twitter
def main():
if os.name == 'nt':
methods = {
'1': 'YouTube Channel',
'2': 'Article or Blog Link',
'3': 'Substack',
'4': 'Medium',
'5': 'Gitbook',
'6': 'Twitter',
'7': 'Sitemap',
}
print("There are options for data collection to make this easier for you.\nType the number of the method you wish to execute.")
print("1. YouTube Channel\n2. Article or Blog Link (Single)\n3. Substack\n4. Medium\n\n[In development]:\nTwitter\n\n")
selection = input("Your selection: ")
method = methods.get(str(selection))
else:
method = inquirer.select(
message="What kind of data would you like to add to convert into long-term memory?",
choices=[
{"name": "YouTube Channel", "value": "YouTube Channel"},
{"name": "Substack", "value": "Substack"},
{"name": "Medium", "value": "Medium"},
{"name": "Article or Blog Link(s)", "value": "Article or Blog Link(s)"},
{"name": "Gitbook", "value": "Gitbook"},
{"name": "Twitter", "value": "Twitter"},
{"name": "Sitemap", "value": "Sitemap"},
{"name": "Abort", "value": "Abort"},
],
).execute()
if 'Article or Blog Link' in method:
method = inquirer.select(
message="Do you want to scrape a single article/blog/url or many at once?",
choices=[
{"name": "Single URL", "value": "Single URL"},
{"name": "Multiple URLs", "value": "Multiple URLs"},
{"name": "URL Crawler", "value": "URL Crawler"},
{"name": "Abort", "value": "Abort"},
],
).execute()
if method == 'Single URL':
link()
exit(0)
if method == 'Multiple URLs':
links()
exit(0)
if method == 'URL Crawler':
crawler()
exit(0)
if method == 'Abort': exit(0)
if method == 'YouTube Channel':
youtube()
exit(0)
if method == 'Substack':
substack()
exit(0)
if method == 'Medium':
medium()
exit(0)
if method == 'Gitbook':
gitbook()
exit(0)
if method == 'Sitemap':
sitemap()
exit(0)
if method == 'Twitter':
twitter()
exit(0)
print("Selection was not valid.")
exit(1)
if __name__ == "__main__":
main()

collector/nodemon.json (new file)

@ -0,0 +1,3 @@
{
"events": {}
}

collector/package.json (new file)

@ -0,0 +1,42 @@
{
"name": "anything-llm-document-collector",
"version": "0.2.0",
"description": "Document collector server endpoints",
"main": "index.js",
"author": "Timothy Carambat (Mintplex Labs)",
"license": "MIT",
"private": false,
"engines": {
"node": ">=18.12.1"
},
"scripts": {
"dev": "NODE_ENV=development nodemon --trace-warnings index.js",
"start": "NODE_ENV=production node index.js",
"lint": "yarn prettier --write ./processSingleFile ./processLink ./utils index.js"
},
"dependencies": {
"@googleapis/youtube": "^9.0.0",
"bcrypt": "^5.1.0",
"body-parser": "^1.20.2",
"cors": "^2.8.5",
"dotenv": "^16.0.3",
"express": "^4.18.2",
"extract-zip": "^2.0.1",
"js-tiktoken": "^1.0.8",
"langchain": "0.0.201",
"mammoth": "^1.6.0",
"mbox-parser": "^1.0.1",
"mime": "^3.0.0",
"moment": "^2.29.4",
"multer": "^1.4.5-lts.1",
"officeparser": "^4.0.5",
"pdf-parse": "^1.1.1",
"puppeteer": "^21.6.1",
"slugify": "^1.6.6",
"uuid": "^9.0.0"
},
"devDependencies": {
"nodemon": "^2.0.22",
"prettier": "^2.4.1"
}
}


@ -0,0 +1,72 @@
const { v4 } = require("uuid");
const {
PuppeteerWebBaseLoader,
} = require("langchain/document_loaders/web/puppeteer");
const { writeToServerDocuments } = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function scrapeGenericUrl(link) {
console.log(`-- Working URL ${link} --`);
const content = await getPageContent(link);
if (!content || !content.length) {
console.error(`Resulting URL content was empty at ${link}.`);
return { success: false, reason: `No URL content found at ${link}.` };
}
const url = new URL(link);
const filename = (url.host + "-" + url.pathname).replace(".", "_");
const data = {
id: v4(),
url: "file://" + slugify(filename) + ".html",
title: slugify(filename) + ".html",
docAuthor: "no author found",
description: "No description found.",
docSource: "URL link uploaded by the user.",
chunkSource: slugify(link) + ".html",
published: new Date().toLocaleString(),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `url-${slugify(filename)}-${data.id}`);
console.log(`[SUCCESS]: URL ${link} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
async function getPageContent(link) {
try {
let pageContents = [];
const loader = new PuppeteerWebBaseLoader(link, {
launchOptions: {
headless: "new",
},
gotoOptions: {
waitUntil: "domcontentloaded",
},
async evaluate(page, browser) {
const result = await page.evaluate(() => document.body.innerText);
await browser.close();
return result;
},
});
const docs = await loader.load();
for (const doc of docs) {
pageContents.push(doc.pageContent);
}
return pageContents.join(" ");
} catch (error) {
console.error("getPageContent failed!", error);
}
return null;
}
module.exports = {
scrapeGenericUrl,
};


@ -0,0 +1,11 @@
const { validURL } = require("../utils/url");
const { scrapeGenericUrl } = require("./convert/generic");
async function processLink(link) {
if (!validURL(link)) return { success: false, reason: "Not a valid URL." };
return await scrapeGenericUrl(link);
}
module.exports = {
processLink,
};


@ -0,0 +1,51 @@
const { v4 } = require("uuid");
const { DocxLoader } = require("langchain/document_loaders/fs/docx");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function asDocX({ fullFilePath = "", filename = "" }) {
const loader = new DocxLoader(fullFilePath);
console.log(`-- Working ${filename} --`);
let pageContent = [];
const docs = await loader.load();
for (const doc of docs) {
console.log(doc.metadata);
console.log(`-- Parsing content from docx page --`);
if (!doc.pageContent.length) continue;
pageContent.push(doc.pageContent);
}
if (!pageContent.length) {
console.error(`Resulting text content was empty for ${filename}.`);
trashFile(fullFilePath);
return { success: false, reason: `No text content found in ${filename}.` };
}
const content = pageContent.join("");
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: filename,
docAuthor: "no author found",
description: "No description found.",
docSource: "pdf file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
trashFile(fullFilePath);
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
module.exports = asDocX;


@ -0,0 +1,65 @@
const { v4 } = require("uuid");
const fs = require("fs");
const { mboxParser } = require("mbox-parser");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function asMbox({ fullFilePath = "", filename = "" }) {
console.log(`-- Working ${filename} --`);
const mails = await mboxParser(fs.createReadStream(fullFilePath))
.then((mails) => mails)
.catch((error) => {
console.log(`Could not parse mail items`, error);
return [];
});
if (!mails.length) {
console.error(`Resulting mail items was empty for ${filename}.`);
trashFile(fullFilePath);
return { success: false, reason: `No mail items found in ${filename}.` };
}
let item = 1;
for (const mail of mails) {
if (!mail.hasOwnProperty("text")) continue;
const content = mail.text;
if (!content) continue;
console.log(
`-- Working on message "${mail.subject || "Unknown subject"}" --`
);
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: mail?.subject
? slugify(mail?.subject?.replace(".", "")) + ".mbox"
: `msg_${item}-${filename}`,
docAuthor: mail?.from?.text,
description: "No description found.",
docSource: "Mbox message file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
item++;
writeToServerDocuments(data, `${slugify(filename)}-${data.id}-msg-${item}`);
}
trashFile(fullFilePath);
console.log(
`[SUCCESS]: ${filename} messages converted & ready for embedding.\n`
);
return { success: true, reason: null };
}
module.exports = asMbox;


@ -0,0 +1,46 @@
const { v4 } = require("uuid");
const officeParser = require("officeparser");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function asOfficeMime({ fullFilePath = "", filename = "" }) {
console.log(`-- Working ${filename} --`);
let content = "";
try {
content = await officeParser.parseOfficeAsync(fullFilePath);
} catch (error) {
console.error(`Could not parse office or office-like file`, error);
}
if (!content.length) {
console.error(`Resulting text content was empty for ${filename}.`);
trashFile(fullFilePath);
return { success: false, reason: `No text content found in ${filename}.` };
}
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: filename,
docAuthor: "no author found",
description: "No description found.",
docSource: "Office file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
trashFile(fullFilePath);
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
module.exports = asOfficeMime;


@ -0,0 +1,56 @@
const { v4 } = require("uuid");
const { PDFLoader } = require("langchain/document_loaders/fs/pdf");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function asPDF({ fullFilePath = "", filename = "" }) {
const pdfLoader = new PDFLoader(fullFilePath, {
splitPages: true,
});
console.log(`-- Working ${filename} --`);
const pageContent = [];
const docs = await pdfLoader.load();
for (const doc of docs) {
console.log(
`-- Parsing content from pg ${
doc.metadata?.loc?.pageNumber || "unknown"
} --`
);
if (!doc.pageContent.length) continue;
pageContent.push(doc.pageContent);
}
if (!pageContent.length) {
console.error(`Resulting text content was empty for ${filename}.`);
trashFile(fullFilePath);
return { success: false, reason: `No text content found in ${filename}.` };
}
const content = pageContent.join("");
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: docs[0]?.metadata?.pdf?.info?.Title || filename,
docAuthor: docs[0]?.metadata?.pdf?.info?.Creator || "no author found",
description: "No description found.",
docSource: "pdf file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
trashFile(fullFilePath);
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
module.exports = asPDF;


@ -0,0 +1,46 @@
const { v4 } = require("uuid");
const fs = require("fs");
const { tokenizeString } = require("../../utils/tokenizer");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { default: slugify } = require("slugify");
async function asTxt({ fullFilePath = "", filename = "" }) {
let content = "";
try {
content = fs.readFileSync(fullFilePath, "utf8");
} catch (err) {
console.error("Could not read file!", err);
}
if (!content?.length) {
console.error(`Resulting text content was empty for ${filename}.`);
trashFile(fullFilePath);
return { success: false, reason: `No text content found in ${filename}.` };
}
console.log(`-- Working ${filename} --`);
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: filename,
docAuthor: "Unknown", // TODO: Find a better author
description: "Unknown", // TODO: Find a better description
docSource: "a text file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
trashFile(fullFilePath);
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
module.exports = asTxt;


@ -0,0 +1,51 @@
const path = require("path");
const fs = require("fs");
const {
WATCH_DIRECTORY,
SUPPORTED_FILETYPE_CONVERTERS,
} = require("../utils/constants");
const { trashFile } = require("../utils/files");
const RESERVED_FILES = ["__HOTDIR__.md"];
async function processSingleFile(targetFilename) {
const fullFilePath = path.resolve(WATCH_DIRECTORY, targetFilename);
if (RESERVED_FILES.includes(targetFilename))
return {
success: false,
reason: "Filename is a reserved filename and cannot be processed.",
};
if (!fs.existsSync(fullFilePath))
return {
success: false,
reason: "File does not exist in upload directory.",
};
const fileExtension = path.extname(fullFilePath).toLowerCase();
if (!fileExtension) {
return {
success: false,
reason: `No file extension found. This file cannot be processed.`,
};
}
if (!Object.keys(SUPPORTED_FILETYPE_CONVERTERS).includes(fileExtension)) {
trashFile(fullFilePath);
return {
success: false,
reason: `File extension ${fileExtension} not supported for parsing.`,
};
}
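// Dynamically require the converter module registered for this file extension in SUPPORTED_FILETYPE_CONVERTERS.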
const FileTypeProcessor = require(SUPPORTED_FILETYPE_CONVERTERS[
fileExtension
]);
return await FileTypeProcessor({
fullFilePath,
filename: targetFilename,
});
}
module.exports = {
processSingleFile,
};
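`processSingleFile` resolves converters through `SUPPORTED_FILETYPE_CONVERTERS`, which is defined in `collector/utils/constants.js` and not shown in this diff. The sketch below is a hypothetical reconstruction of that mapping, inferred from the converter modules added in this PR; the real file may differ and also exports `ACCEPTED_MIMES` for the `/accepts` endpoint.

```
// Hypothetical sketch of collector/utils/constants.js -- that file is not shown in
// this excerpt, so the exact contents are inferred from how processSingleFile/index.js
// uses it. Keys are lower-cased file extensions; values are paths to the converter
// modules added in this PR, resolved relative to the requiring file.
const path = require("path");

const WATCH_DIRECTORY = path.resolve(__dirname, "../hotdir");

const SUPPORTED_FILETYPE_CONVERTERS = {
  ".txt": "./convert/asTxt.js",
  ".md": "./convert/asTxt.js",
  ".pdf": "./convert/asPDF.js",
  ".docx": "./convert/asDocx.js",
  ".mbox": "./convert/asMbox.js",
  ".pptx": "./convert/asOfficeMime.js", // pptx support added in this PR
};

module.exports = { WATCH_DIRECTORY, SUPPORTED_FILETYPE_CONVERTERS };
```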


@ -1,117 +0,0 @@
about-time==4.2.1
aiohttp==3.8.4
aiosignal==1.3.1
alive-progress==3.1.2
anyio==3.7.0
appdirs==1.4.4
argilla==1.8.0
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
backoff==2.2.1
beautifulsoup4==4.12.2
blinker==1.6.2
bs4==0.0.1
certifi==2023.5.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
click==8.1.3
commonmark==0.9.1
cryptography==41.0.1
cssselect==1.2.0
dataclasses-json==0.5.7
Deprecated==1.2.14
docx2txt==0.8
et-xmlfile==1.1.0
exceptiongroup==1.1.1
fake-useragent==1.2.1
Flask==2.3.2
frozenlist==1.3.3
grapheme==0.6.0
greenlet==2.0.2
gunicorn==20.1.0
h11==0.14.0
httpcore==0.16.3
httpx==0.23.3
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
inquirerpy==0.3.4
install==1.3.5
itsdangerous==2.1.2
Jinja2==3.1.2
joblib==1.2.0
langchain==0.0.189
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
monotonic==1.6
msg-parser==1.2.0
multidict==6.0.4
mypy-extensions==1.0.0
nltk==3.8.1
numexpr==2.8.4
numpy==1.23.5
oauthlib==3.2.2
olefile==0.46
openapi-schema-pydantic==1.2.4
openpyxl==3.1.2
packaging==23.1
pandas==1.5.3
parse==1.19.0
pdfminer.six==20221105
pfzy==0.3.4
Pillow==9.5.0
prompt-toolkit==3.0.38
pycparser==2.21
pydantic==1.10.8
pyee==8.2.2
Pygments==2.15.1
PyMuPDF==1.22.5
pypandoc==1.4
pyppeteer==1.0.2
pyquery==2.0.0
python-dateutil==2.8.2
python-docx==0.8.11
python-dotenv==0.21.1
python-magic==0.4.27
python-pptx==0.6.21
python-slugify==8.0.1
pytz==2023.3
PyYAML==6.0
regex==2023.5.5
requests==2.31.0
requests-html==0.10.0
requests-oauthlib==1.3.1
rfc3986==1.5.0
rich==13.0.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
SQLAlchemy==2.0.15
tabulate==0.9.0
tenacity==8.2.2
text-unidecode==1.3
tiktoken==0.4.0
tqdm==4.65.0
tweepy==4.14.0
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.6.3
Unidecode==1.3.6
unstructured==0.7.1
urllib3==1.26.16
uuid==1.30
w3lib==2.1.1
wcwidth==0.2.6
websockets==10.4
Werkzeug==2.3.6
wrapt==1.14.1
xlrd==2.0.1
XlsxWriter==3.1.2
yarl==1.9.2
youtube-transcript-api==0.6.0
zipp==3.15.0


@ -1,44 +0,0 @@
import os, json
from langchain.document_loaders import GitbookLoader
from urllib.parse import urlparse
from datetime import datetime
from alive_progress import alive_it
from .utils import tokenize
from uuid import uuid4
def gitbook():
url = input("Enter the URL of the GitBook you want to collect: ")
if(url == ''):
print("Not a gitbook URL")
exit(1)
primary_source = urlparse(url)
output_path = f"./outputs/gitbook-logs/{primary_source.netloc}"
transaction_output_dir = f"../server/storage/documents/gitbook-{primary_source.netloc}"
if os.path.exists(output_path) == False:os.makedirs(output_path)
if os.path.exists(transaction_output_dir) == False: os.makedirs(transaction_output_dir)
loader = GitbookLoader(url, load_all_paths= primary_source.path in ['','/'])
for doc in alive_it(loader.load()):
metadata = doc.metadata
content = doc.page_content
source = urlparse(metadata.get('source'))
name = 'home' if source.path in ['','/'] else source.path.replace('/','_')
output_filename = f"doc-{name}.json"
transaction_output_filename = f"doc-{name}.json"
data = {
'id': str(uuid4()),
'url': metadata.get('source'),
'title': metadata.get('title'),
'description': metadata.get('title'),
'published': datetime.today().strftime('%Y-%m-%d %H:%M:%S'),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
json.dump(data, file, ensure_ascii=True, indent=4)
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
json.dump(data, file, ensure_ascii=True, indent=4)


@ -1,222 +0,0 @@
import os, json, tempfile
from urllib.parse import urlparse
from requests_html import HTMLSession
from langchain.document_loaders import UnstructuredHTMLLoader
from .link_utils import append_meta, AsyncHTMLSessionFixed
from .utils import tokenize, ada_v2_cost
import requests
from bs4 import BeautifulSoup
# Example Channel URL https://tim.blog/2022/08/09/nft-insider-trading-policy/
def link():
totalTokens = 0
print("[NOTICE]: The first time running this process it will download supporting libraries.\n\n")
fqdn_link = input("Paste in the URL of an online article or blog: ")
if(len(fqdn_link) == 0):
print("Invalid URL!")
exit(1)
session = HTMLSession()
req = session.get(fqdn_link)
if(req.ok == False):
print("Could not reach this url!")
exit(1)
req.html.render()
full_text = None
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
tmp.write(req.html.html)
tmp.seek(0)
loader = UnstructuredHTMLLoader(tmp.name)
data = loader.load()[0]
full_text = data.page_content
tmp.close()
link = append_meta(req, full_text, True)
if(len(full_text) > 0):
totalTokens += len(tokenize(full_text))
source = urlparse(req.url)
output_filename = f"website-{source.netloc}-{source.path.replace('/','_')}.json"
output_path = f"./outputs/website-logs"
transaction_output_filename = f"website-{source.path.replace('/','_')}.json"
transaction_output_dir = f"../server/storage/documents/custom-documents"
if os.path.isdir(output_path) == False:
os.makedirs(output_path)
if os.path.isdir(transaction_output_dir) == False:
os.makedirs(transaction_output_dir)
full_text = append_meta(req, full_text)
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
json.dump(link, file, ensure_ascii=True, indent=4)
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
json.dump(link, file, ensure_ascii=True, indent=4)
else:
print("Could not parse any meaningful data from this link or url.")
exit(1)
print(f"\n\n[Success]: article or link content fetched!")
print(f"////////////////////////////")
print(f"Your estimated cost to embed this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokens)} using {totalTokens} tokens.")
print(f"////////////////////////////")
exit(0)
async def process_single_link(url):
session = None
try:
print(f"Working on {url}...")
session = AsyncHTMLSessionFixed()
req = await session.get(url)
await req.html.arender()
await session.close()
if not req.ok:
return False, "Could not reach this URL."
full_text = None
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
tmp.write(req.html.html)
tmp.seek(0)
loader = UnstructuredHTMLLoader(tmp.name)
data = loader.load()[0]
full_text = data.page_content
tmp.close()
if full_text:
link_meta = append_meta(req, full_text, True)
source = urlparse(req.url)
transaction_output_dir = "../server/storage/documents/custom-documents"
transaction_output_filename = f"website-{source.netloc}-{source.path.replace('/', '_')}.json"
if not os.path.isdir(transaction_output_dir):
os.makedirs(transaction_output_dir)
file_path = os.path.join(transaction_output_dir, transaction_output_filename)
with open(file_path, 'w', encoding='utf-8') as file:
json.dump(link_meta, file, ensure_ascii=False, indent=4)
return True, "Content fetched and saved."
else:
return False, "Could not parse any meaningful data from this URL."
except Exception as e:
if session is not None:
session.close() # Kill hanging session.
return False, str(e)
def crawler():
prompt = "Paste in root URI of the pages of interest: "
new_link = input(prompt)
filter_value = input("Add a filter value for the url to ensure links don't wander too far. eg: 'my-domain.com': ")
#extract this from the uri provided
root_site = urlparse(new_link).scheme + "://" + urlparse(new_link).hostname
links = []
urls = new_link
links.append(new_link)
grab = requests.get(urls)
soup = BeautifulSoup(grab.text, 'html.parser')
# traverse paragraphs from soup
for link in soup.find_all("a"):
data = link.get('href')
if (data is not None):
fullpath = data if data[0] != '/' else f"{root_site}{data}"
try:
destination = urlparse(fullpath).scheme + "://" + urlparse(fullpath).hostname + (urlparse(fullpath).path if urlparse(fullpath).path is not None else '')
if filter_value in destination:
data = destination.strip()
print (data)
links.append(data)
else:
print (data + " does not apply for linking...")
except:
print (data + " does not apply for linking...")
#parse the links found
parse_links(links)
def links():
links = []
prompt = "Paste in the URL of an online article or blog: "
done = False
while(done == False):
new_link = input(prompt)
if(len(new_link) == 0):
done = True
links = [*set(links)]
continue
links.append(new_link)
prompt = f"\n{len(links)} links in queue. Submit an empty value when done pasting in links to execute collection.\nPaste in the next URL of an online article or blog: "
if(len(links) == 0):
print("No valid links provided!")
exit(1)
parse_links(links)
# parse links from array
def parse_links(links):
totalTokens = 0
for link in links:
print(f"Working on {link}...")
session = HTMLSession()
req = session.get(link, timeout=20)
if not req.ok:
print(f"Could not reach {link} - skipping!")
continue
req.html.render(timeout=10)
full_text = None
with tempfile.NamedTemporaryFile(mode="w") as tmp:
tmp.write(req.html.html)
tmp.seek(0)
loader = UnstructuredHTMLLoader(tmp.name)
data = loader.load()[0]
full_text = data.page_content
tmp.close()
link = append_meta(req, full_text, True)
if len(full_text) > 0:
source = urlparse(req.url)
output_filename = f"website-{source.netloc}-{source.path.replace('/','_')}.json"
output_path = f"./outputs/website-logs"
transaction_output_filename = f"website-{source.path.replace('/','_')}.json"
transaction_output_dir = f"../server/storage/documents/custom-documents"
if not os.path.isdir(output_path):
os.makedirs(output_path)
if not os.path.isdir(transaction_output_dir):
os.makedirs(transaction_output_dir)
full_text = append_meta(req, full_text)
tokenCount = len(tokenize(full_text))
totalTokens += tokenCount
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
json.dump(link, file, ensure_ascii=True, indent=4)
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
json.dump(link, file, ensure_ascii=True, indent=4)
req.session.close()
else:
print(f"Could not parse any meaningful data from {link}.")
continue
print(f"\n\n[Success]: {len(links)} article or link contents fetched!")
print(f"////////////////////////////")
print(f"Your estimated cost to embed this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokens)} using {totalTokens} tokens.")
print(f"////////////////////////////")


@ -1,45 +0,0 @@
import json, pyppeteer
from datetime import datetime
from .watch.utils import guid
from dotenv import load_dotenv
from .watch.utils import guid
from .utils import tokenize
from requests_html import AsyncHTMLSession
load_dotenv()
def normalize_url(url):
if(url.endswith('.web')):
return url
return f"{url}.web"
def append_meta(request, text, metadata_only = False):
meta = {
'id': guid(),
'url': normalize_url(request.url),
'title': request.html.find('title', first=True).text if len(request.html.find('title')) != 0 else '',
'docAuthor': 'N/A',
'description': request.html.find('meta[name="description"]', first=True).attrs.get('content') if request.html.find('meta[name="description"]', first=True) != None else '',
'docSource': 'web page',
'chunkSource': request.url,
'published':request.html.find('meta[property="article:published_time"]', first=True).attrs.get('content') if request.html.find('meta[property="article:published_time"]', first=True) != None else datetime.today().strftime('%Y-%m-%d %H:%M:%S'),
'wordCount': len(text.split(' ')),
'pageContent': text,
'token_count_estimate':len(tokenize(text)),
}
return "Article JSON Metadata:\n"+json.dumps(meta)+"\n\n\nText Content:\n" + text if metadata_only == False else meta
class AsyncHTMLSessionFixed(AsyncHTMLSession):
"""
pip3 install websockets==6.0 --force-reinstall
"""
def __init__(self, **kwargs):
super(AsyncHTMLSessionFixed, self).__init__(**kwargs)
self.__browser_args = kwargs.get("browser_args", ["--no-sandbox"])
@property
async def browser(self):
if not hasattr(self, "_browser"):
self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, handleSIGINT=False, handleSIGTERM=False, handleSIGHUP=False, args=self.__browser_args)
return self._browser


@ -1,71 +0,0 @@
import os, json
from urllib.parse import urlparse
from .utils import tokenize, ada_v2_cost
from .medium_utils import get_username, fetch_recent_publications, append_meta
from alive_progress import alive_it
# Example medium URL: https://medium.com/@yujiangtham or https://davidall.medium.com
def medium():
print("[NOTICE]: This method will only get the 10 most recent publishings.")
author_url = input("Enter the medium URL of the author you want to collect: ")
if(author_url == ''):
print("Not a valid medium.com/@author URL")
exit(1)
handle = get_username(author_url)
if(handle is None):
print("This does not appear to be a valid medium.com/@author URL")
exit(1)
publications = fetch_recent_publications(handle)
if(len(publications)==0):
print("There are no public or free publications by this creator - nothing to collect.")
exit(1)
totalTokenCount = 0
transaction_output_dir = f"../server/storage/documents/medium-{handle}"
if os.path.isdir(transaction_output_dir) == False:
os.makedirs(transaction_output_dir)
for publication in alive_it(publications):
pub_file_path = transaction_output_dir + f"/publication-{publication.get('id')}.json"
if os.path.exists(pub_file_path) == True: continue
full_text = publication.get('pageContent')
if full_text is None or len(full_text) == 0: continue
full_text = append_meta(publication, full_text)
item = {
'id': publication.get('id'),
'url': publication.get('url'),
'title': publication.get('title'),
'published': publication.get('published'),
'wordCount': len(full_text.split(' ')),
'pageContent': full_text,
}
tokenCount = len(tokenize(full_text))
item['token_count_estimate'] = tokenCount
totalTokenCount += tokenCount
with open(pub_file_path, 'w', encoding='utf-8') as file:
json.dump(item, file, ensure_ascii=True, indent=4)
print(f"[Success]: {len(publications)} scraped and fetched!")
print(f"\n\n////////////////////////////")
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
print(f"////////////////////////////\n\n")
exit(0)


@ -1,71 +0,0 @@
import os, json, requests, re
from bs4 import BeautifulSoup
def get_username(author_url):
if '@' in author_url:
pattern = r"medium\.com/@([\w-]+)"
match = re.search(pattern, author_url)
return match.group(1) if match else None
else:
# Given subdomain
pattern = r"([\w-]+).medium\.com"
match = re.search(pattern, author_url)
return match.group(1) if match else None
def get_docid(medium_docpath):
pattern = r"medium\.com/p/([\w-]+)"
match = re.search(pattern, medium_docpath)
return match.group(1) if match else None
def fetch_recent_publications(handle):
rss_link = f"https://medium.com/feed/@{handle}"
response = requests.get(rss_link)
if(response.ok == False):
print(f"Could not fetch RSS results for author.")
return []
xml = response.content
soup = BeautifulSoup(xml, 'xml')
items = soup.find_all('item')
publications = []
if os.path.isdir("./outputs/medium-logs") == False:
os.makedirs("./outputs/medium-logs")
file_path = f"./outputs/medium-logs/medium-{handle}.json"
if os.path.exists(file_path):
with open(file_path, "r") as file:
print(f"Returning cached data for Author {handle}. If you do not wish to use stored data then delete the file for this author to allow refetching.")
return json.load(file)
for item in items:
tags = []
for tag in item.find_all('category'): tags.append(tag.text)
content = BeautifulSoup(item.find('content:encoded').text, 'html.parser')
data = {
'id': get_docid(item.find('guid').text),
'title': item.find('title').text,
'url': item.find('link').text.split('?')[0],
'tags': ','.join(tags),
'published': item.find('pubDate').text,
'pageContent': content.get_text()
}
publications.append(data)
with open(file_path, 'w+', encoding='utf-8') as json_file:
json.dump(publications, json_file, ensure_ascii=True, indent=2)
print(f"{len(publications)} articles found for author medium.com/@{handle}. Saved to medium-logs/medium-{handle}.json")
return publications
def append_meta(publication, text):
meta = {
'url': publication.get('url'),
'tags': publication.get('tags'),
'title': publication.get('title'),
'createdAt': publication.get('published'),
'wordCount': len(text.split(' '))
}
return "Article Metadata:\n"+json.dumps(meta)+"\n\nArticle Content:\n" + text


@ -1,39 +0,0 @@
import requests
import xml.etree.ElementTree as ET
from scripts.link import parse_links
import re
def parse_sitemap(url):
response = requests.get(url)
root = ET.fromstring(response.content)
urls = []
for element in root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
for loc in element.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
if not has_extension_to_ignore(loc.text):
urls.append(loc.text)
else:
print(f"Skipping filetype: {loc.text}")
return urls
# Example sitemap URL https://www.nerdwallet.com/blog/wp-sitemap-news-articles-1.xml
def sitemap():
sitemap_url = input("Enter the URL of the sitemap: ")
if(len(sitemap_url) == 0):
print("No valid sitemap provided!")
exit(1)
url_array = parse_sitemap(sitemap_url)
#parse links from array
parse_links(url_array)
def has_extension_to_ignore(string):
image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.pdf']
pattern = r'\b(' + '|'.join(re.escape(ext) for ext in image_extensions) + r')\b'
match = re.search(pattern, string, re.IGNORECASE)
return match is not None


@ -1,78 +0,0 @@
import os, json
from urllib.parse import urlparse
from .utils import tokenize, ada_v2_cost
from .substack_utils import fetch_all_publications, only_valid_publications, get_content, append_meta
from alive_progress import alive_it
# Example substack URL: https://swyx.substack.com/
def substack():
author_url = input("Enter the substack URL of the author you want to collect: ")
if(author_url == ''):
print("Not a valid author.substack.com URL")
exit(1)
source = urlparse(author_url)
if('substack.com' not in source.netloc or len(source.netloc.split('.')) != 3):
print("This does not appear to be a valid author.substack.com URL")
exit(1)
subdomain = source.netloc.split('.')[0]
publications = fetch_all_publications(subdomain)
valid_publications = only_valid_publications(publications)
if(len(valid_publications)==0):
print("There are no public or free preview newsletters by this creator - nothing to collect.")
exit(1)
print(f"{len(valid_publications)} of {len(publications)} publications are readable publically text posts - collecting those.")
totalTokenCount = 0
transaction_output_dir = f"../server/storage/documents/substack-{subdomain}"
if os.path.isdir(transaction_output_dir) == False:
os.makedirs(transaction_output_dir)
for publication in alive_it(valid_publications):
pub_file_path = transaction_output_dir + f"/publication-{publication.get('id')}.json"
if os.path.exists(pub_file_path) == True: continue
full_text = get_content(publication.get('canonical_url'))
if full_text is None or len(full_text) == 0: continue
full_text = append_meta(publication, full_text)
item = {
'id': publication.get('id'),
'url': publication.get('canonical_url'),
'thumbnail': publication.get('cover_image'),
'title': publication.get('title'),
'subtitle': publication.get('subtitle'),
'description': publication.get('description'),
'published': publication.get('post_date'),
'wordCount': publication.get('wordcount'),
'pageContent': full_text,
}
tokenCount = len(tokenize(full_text))
item['token_count_estimate'] = tokenCount
totalTokenCount += tokenCount
with open(pub_file_path, 'w', encoding='utf-8') as file:
json.dump(item, file, ensure_ascii=True, indent=4)
print(f"[Success]: {len(valid_publications)} scraped and fetched!")
print(f"\n\n////////////////////////////")
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
print(f"////////////////////////////\n\n")
exit(0)


@ -1,88 +0,0 @@
import os, json, requests, tempfile
from requests_html import HTMLSession
from langchain.document_loaders import UnstructuredHTMLLoader
from .watch.utils import guid
def fetch_all_publications(subdomain):
file_path = f"./outputs/substack-logs/substack-{subdomain}.json"
if os.path.isdir("./outputs/substack-logs") == False:
os.makedirs("./outputs/substack-logs")
if os.path.exists(file_path):
with open(file_path, "r") as file:
print(f"Returning cached data for substack {subdomain}.substack.com. If you do not wish to use stored data then delete the file for this newsletter to allow refetching.")
return json.load(file)
collecting = True
offset = 0
publications = []
while collecting is True:
url = f"https://{subdomain}.substack.com/api/v1/archive?sort=new&offset={offset}"
response = requests.get(url)
if(response.ok == False):
print("Bad response - exiting collection")
collecting = False
continue
data = response.json()
if(len(data) ==0 ):
collecting = False
continue
for publication in data:
publications.append(publication)
offset = len(publications)
with open(file_path, 'w+', encoding='utf-8') as json_file:
json.dump(publications, json_file, ensure_ascii=True, indent=2)
print(f"{len(publications)} publications found for author {subdomain}.substack.com. Saved to substack-logs/channel-{subdomain}.json")
return publications
def only_valid_publications(publications= []):
valid_publications = []
for publication in publications:
is_paid = publication.get('audience') != 'everyone'
if (is_paid and publication.get('should_send_free_preview') != True) or publication.get('type') != 'newsletter': continue
valid_publications.append(publication)
return valid_publications
def get_content(article_link):
print(f"Fetching {article_link}")
if(len(article_link) == 0):
print("Invalid URL!")
return None
session = HTMLSession()
req = session.get(article_link)
if(req.ok == False):
print("Could not reach this url!")
return None
req.html.render()
full_text = None
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
tmp.write(req.html.html)
tmp.seek(0)
loader = UnstructuredHTMLLoader(tmp.name)
data = loader.load()[0]
full_text = data.page_content
tmp.close()
return full_text
def append_meta(publication, text):
meta = {
'id': guid(),
'url': publication.get('canonical_url'),
'thumbnail': publication.get('cover_image'),
'title': publication.get('title'),
'subtitle': publication.get('subtitle'),
'description': publication.get('description'),
'createdAt': publication.get('post_date'),
'wordCount': publication.get('wordcount')
}
return "Newsletter Metadata:\n"+json.dumps(meta)+"\n\nArticle Content:\n" + text

View File

@ -1,103 +0,0 @@
"""
Tweepy implementation of twitter reader. Requires the 4 twitter keys to operate.
"""
import tweepy
import os, time
import pandas as pd
import json
from .utils import tokenize, ada_v2_cost
from .watch.utils import guid
def twitter():
#get user and number of tweets to read
username = input("user timeline to read from (blank to ignore): ")
searchQuery = input("Search term, or leave blank to get user tweets (blank to ignore): ")
tweetCount = input("Gather the last number of tweets: ")
# Read your API keys to call the API.
consumer_key = os.environ.get("TW_CONSUMER_KEY")
consumer_secret = os.environ.get("TW_CONSUMER_SECRET")
access_token = os.environ.get("TW_ACCESS_TOKEN")
access_token_secret = os.environ.get("TW_ACCESS_TOKEN_SECRET")
# Check if any of the required environment variables is missing.
if not consumer_key or not consumer_secret or not access_token or not access_token_secret:
raise EnvironmentError("One of the twitter API environment variables is missing.")
# Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
consumer_key, consumer_secret, access_token, access_token_secret
)
# Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)
try:
if (searchQuery == ''):
tweets = api.user_timeline(screen_name=username, tweet_mode = 'extended', count=tweetCount)
else:
tweets = api.search_tweets(q=searchQuery, tweet_mode = 'extended', count=tweetCount)
# Pulling Some attributes from the tweet
attributes_container = [
[tweet.id, tweet.user.screen_name, tweet.created_at, tweet.favorite_count, tweet.source, tweet.full_text]
for tweet in tweets
]
# Creation of column list to rename the columns in the dataframe
columns = ["id", "Screen Name", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
# Creation of Dataframe
tweets_df = pd.DataFrame(attributes_container, columns=columns)
totalTokens = 0
for index, row in tweets_df.iterrows():
meta_link = twitter_meta(row, True)
output_filename = f"twitter-{username}-{row['Date Created']}.json"
output_path = f"./outputs/twitter-logs"
transaction_output_filename = f"tweet-{username}-{row['id']}.json"
transaction_output_dir = f"../server/storage/documents/twitter-{username}"
if not os.path.isdir(output_path):
os.makedirs(output_path)
if not os.path.isdir(transaction_output_dir):
os.makedirs(transaction_output_dir)
full_text = twitter_meta(row)
tokenCount = len(tokenize(full_text))
meta_link['pageContent'] = full_text
meta_link['token_count_estimate'] = tokenCount
totalTokens += tokenCount
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
json.dump(meta_link, file, ensure_ascii=True, indent=4)
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
json.dump(meta_link, file, ensure_ascii=True, indent=4)
# print(f"{transaction_output_dir}/{transaction_output_filename}")
print(f"{tokenCount} tokens written over {tweets_df.shape[0]} records.")
except BaseException as e:
print("Status Failed: ", str(e))
time.sleep(3)
def twitter_meta(row, metadata_only = False):
# Note that /anyuser is a known twitter hack for not knowing the user's handle
# https://stackoverflow.com/questions/897107/can-i-fetch-the-tweet-from-twitter-if-i-know-the-tweets-id
url = f"http://twitter.com/anyuser/status/{row['id']}"
title = f"Tweet {row['id']}"
meta = {
'id': guid(),
'url': url,
'title': title,
'description': 'Tweet from ' + row["Screen Name"],
'published': row["Date Created"].strftime('%Y-%m-%d %H:%M:%S'),
'wordCount': len(row["Tweet"]),
}
return "Tweet JSON Metadata:\n"+json.dumps(meta)+"\n\n\nText Content:\n" + row["Tweet"] if metadata_only == False else meta

View File

@ -1,10 +0,0 @@
import tiktoken
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")
def tokenize(fullText):
return encoder.encode(fullText)
def ada_v2_cost(tokenCount):
rate_per = 0.0004 / 1_000 # $0.0004 / 1K tokens
total = tokenCount * rate_per
return '${:,.2f}'.format(total) if total >= 0.01 else '< $0.01'

View File

@ -1,78 +0,0 @@
import os
from langchain.document_loaders import Docx2txtLoader, UnstructuredODTLoader
from slugify import slugify
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize
# Process all text-related documents.
def as_docx(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.txt')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
loader = Docx2txtLoader(fullpath)
data = loader.load()[0]
content = data.page_content
if len(content) == 0:
print(f"Resulting text content was empty for {filename}{ext}.")
return(False, f"No text content found in {filename}{ext}")
print(f"-- Working {fullpath} --")
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': f"{filename}{ext}",
'docAuthor': 'Unknown', # TODO: Find a better author
'description': 'Unknown', # TODO: Find a better description
'docSource': 'Docx Text file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)
def as_odt(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.txt')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
loader = UnstructuredODTLoader(fullpath)
data = loader.load()[0]
content = data.page_content
if len(content) == 0:
print(f"Resulting text content was empty for {filename}{ext}.")
return(False, f"No text content found in {filename}{ext}")
print(f"-- Working {fullpath} --")
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': f"{filename}{ext}",
'docAuthor': 'Unknown', # TODO: Find a better author
'description': 'Unknown', # TODO: Find a better description
'docSource': 'ODT Text file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)

View File

@ -1,42 +0,0 @@
import os, re
from slugify import slugify
from langchain.document_loaders import BSHTMLLoader
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize
# Process all html-related documents.
def as_html(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.html')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
loader = BSHTMLLoader(fullpath)
document = loader.load()[0]
content = re.sub(r"\n+", "\n", document.page_content)
if len(content) == 0:
print(f"Resulting text content was empty for {filename}{ext}.")
return(False, f"No text content found in {filename}{ext}")
print(f"-- Working {fullpath} --")
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': document.metadata.get('title', f"{filename}{ext}"),
'docAuthor': 'Unknown', # TODO: Find a better author
'description': 'Unknown', # TODO: Find a better description
'docSource': 'an HTML file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)

View File

@ -1,42 +0,0 @@
import os
from langchain.document_loaders import UnstructuredMarkdownLoader
from slugify import slugify
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize
# Process all text-related documents.
def as_markdown(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.txt')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
loader = UnstructuredMarkdownLoader(fullpath)
data = loader.load()[0]
content = data.page_content
if len(content) == 0:
print(f"Resulting page content was empty - no text could be extracted from {filename}{ext}.")
return(False, f"No text could be extracted from {filename}{ext}.")
print(f"-- Working {fullpath} --")
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': f"{filename}", # TODO: find a better metadata
'docAuthor': 'Unknown', # TODO: find a better metadata
'description': 'Unknown', # TODO: find a better metadata
'docSource': 'markdown file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)

View File

@ -1,124 +0,0 @@
import os
import datetime
import email.utils
import re
import quopri
import base64
from mailbox import mbox, mboxMessage
from slugify import slugify
from bs4 import BeautifulSoup
from scripts.watch.utils import (
guid,
file_creation_time,
write_to_server_documents,
move_source,
)
from scripts.utils import tokenize
def get_content(message: mboxMessage) -> str:
content = "None"
# if message.is_multipart():
for part in message.walk():
if part.get_content_type() == "text/plain":
content = part.get_payload(decode=True)
break
elif part.get_content_type() == "text/html":
soup = BeautifulSoup(part.get_payload(decode=True), "html.parser")
content = soup.get_text()
if isinstance(content, bytes):
try:
content = content.decode("utf-8")
except UnicodeDecodeError:
content = content.decode("latin-1")
return content
def parse_subject(subject: str) -> str:
# Check if subject is Quoted-Printable encoded
if subject.startswith("=?") and subject.endswith("?="):
# Extract character set and encoding information
match = re.match(r"=\?(.+)\?(.)\?(.+)\?=", subject)
if match:
charset = match.group(1)
encoding = match.group(2)
encoded_text = match.group(3)
is_quoted_printable = encoding.upper() == "Q"
is_base64 = encoding.upper() == "B"
if is_quoted_printable:
# Decode Quoted-Printable encoded text
subject = quopri.decodestring(encoded_text).decode(charset)
elif is_base64:
# Decode Base64 encoded text
subject = base64.b64decode(encoded_text).decode(charset)
return subject
# Process all mbox-related documents.
def as_mbox(**kwargs):
parent_dir = kwargs.get("directory", "hotdir")
filename = kwargs.get("filename")
ext = kwargs.get("ext", ".mbox")
remove = kwargs.get("remove_on_complete", False)
if filename is not None:
filename = str(filename)
else:
print("[ERROR]: No filename provided.")
return (False, "No filename provided.")
fullpath = f"{parent_dir}/{filename}{ext}"
print(f"-- Working {fullpath} --")
box = mbox(fullpath)
for message in box:
content = get_content(message)
content = content.strip().replace("\r\n", "\n")
if len(content) == 0:
print("[WARNING]: Mail with no content. Ignored.")
continue
date_tuple = email.utils.parsedate_tz(message["Date"])
if date_tuple:
local_date = datetime.datetime.fromtimestamp(
email.utils.mktime_tz(date_tuple)
)
date_sent = local_date.strftime("%a, %d %b %Y %H:%M:%S")
else:
date_sent = None
subject = message["Subject"]
if subject is None:
print("[WARNING]: Mail with no subject. But has content.")
subject = "None"
else:
subject = parse_subject(subject)
abs_path = os.path.abspath(
f"{parent_dir}/processed/{slugify(filename)}-{guid()}{ext}"
)
data = {
"id": guid(),
"url": f"file://{abs_path}",
"title": subject,
"docAuthor": message["From"],
"description": f"email from {message['From']} to {message['To']}",
"docSource": "mbox file uploaded by the user.",
"chunkSource": subject,
"published": file_creation_time(fullpath),
"wordCount": len(content),
"pageContent": content,
"token_count_estimate": len(tokenize(content)),
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return (True, None)

View File

@ -1,58 +0,0 @@
import os, fitz
from langchain.document_loaders import PyMuPDFLoader # better UTF support and metadata
from slugify import slugify
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize
# Process all PDF-related documents.
def as_pdf(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.txt')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
print(f"-- Working {fullpath} --")
loader = PyMuPDFLoader(fullpath)
pages = loader.load()
if len(pages) == 0:
print(f"{fullpath} parsing resulted in no pages - nothing to do.")
return(False, f"No pages found for {filename}{ext}!")
# Set doc to the first page so we can still get the metadata from PyMuPDF but without all the unicode issues.
doc = pages[0]
del loader
del pages
page_content = ''
for page in fitz.open(fullpath):
print(f"-- Parsing content from pg {page.number} --")
page_content += str(page.get_text('text'))
if len(page_content) == 0:
print(f"Resulting page content was empty - no text could be extracted from the document.")
return(False, f"No text content could be extracted from {filename}{ext}!")
title = doc.metadata.get('title')
author = doc.metadata.get('author')
subject = doc.metadata.get('subject')
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': title if title else f"{filename}{ext}",
'docAuthor': author if author else 'No author found',
'description': subject if subject else 'No description found.',
'docSource': 'pdf file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(page_content), # Technically a letter count :p
'pageContent': page_content,
'token_count_estimate': len(tokenize(page_content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)

View File

@ -1,38 +0,0 @@
import os
from slugify import slugify
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize
# Process all text-related documents.
def as_text(**kwargs):
parent_dir = kwargs.get('directory', 'hotdir')
filename = kwargs.get('filename')
ext = kwargs.get('ext', '.txt')
remove = kwargs.get('remove_on_complete', False)
fullpath = f"{parent_dir}/{filename}{ext}"
content = open(fullpath).read()
if len(content) == 0:
print(f"Resulting text content was empty for {filename}{ext}.")
return(False, f"No text content found in {filename}{ext}")
print(f"-- Working {fullpath} --")
data = {
'id': guid(),
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
'title': f"{filename}{ext}",
'docAuthor': 'Unknown', # TODO: Find a better author
'description': 'Unknown', # TODO: Find a better description
'docSource': 'a text file uploaded by the user.',
'chunkSource': f"{filename}{ext}",
'published': file_creation_time(fullpath),
'wordCount': len(content),
'pageContent': content,
'token_count_estimate': len(tokenize(content))
}
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
move_source(parent_dir, f"{filename}{ext}", remove=remove)
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
return(True, None)

View File

@ -1,25 +0,0 @@
from .convert.as_text import as_text
from .convert.as_markdown import as_markdown
from .convert.as_pdf import as_pdf
from .convert.as_docx import as_docx, as_odt
from .convert.as_mbox import as_mbox
from .convert.as_html import as_html
FILETYPES = {
'.txt': as_text,
'.md': as_markdown,
'.pdf': as_pdf,
'.docx': as_docx,
'.odt': as_odt,
'.mbox': as_mbox,
'.html': as_html,
}
ACCEPTED_MIMES = {
'text/plain': ['.txt', '.md'],
'text/html': ['.html'],
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
'application/vnd.oasis.opendocument.text': ['.odt'],
'application/pdf': ['.pdf'],
'application/mbox': ['.mbox'],
}

View File

@ -1,22 +0,0 @@
import os
from .filetypes import FILETYPES
from .utils import move_source
RESERVED = ['__HOTDIR__.md']
def watch_for_changes(directory):
for raw_doc in os.listdir(directory):
if os.path.isdir(f"{directory}/{raw_doc}") or raw_doc in RESERVED: continue
filename, fileext = os.path.splitext(raw_doc)
if filename in ['.DS_Store'] or fileext == '': continue
if fileext not in FILETYPES.keys():
print(f"{fileext} not a supported file type for conversion. Removing from hot directory.")
move_source(new_destination_filename=raw_doc, failed=True)
continue
FILETYPES[fileext](
directory=directory,
filename=filename,
ext=fileext,
)

View File

@ -1,35 +0,0 @@
import os
from .filetypes import FILETYPES
from .utils import move_source
RESERVED = ['__HOTDIR__.md']
# This script will do a one-off processing of a specific document that exists in hotdir.
# For this function we remove the original source document since there is no need to keep it and it will
# only occupy additional disk space.
def process_single(directory, target_doc):
if os.path.isdir(f"{directory}/{target_doc}") or target_doc in RESERVED: return (False, "Not a file")
if os.path.exists(f"{directory}/{target_doc}") is False:
print(f"{directory}/{target_doc} does not exist.")
return (False, f"{directory}/{target_doc} does not exist.")
filename, fileext = os.path.splitext(target_doc)
if filename in ['.DS_Store'] or fileext == '': return (False, "Not a valid file.")
if fileext == '.lock':
print(f"{filename} is locked - skipping until unlocked")
return (False, f"{filename} is locked - skipping until unlocked")
if fileext not in FILETYPES.keys():
print(f"{fileext} not a supported file type for conversion. It will not be processed.")
move_source(new_destination_filename=target_doc, failed=True, remove=True)
return (False, f"{fileext} not a supported file type for conversion. It will not be processed.")
# Returns Tuple of (Boolean, String|None) of success status and possible error message.
# Error message will display to user.
return FILETYPES[fileext](
directory=directory,
filename=filename,
ext=fileext,
remove_on_complete=True # remove source document to save disk space.
)

View File

@ -1,35 +0,0 @@
import os, json
from datetime import datetime
from uuid import uuid4
def guid():
return str(uuid4())
def file_creation_time(path_to_file):
try:
if os.name == 'nt':
return datetime.fromtimestamp(os.path.getctime(path_to_file)).strftime('%Y-%m-%d %H:%M:%S')
else:
stat = os.stat(path_to_file)
return datetime.fromtimestamp(stat.st_birthtime).strftime('%Y-%m-%d %H:%M:%S')
except AttributeError:
return datetime.today().strftime('%Y-%m-%d %H:%M:%S')
def move_source(working_dir='hotdir', new_destination_filename='', failed=False, remove=False):
if remove and os.path.exists(f"{working_dir}/{new_destination_filename}"):
print(f"{new_destination_filename} deleted from filesystem")
os.remove(f"{working_dir}/{new_destination_filename}")
return
destination = f"{working_dir}/processed" if not failed else f"{working_dir}/failed"
if os.path.exists(destination) == False:
os.mkdir(destination)
os.replace(f"{working_dir}/{new_destination_filename}", f"{destination}/{new_destination_filename}")
return
def write_to_server_documents(data, filename, override_destination = None):
destination = f"../server/storage/documents/custom-documents" if override_destination == None else override_destination
if os.path.exists(destination) == False: os.makedirs(destination)
with open(f"{destination}/{filename}.json", 'w', encoding='utf-8') as file:
json.dump(data, file, ensure_ascii=True, indent=4)

View File

@ -1,55 +0,0 @@
import os, json
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter, JSONFormatter
from .utils import tokenize, ada_v2_cost
from .yt_utils import fetch_channel_video_information, get_channel_id, clean_text, append_meta, get_duration
from alive_progress import alive_it
# Example Channel URL https://www.youtube.com/channel/UCmWbhBB96ynOZuWG7LfKong
# Example Channel URL https://www.youtube.com/@mintplex
def youtube():
channel_link = input("Paste in the URL of a YouTube channel: ")
channel_id = get_channel_id(channel_link)
if channel_id == None or len(channel_id) == 0:
print("Invalid input - must be full YouTube channel URL")
exit(1)
channel_data = fetch_channel_video_information(channel_id)
transaction_output_dir = f"../server/storage/documents/youtube-{channel_data.get('channelTitle')}"
if os.path.isdir(transaction_output_dir) == False:
os.makedirs(transaction_output_dir)
print(f"\nFetching transcripts for {len(channel_data.get('items'))} videos - please wait.\nStopping and restarting will not refetch known transcripts in case there is an error.\nSaving results to: {transaction_output_dir}.")
totalTokenCount = 0
for video in alive_it(channel_data.get('items')):
video_file_path = transaction_output_dir + f"/video-{video.get('id')}.json"
if os.path.exists(video_file_path) == True:
continue
formatter = TextFormatter()
json_formatter = JSONFormatter()
try:
transcript = YouTubeTranscriptApi.get_transcript(video.get('id'))
raw_text = clean_text(formatter.format_transcript(transcript))
duration = get_duration(json_formatter.format_transcript(transcript))
if(len(raw_text) > 0):
fullText = append_meta(video, duration, raw_text)
tokenCount = len(tokenize(fullText))
video['pageContent'] = fullText
video['token_count_estimate'] = tokenCount
totalTokenCount += tokenCount
with open(video_file_path, 'w', encoding='utf-8') as file:
json.dump(video, file, ensure_ascii=True, indent=4)
except:
print("There was an issue getting the transcription of a video in the list - likely because captions are disabled. Skipping")
continue
print(f"[Success]: {len(channel_data.get('items'))} video transcripts fetched!")
print(f"\n\n////////////////////////////")
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
print(f"////////////////////////////\n\n")
exit(0)

View File

@ -1,122 +0,0 @@
import json, requests, os, re
from slugify import slugify
from dotenv import load_dotenv
from .watch.utils import guid
load_dotenv()
def is_yt_short(videoId):
url = 'https://www.youtube.com/shorts/' + videoId
ret = requests.head(url)
return ret.status_code == 200
def get_channel_id(channel_link):
if('@' in channel_link):
pattern = r'https?://www\.youtube\.com/(@\w+)/?'
match = re.match(pattern, channel_link)
if match is None: return None
handle = match.group(1)
print('Need to map username to channelId - this can take a while sometimes.')
response = requests.get(f"https://yt.lemnoslife.com/channels?handle={handle}", timeout=20)
if(response.ok == False):
print("Handle => ChannelId mapping endpoint is too slow - use regular youtube.com/channel URL")
return None
json_data = response.json()
return json_data.get('items')[0].get('id')
else:
pattern = r"youtube\.com/channel/([\w-]+)"
match = re.search(pattern, channel_link)
return match.group(1) if match else None
def clean_text(text):
return re.sub(r"\[.*?\]", "", text)
def append_meta(video, duration, text):
meta = {
'id': guid(),
'youtubeURL': f"https://youtube.com/watch?v={video.get('id')}",
'thumbnail': video.get('thumbnail'),
'description': video.get('description'),
'createdAt': video.get('published'),
'videoDurationInSeconds': duration,
}
return "Video JSON Metadata:\n"+json.dumps(meta, indent=4)+"\n\n\nAudio Transcript:\n" + text
def get_duration(json_str):
data = json.loads(json_str)
return data[-1].get('start')
def fetch_channel_video_information(channel_id, windowSize = 50):
if channel_id == None or len(channel_id) == 0:
print("No channel id provided!")
exit(1)
if os.path.isdir("./outputs/channel-logs") == False:
os.makedirs("./outputs/channel-logs")
file_path = f"./outputs/channel-logs/channel-{channel_id}.json"
if os.path.exists(file_path):
with open(file_path, "r") as file:
print(f"Returning cached data for channel {channel_id}. If you do not wish to use stored data then delete the file for this channel to allow refetching.")
return json.load(file)
if(os.getenv('GOOGLE_APIS_KEY') == None):
print("GOOGLE_APIS_KEY env variable not set!")
exit(1)
done = False
currentPage = None
pageTokens = []
items = []
data = {
'id': channel_id,
}
print("Fetching first page of results...")
while(done == False):
url = f"https://www.googleapis.com/youtube/v3/search?key={os.getenv('GOOGLE_APIS_KEY')}&channelId={channel_id}&part=snippet,id&order=date&type=video&maxResults={windowSize}"
if(currentPage != None):
print(f"Fetching page ${currentPage}")
url += f"&pageToken={currentPage}"
req = requests.get(url)
if(req.ok == False):
print("Could not fetch channel_id items!")
exit(1)
response = req.json()
currentPage = response.get('nextPageToken')
if currentPage in pageTokens:
print('All pages iterated and logged!')
done = True
break
for item in response.get('items'):
if 'id' in item and 'videoId' in item.get('id'):
if is_yt_short(item.get('id').get('videoId')):
print(f"Filtering out YT Short {item.get('id').get('videoId')}")
continue
if data.get('channelTitle') is None:
data['channelTitle'] = slugify(item.get('snippet').get('channelTitle'))
newItem = {
'id': item.get('id').get('videoId'),
'url': f"https://youtube.com/watch?v={item.get('id').get('videoId')}",
'title': item.get('snippet').get('title'),
'description': item.get('snippet').get('description'),
'thumbnail': item.get('snippet').get('thumbnails').get('high').get('url'),
'published': item.get('snippet').get('publishTime'),
}
items.append(newItem)
pageTokens.append(currentPage)
data['items'] = items
with open(file_path, 'w+', encoding='utf-8') as json_file:
json.dump(data, json_file, ensure_ascii=True, indent=2)
print(f"{len(items)} videos found for channel {data.get('channelTitle')}. Saved to channel-logs/channel-{channel_id}.json")
return data

collector/utils/asDocx.js Normal file
View File

@ -0,0 +1,50 @@
const { v4 } = require("uuid");
const { DocxLoader } = require("langchain/document_loaders/fs/docx");
const {
createdDate,
trashFile,
writeToServerDocuments,
} = require("../../utils/files");
const { tokenizeString } = require("../../utils/tokenizer");
const { default: slugify } = require("slugify");
async function asDocX({ fullFilePath = "", filename = "" }) {
const loader = new DocxLoader(fullFilePath);
console.log(`-- Working ${filename} --`);
let pageContent = [];
const docs = await loader.load();
for (const doc of docs) {
console.log(doc.metadata);
console.log(`-- Parsing content from docx page --`);
if (!doc.pageContent.length) continue;
pageContent.push(doc.pageContent);
}
if (!pageContent.length) {
console.error(`Resulting text content was empty for ${filename}.`);
return { success: false, reason: `No text content found in ${filename}.` };
}
const content = pageContent.join("");
const data = {
id: v4(),
url: "file://" + fullFilePath,
title: filename,
docAuthor: "no author found",
description: "No description found.",
docSource: "pdf file uploaded by the user.",
chunkSource: filename,
published: createdDate(fullFilePath),
wordCount: content.split(" ").length,
pageContent: content,
token_count_estimate: tokenizeString(content).length,
};
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
trashFile(fullFilePath);
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
return { success: true, reason: null };
}
module.exports = asDocX;
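For reference, a minimal sketch of invoking this converter directly. The require path and the hotdir file are assumptions for illustration and not the collector's actual dispatch path:

const path = require("path");
const asDocX = require("./asDocx"); // assumed require path for this sketch

(async () => {
  // Hypothetical .docx dropped into the collector's hot directory.
  const fullFilePath = path.resolve(__dirname, "../hotdir/demo.docx");
  const { success, reason } = await asDocX({ fullFilePath, filename: "demo.docx" });
  console.log(success ? "Converted & written to server documents." : reason);
})();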

View File

@ -0,0 +1,40 @@
const WATCH_DIRECTORY = require("path").resolve(__dirname, "../hotdir");
const ACCEPTED_MIMES = {
"text/plain": [".txt", ".md"],
"text/html": [".html"],
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": [
".docx",
],
"application/vnd.openxmlformats-officedocument.presentationml.presentation": [
".pptx",
],
"application/vnd.oasis.opendocument.text": [".odt"],
"application/vnd.oasis.opendocument.presentation": [".odp"],
"application/pdf": [".pdf"],
"application/mbox": [".mbox"],
};
const SUPPORTED_FILETYPE_CONVERTERS = {
".txt": "./convert/asTxt.js",
".md": "./convert/asTxt.js",
".html": "./convert/asTxt.js",
".pdf": "./convert/asPDF.js",
".docx": "./convert/asDocx.js",
".pptx": "./convert/asOfficeMime.js",
".odt": "./convert/asOfficeMime.js",
".odp": "./convert/asOfficeMime.js",
".mbox": "./convert/asMbox.js",
};
module.exports = {
SUPPORTED_FILETYPE_CONVERTERS,
WATCH_DIRECTORY,
ACCEPTED_MIMES,
};
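A sketch of how a dispatcher could use these maps to route an uploaded file to its converter. The processSingleFile wrapper and the "./constants" require path are assumptions for illustration; only the extension lookup mirrors the constants above:

const path = require("path");
const { SUPPORTED_FILETYPE_CONVERTERS } = require("./constants"); // assumed module path

async function processSingleFile(fullFilePath) {
  const filename = path.basename(fullFilePath);
  const ext = path.extname(filename).toLowerCase();

  if (!(ext in SUPPORTED_FILETYPE_CONVERTERS))
    return { success: false, reason: `${ext} is not a supported filetype.` };

  // Converter paths are relative, so this sketch assumes the dispatcher sits next to a convert/ folder.
  // Each converter exports an async ({ fullFilePath, filename }) => { success, reason } function.
  const convert = require(SUPPORTED_FILETYPE_CONVERTERS[ext]);
  return await convert({ fullFilePath, filename });
}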

View File

@ -0,0 +1,55 @@
const fs = require("fs");
const path = require("path");
function trashFile(filepath) {
if (!fs.existsSync(filepath)) return;
try {
const isDir = fs.lstatSync(filepath).isDirectory();
if (isDir) return;
} catch {
return;
}
fs.rmSync(filepath);
return;
}
function createdDate(filepath) {
try {
const { birthtimeMs, birthtime } = fs.statSync(filepath);
if (birthtimeMs === 0) throw new Error("Invalid stat for file!");
return birthtime.toLocaleString();
} catch {
return "unknown";
}
}
function writeToServerDocuments(
data = {},
filename,
destinationOverride = null
) {
const destination = destinationOverride
? path.resolve(destinationOverride)
: path.resolve(
__dirname,
"../../../server/storage/documents/custom-documents"
);
if (!fs.existsSync(destination))
fs.mkdirSync(destination, { recursive: true });
const destinationFilePath = path.resolve(destination, filename);
fs.writeFileSync(
destinationFilePath + ".json",
JSON.stringify(data, null, 4),
{ encoding: "utf-8" }
);
return;
}
module.exports = {
trashFile,
createdDate,
writeToServerDocuments,
};
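To document the record shape these helpers persist, a short sketch that writes a hand-built document and then discards the original source file. The fields mirror the converters above; the paths and content are made up for the example:

const { v4 } = require("uuid");
const {
  createdDate,
  trashFile,
  writeToServerDocuments,
} = require("./files"); // assumed require path for this sketch

const sourcePath = "/tmp/notes.txt"; // hypothetical source file
const record = {
  id: v4(),
  url: "file://" + sourcePath,
  title: "notes.txt",
  docAuthor: "no author found",
  description: "No description found.",
  docSource: "a text file uploaded by the user.",
  chunkSource: "notes.txt",
  published: createdDate(sourcePath),
  wordCount: 2,
  pageContent: "hello world",
  token_count_estimate: 2,
};

// Lands in server/storage/documents/custom-documents/notes-<id>.json by default.
writeToServerDocuments(record, `notes-${record.id}`);
trashFile(sourcePath); // remove the original upload once its content is captured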

View File

@ -0,0 +1,18 @@
process.env.NODE_ENV === "development"
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` })
: require("dotenv").config();
function reqBody(request) {
return typeof request.body === "string"
? JSON.parse(request.body)
: request.body;
}
function queryParams(request) {
return request.query;
}
module.exports = {
reqBody,
queryParams,
};
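A small sketch of an Express route consuming these helpers; the /process-example endpoint and its payload are assumptions for illustration, not the collector's real API surface:

const express = require("express");
const { reqBody, queryParams } = require("./http"); // assumed require path

const app = express();
app.use(express.json());

// Hypothetical endpoint: echoes the parsed body and query string back to the caller.
app.post("/process-example", (request, response) => {
  const { filename = "" } = reqBody(request);
  const { debug = "false" } = queryParams(request);
  response.status(200).json({ filename, debug });
});

app.listen(8888, () => console.log("Example collector listening on port 8888"));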

View File

@ -0,0 +1,15 @@
const { getEncoding } = require("js-tiktoken");
function tokenizeString(input = "") {
try {
const encoder = getEncoding("cl100k_base");
return encoder.encode(input);
} catch (e) {
console.error("Could not tokenize string!");
return [];
}
}
module.exports = {
tokenizeString,
};
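For example, computing the token_count_estimate that the converters store (a sketch; the sample string is arbitrary):

const { tokenizeString } = require("./tokenizer"); // assumed require path

const pageContent = "AnythingLLM converts uploaded documents into embeddable text.";
const token_count_estimate = tokenizeString(pageContent).length;
console.log(`~${token_count_estimate} tokens for ${pageContent.split(" ").length} words`);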

View File

@ -0,0 +1,11 @@
function validURL(url) {
try {
new URL(url);
return true;
} catch {}
return false;
}
module.exports = {
validURL,
};
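Usage is a one-liner; a quick sketch with arbitrary example inputs:

const { validURL } = require("./url"); // assumed require path

console.log(validURL("https://example.com/article")); // true
console.log(validURL("not a url"));                   // false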

View File

@ -1,21 +0,0 @@
import _thread, time
from scripts.watch.main import watch_for_changes
a_list = []
WATCH_DIRECTORY = "hotdir"
def input_thread(a_list):
input()
a_list.append(True)
def main():
_thread.start_new_thread(input_thread, (a_list,))
print(f"Watching '{WATCH_DIRECTORY}/' for new files.\n\nUpload files into this directory while this script is running to convert them.\nPress enter or crtl+c to exit script.")
while not a_list:
watch_for_changes(WATCH_DIRECTORY)
time.sleep(1)
print("Stopping watching of hot directory.")
exit(1)
if __name__ == "__main__":
main()

View File

@ -1,4 +0,0 @@
from api import api
if __name__ == '__main__':
api.run(debug=False)

collector/yarn.lock Normal file

File diff suppressed because it is too large.

View File

@ -8,7 +8,7 @@ ARG ARG_GID=1000
# Install system dependencies
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends \
curl gnupg libgfortran5 python3 python3-pip tzdata netcat \
curl gnupg libgfortran5 libgbm1 tzdata netcat \
libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 \
libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 \
libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 \
@ -21,13 +21,7 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
apt-get install -yq --no-install-recommends nodejs && \
curl -LO https://github.com/yarnpkg/yarn/releases/download/v1.22.19/yarn_1.22.19_all.deb \
&& dpkg -i yarn_1.22.19_all.deb \
&& rm yarn_1.22.19_all.deb && \
curl -LO https://github.com/jgm/pandoc/releases/download/3.1.3/pandoc-3.1.3-1-amd64.deb \
&& dpkg -i pandoc-3.1.3-1-amd64.deb \
&& rm pandoc-3.1.3-1-amd64.deb && \
rm -rf /var/lib/apt/lists/* /usr/share/icons && \
dpkg-reconfigure -f noninteractive tzdata && \
python3 -m pip install --no-cache-dir virtualenv
&& rm yarn_1.22.19_all.deb
# Create a group and user with specific UID and GID
RUN groupadd -g $ARG_GID anythingllm && \
@ -81,10 +75,7 @@ COPY --from=build-stage /app/frontend/dist ./server/public
COPY --chown=anythingllm:anythingllm ./collector/ ./collector/
# Install collector dependencies
RUN cd /app/collector && \
python3 -m virtualenv v-env && \
. v-env/bin/activate && \
pip install --no-cache-dir -r requirements.txt
RUN cd /app/collector && yarn install --production && yarn cache clean
# Migrate and Run Prisma against known schema
RUN cd ./server && npx prisma generate --schema=./prisma/schema.prisma
@ -92,7 +83,6 @@ RUN cd ./server && npx prisma migrate deploy --schema=./prisma/schema.prisma
# Setup the environment
ENV NODE_ENV=production
ENV PATH=/app/collector/v-env/bin:$PATH
# Expose the server port
EXPOSE 3001

View File

@ -24,6 +24,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
@ -45,16 +46,6 @@ Your docker host will show the image as online once the build process is complet
## How to use the user interface
- To access the full application, visit `http://localhost:3001` in your browser.
## How to add files to my system using the standalone scripts
- Upload files from the UI in your Workspace settings
- To run the collector scripts to grab external data (articles, URLs, etc.)
- `docker exec -it --workdir=/app/collector anything-llm python main.py`
- To run the collector watch script to process files from the hotdir
- `docker exec -it --workdir=/app/collector anything-llm python watch.py`
- Upload [compliant files](../collector/hotdir/__HOTDIR__.md) to `./collector/hotdir` and they will be processed and made available in the UI.
## About UID and GID in the ENV
- The UID and GID are set to 1000 by default. This is the default user in the Docker container and on most host operating systems. If there is a mismatch between your host user UID and GID and what is set in the `.env` file, you may experience permission issues.

View File

@ -17,6 +17,8 @@ services:
args:
ARG_UID: ${UID:-1000}
ARG_GID: ${GID:-1000}
cap_add:
- SYS_ADMIN
volumes:
- "./.env:/app/server/.env"
- "../server/storage:/app/server/storage"

View File

@ -4,6 +4,6 @@
npx prisma migrate deploy --schema=./prisma/schema.prisma &&\
node /app/server/index.js
} &
{ FLASK_ENV=production FLASK_APP=wsgi.py cd collector && gunicorn --timeout 300 --workers 4 --bind 0.0.0.0:8888 wsgi:api; } &
{ node /app/collector/index.js; } &
wait -n
exit $?

View File

@ -9,10 +9,11 @@
"node": ">=18"
},
"scripts": {
"lint": "cd server && yarn lint && cd .. && cd frontend && yarn lint",
"setup": "cd server && yarn && cd ../frontend && yarn && cd .. && yarn setup:envs && yarn prisma:setup && echo \"Please run yarn dev:server and yarn dev:frontend in separate terminal tabs.\"",
"lint": "cd server && yarn lint && cd ../frontend && yarn lint && cd ../collector && yarn lint",
"setup": "cd server && yarn && cd ../collector && yarn && cd ../frontend && yarn && cd .. && yarn setup:envs && yarn prisma:setup && echo \"Please run yarn dev:server, yarn dev:collector, and yarn dev:frontend in separate terminal tabs.\"",
"setup:envs": "cp -n ./frontend/.env.example ./frontend/.env && cp -n ./server/.env.example ./server/.env.development && cp -n ./collector/.env.example ./collector/.env && cp -n ./docker/.env.example ./docker/.env && echo \"All ENV files copied!\n\"",
"dev:server": "cd server && yarn dev",
"dev:collector": "cd collector && yarn dev",
"dev:frontend": "cd frontend && yarn start",
"prisma:generate": "cd server && npx prisma generate",
"prisma:migrate": "cd server && npx prisma migrate dev --name init",

View File

@ -2,7 +2,7 @@ const { Telemetry } = require("../../../models/telemetry");
const { validApiKey } = require("../../../utils/middleware/validApiKey");
const { setupMulter } = require("../../../utils/files/multer");
const {
checkPythonAppAlive,
checkProcessorAlive,
acceptedFileTypes,
processDocument,
} = require("../../../utils/files/documentProcessor");
@ -60,14 +60,14 @@ function apiDocumentEndpoints(app) {
*/
try {
const { originalname } = request.file;
const processingOnline = await checkPythonAppAlive();
const processingOnline = await checkProcessorAlive();
if (!processingOnline) {
response
.status(500)
.json({
success: false,
error: `Python processing API is not online. Document ${originalname} will not be processed automatically.`,
error: `Document processing API is not online. Document ${originalname} will not be processed automatically.`,
})
.end();
}

View File

@ -4,7 +4,7 @@ process.env.NODE_ENV === "development"
const { viewLocalFiles } = require("../utils/files");
const { exportData, unpackAndOverwriteImport } = require("../utils/files/data");
const {
checkPythonAppAlive,
checkProcessorAlive,
acceptedFileTypes,
} = require("../utils/files/documentProcessor");
const { purgeDocument } = require("../utils/files/purgeDocument");
@ -221,7 +221,7 @@ function systemEndpoints(app) {
[validatedRequest],
async (_, response) => {
try {
const online = await checkPythonAppAlive();
const online = await checkProcessorAlive();
response.sendStatus(online ? 200 : 503);
} catch (e) {
console.log(e.message, e);

View File

@ -7,7 +7,7 @@ const { convertToChatHistory } = require("../utils/chats");
const { getVectorDbClass } = require("../utils/helpers");
const { setupMulter } = require("../utils/files/multer");
const {
checkPythonAppAlive,
checkProcessorAlive,
processDocument,
processLink,
} = require("../utils/files/documentProcessor");
@ -82,14 +82,14 @@ function workspaceEndpoints(app) {
handleUploads.single("file"),
async function (request, response) {
const { originalname } = request.file;
const processingOnline = await checkPythonAppAlive();
const processingOnline = await checkProcessorAlive();
if (!processingOnline) {
response
.status(500)
.json({
success: false,
error: `Python processing API is not online. Document ${originalname} will not be processed automatically.`,
error: `Document processing API is not online. Document ${originalname} will not be processed automatically.`,
})
.end();
return;
@ -114,14 +114,14 @@ function workspaceEndpoints(app) {
[validatedRequest],
async (request, response) => {
const { link = "" } = reqBody(request);
const processingOnline = await checkPythonAppAlive();
const processingOnline = await checkProcessorAlive();
if (!processingOnline) {
response
.status(500)
.json({
success: false,
error: `Python processing API is not online. Link ${link} will not be processed automatically.`,
error: `Document processing API is not online. Link ${link} will not be processed automatically.`,
})
.end();
return;

View File

@ -2,15 +2,15 @@
// of docker this endpoint is not exposed so it is only on the Docker instances internal network
// so no additional security is needed on the endpoint directly. Auth is done however by the express
// middleware prior to leaving the node-side of the application so that is good enough >:)
const PYTHON_API = "http://0.0.0.0:8888";
async function checkPythonAppAlive() {
return await fetch(`${PYTHON_API}`)
const PROCESSOR_API = "http://0.0.0.0:8888";
async function checkProcessorAlive() {
return await fetch(`${PROCESSOR_API}`)
.then((res) => res.ok)
.catch((e) => false);
}
async function acceptedFileTypes() {
return await fetch(`${PYTHON_API}/accepts`)
return await fetch(`${PROCESSOR_API}/accepts`)
.then((res) => {
if (!res.ok) throw new Error("Could not reach");
return res.json();
@ -21,7 +21,7 @@ async function acceptedFileTypes() {
async function processDocument(filename = "") {
if (!filename) return false;
return await fetch(`${PYTHON_API}/process`, {
return await fetch(`${PROCESSOR_API}/process`, {
method: "POST",
headers: {
"Content-Type": "application/json",
@ -41,7 +41,7 @@ async function processDocument(filename = "") {
async function processLink(link = "") {
if (!link) return false;
return await fetch(`${PYTHON_API}/process-link`, {
return await fetch(`${PROCESSOR_API}/process-link`, {
method: "POST",
headers: {
"Content-Type": "application/json",
@ -60,7 +60,7 @@ async function processLink(link = "") {
}
module.exports = {
checkPythonAppAlive,
checkProcessorAlive,
processDocument,
processLink,
acceptedFileTypes,
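A sketch of how server-side code is expected to call this module after the rename. The guard-then-process flow mirrors the endpoints above; the require path and filename are illustrative assumptions:

const {
  checkProcessorAlive,
  processDocument,
} = require("../utils/files/documentProcessor"); // assumed require path

async function uploadExample() {
  // Skip processing entirely when the collector service is unreachable.
  if (!(await checkProcessorAlive())) {
    console.error("Document processing API is not online.");
    return;
  }
  const result = await processDocument("demo.docx"); // hypothetical uploaded filename
  console.log(result);
}

uploadExample();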