How to collect data for vectorizing
Run this process first. It will let you collect data from a variety of sources. The following services are currently supported:
- YouTube Channels
- Medium
- Substack
- Arbitrary Link
- Gitbook
- Local Files (.txt, .pdf, etc.) (see the full list; some of these resources are under development or require a PR)
Requirements
- Python 3.8+
- Google Cloud Account (for YouTube channels)
- pandoc (for .ODT document processing): install it with `brew install pandoc` on macOS (you can sanity-check the prerequisites with the snippet below)
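As a quick sanity check that the prerequisites are available, here is a small illustrative snippet (not part of the collector itself):

```python
# Check the interpreter version and that pandoc is on the PATH.
# pandoc is only needed for .ODT document processing.
import shutil
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print("python:", sys.version.split()[0])
print("pandoc:", shutil.which("pandoc") or "not found (needed only for .ODT)")
```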
Setup
This example uses python3.9, but any version 3.8+ will work. Tested on macOS; untested on Windows.
- Install virtualenv for python3.8+ first, before any other steps: `python3.9 -m pip install virtualenv`
- `cd collector` from the root directory
- `python3.9 -m virtualenv v-env`
- `source v-env/bin/activate`
- `pip install -r requirements.txt`
- `cp .env.example .env`
- Run `python main.py` for interactive collection, or `python watch.py` to process local documents
- Select the option you want and follow the prompts. Done!
- Run `deactivate` to get back to your regular shell
Outputs
All JSON file data is cached in the `output/` folder. This prevents redundant API calls to services which may have rate limits or quota caps. Clearing out the `output/` folder will execute the script as if there was no cache.
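As a rough sketch of this caching pattern (the helper name and per-source file layout here are assumptions for illustration, not the collector's actual internals):

```python
# Illustrative sketch: cache service responses as JSON under output/.
# fetch_from_service and the file naming scheme are hypothetical.
import json
from pathlib import Path

OUTPUT_DIR = Path("output")

def cached_fetch(source_id: str, fetch_from_service) -> dict:
    """Return cached JSON for source_id, calling the service only on a miss."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    cache_file = OUTPUT_DIR / f"{source_id}.json"
    if cache_file.exists():
        # Cache hit: no API call, so no rate-limit or quota cost.
        return json.loads(cache_file.read_text())
    data = fetch_from_service(source_id)  # this call spends API quota
    cache_file.write_text(json.dumps(data))
    return data
```

Because completed items already live on disk, re-running the script naturally skips them, which is also what makes the resume-on-failure behavior described below cheap.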
As files are processed you will see data being written to both the `collector/outputs` folder as well as the `server/documents` folder. Later in this process, once you boot up the server, you will be able to bulk vectorize this content from a simple UI!
If collection fails at any point in the process, it will pick up where it last left off, so you are not re-spending API credits.
How to get a Google Cloud API Key (YouTube data collection only)
Required to fetch YouTube transcripts and data.
- Have a Google account
- Visit the GCP Cloud Console
- Click on the dropdown in the top right > Create new project. Name it whatever you like
- Enable the YouTube Data API v3
- Once enabled, generate a credential (API key) for this API
- Paste your key after `GOOGLE_APIS_KEY=` in your `collector/.env` file.
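If you want to confirm the key is picked up, here is a minimal sketch using python-dotenv (an assumption for illustration; the collector may load its environment differently):

```python
# Assumes python-dotenv is installed and you run this from collector/.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment
api_key = os.getenv("GOOGLE_APIS_KEY")
if not api_key:
    raise SystemExit("GOOGLE_APIS_KEY is not set in collector/.env")
print("Google APIs key loaded:", api_key[:4] + "...")
```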
Running the document processing API locally
From the `collector` directory, with the `v-env` active, run `flask run --host '0.0.0.0' --port 8888`.
Now uploads from the frontend will be processed as if you ran the `watch.py` script manually.
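To verify the API is reachable before uploading, you can hit the port it listens on. Only the host and port here come from the command above; the root path and response shape are assumptions:

```python
# Minimal reachability check against the local document-processing API.
import requests

resp = requests.get("http://localhost:8888/", timeout=5)
print(resp.status_code, resp.reason)  # any HTTP response means Flask is up
```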
Docker: If you run this application via Docker, the API is already started for you and no additional action is needed.