anything-llm/collector/README.md
Timothy Carambat c4eb46ca19
2023-06-16 16:01:27 -07:00

How to collect data for vectorizing

Run this process first. It lets you collect data from a variety of sources. The following services are currently supported:

  • YouTube Channels
  • Medium
  • Substack
  • Arbitrary Link
  • Gitbook
  • Local Files (.txt, .pdf, etc.). See full list.

These resources are under development or require a PR:

  • Twitter

Requirements

  • Python 3.8+
  • Google Cloud Account (for YouTube channels)
  • pandoc, for example via brew install pandoc (for .ODT document processing)
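
As a quick way to confirm these requirements before setup, the following is a minimal pre-flight sketch. The function name `check_requirements` is hypothetical and not part of the collector; it only assumes that pandoc, when installed, is discoverable on your PATH. It does not check the Google Cloud account.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the requirements above


def check_requirements(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for illustration; not part of the collector itself.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # pandoc is only needed for .ODT documents.
    if which("pandoc") is None:
        problems.append("pandoc not found on PATH (needed for .ODT processing)")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Running the file with the interpreter you plan to use (for example python3.9) prints one warning per missing requirement and nothing if both checks pass.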

Setup

This example uses python3.9, but any 3.8+ version works. Tested on macOS; untested on Windows.

  • Install virtualenv for python3.8+ before any other steps: python3.9 -m pip install virtualenv
  • cd collector from root directory
  • python3.9 -m virtualenv v-env
  • source v-env/bin/activate
  • pip install -r requirements.txt
  • cp .env.example .env
  • python main.py for interactive collection or python watch.py to process local documents.
  • Select the option you want and follow the prompts - Done!
  • Run deactivate to get back to your regular shell

Outputs

All JSON file data is cached in the output/ folder. This prevents redundant API calls to services that may have rate limits or quota caps. If you clear out the output/ folder, the script runs as if there were no cache.

As files are processed, you will see data being written to both the collector/outputs folder and the server/documents folder. Later in this process, once you boot up the server, you will bulk vectorize this content from a simple UI!

If collection fails at any point in the process, it will pick up where it last bailed out, so you do not spend credits on work that is already done.
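
The caching and resume behavior described above can be sketched as follows. This is an illustration of the general check-the-cache-first pattern, not the collector's actual code: the helper `fetch_with_cache`, the `fetch` callable, and the one-file-per-key naming are all assumptions.

```python
import json
from pathlib import Path


def fetch_with_cache(key, fetch, cache_dir="output"):
    """Return cached JSON for `key` if present; otherwise call `fetch` and cache it.

    Hypothetical helper: `fetch` stands in for any rate-limited API call.
    """
    cache_path = Path(cache_dir) / f"{key}.json"
    if cache_path.exists():
        # Cache hit: no API call is made, so no quota is spent.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch(key)  # the only place a remote call would happen
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Because each item is written as soon as it is fetched, rerunning a loop over the same keys after a failure skips everything already on disk, and deleting the cache folder forces every item to be fetched again.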

How to get a Google Cloud API Key (YouTube data collection only)

Required to fetch YouTube transcripts and data.

  • Have a Google account
  • Visit the GCP Cloud Console
  • Click on the dropdown in the top right > Create new project. Name it whatever you like
  • Enable YouTube Data API v3
  • Once enabled, generate a Credential key for this API
  • Paste your key after GOOGLE_APIS_KEY= in your collector/.env file.
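
To confirm the key was saved where the last step says, a small sketch like the one below can read it back. It assumes the .env file uses plain KEY=VALUE lines; `read_env_value` is a hypothetical helper written with the standard library only, and the collector may load the file differently.

```python
from pathlib import Path


def read_env_value(env_path, name):
    """Return the value assigned to `name` in a KEY=VALUE file, or None.

    Hypothetical helper for illustration; blank lines and # comments are skipped.
    """
    path = Path(env_path)
    if not path.exists():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip("'\"") or None
    return None
```

From the collector directory, `read_env_value(".env", "GOOGLE_APIS_KEY")` returns the key if it is set and None if the line is missing or empty.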

Running the document processing API locally

From the collector directory, with the v-env active, run flask run --host '0.0.0.0' --port 8888. Uploads from the frontend will then be processed as if you had run the watch.py script manually.

Docker: If you run this application via Docker, the API is already started for you and no additional action is needed.
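
To check that the document processing API is reachable before uploading, a minimal sketch is shown below. It only tests whether something accepts TCP connections on the port used above (8888); it does not verify that the listener is the Flask app, and `processor_is_up` is a hypothetical name.

```python
import socket


def processor_is_up(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: a successful connect does not prove the listener
    is the document processor, only that the port is open.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `processor_is_up()` with the defaults matches the flask run command above when it is run on the same machine.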