How to collect data for vectorizing
Run this process first. It will let you collect data from a variety of sources. The following services are currently supported:
- YouTube Channels
- Medium
- Substack
- Arbitrary Link
- Gitbook
- Local Files (.txt, .pdf, etc.) (see the full list; some of these resources are under development or require a PR)
Requirements
- Python 3.8+
- Google Cloud Account (for YouTube channels)
- pandoc (for .ODT document processing): install it with `brew install pandoc` on macOS (you can sanity-check the prerequisites with the snippet below)
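As a quick sanity check that the prerequisites are available, here is a small illustrative snippet (not part of the collector itself):

```python
# Check the interpreter version and that pandoc is on the PATH.
# pandoc is only needed for .ODT document processing.
import shutil
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print("python:", sys.version.split()[0])
print("pandoc:", shutil.which("pandoc") or "not found (needed only for .ODT)")
```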
Setup
This example uses python3.9, but any version 3.8+ will work. Tested on macOS; untested on Windows.
- Install virtualenv for python3.8+ first, before any other steps: `python3.9 -m pip install virtualenv`
- `cd collector` from the root directory
- `python3.9 -m virtualenv v-env`
- `source v-env/bin/activate`
- `pip install -r requirements.txt`
- `cp .env.example .env`
- Run `python main.py` for interactive collection, or `python watch.py` to process local documents
- Select the option you want and follow the prompts. Done!
- Run `deactivate` to get back to your regular shell
Outputs
All JSON file data is cached in the `output/` folder. This prevents redundant API calls to services which may have rate limits or quota caps. Clearing out the `output/` folder will execute the script as if there was no cache.
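As a rough sketch of this caching pattern (the helper name and per-source file layout here are assumptions for illustration, not the collector's actual internals):

```python
# Illustrative sketch: cache service responses as JSON under output/.
# fetch_from_service and the file naming scheme are hypothetical.
import json
from pathlib import Path

OUTPUT_DIR = Path("output")

def cached_fetch(source_id: str, fetch_from_service) -> dict:
    """Return cached JSON for source_id, calling the service only on a miss."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    cache_file = OUTPUT_DIR / f"{source_id}.json"
    if cache_file.exists():
        # Cache hit: no API call, so no rate-limit or quota cost.
        return json.loads(cache_file.read_text())
    data = fetch_from_service(source_id)  # this call spends API quota
    cache_file.write_text(json.dumps(data))
    return data
```

Because completed items already live on disk, re-running the script naturally skips them, which is also what makes the resume-on-failure behavior described below cheap.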
As files are processed you will see data being written to both the `collector/outputs` folder as well as the `server/documents` folder. Later in this process, once you boot up the server, you will be able to bulk vectorize this content from a simple UI!
If collection fails at any point in the process, it will pick up where it last left off, so you are not re-spending API credits.
How to get a Google Cloud API Key (YouTube data collection only)
Required to fetch YouTube transcripts and data.
- Have a Google account
- Visit the GCP Cloud Console
- Click on the dropdown in the top right > Create new project. Name it whatever you like
- Enable the YouTube Data API v3
- Once enabled, generate a credential (API key) for this API
- Paste your key after `GOOGLE_APIS_KEY=` in your `collector/.env` file.
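If you want to confirm the key is picked up, here is a minimal sketch using python-dotenv (an assumption for illustration; the collector may load its environment differently):

```python
# Assumes python-dotenv is installed and you run this from collector/.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment
api_key = os.getenv("GOOGLE_APIS_KEY")
if not api_key:
    raise SystemExit("GOOGLE_APIS_KEY is not set in collector/.env")
print("Google APIs key loaded:", api_key[:4] + "...")
```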
Running the document processing API locally
From the `collector` directory, with the `v-env` active, run `flask run --host '0.0.0.0' --port 8888`.
Now uploads from the frontend will be processed as if you ran the `watch.py` script manually.
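To verify the API is reachable before uploading, you can hit the port it listens on. Only the host and port here come from the command above; the root path and response shape are assumptions:

```python
# Minimal reachability check against the local document-processing API.
import requests

resp = requests.get("http://localhost:8888/", timeout=5)
print(resp.status_code, resp.reason)  # any HTTP response means Flask is up
```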
Docker: If you run this application via Docker, the API is already started for you and no additional action is needed.