How to collect data for vectorizing

This process should be run first. It enables you to collect data from a variety of sources. Currently the following services are supported:

  • YouTube Channels
  • Medium
  • Substack
  • Arbitrary Link
  • Gitbook
  • Local Files (.txt, .pdf, etc; see hotdir/__HOTDIR__.md for the full list of supported file types)
  • Twitter

Some of these sources are still under development or require a PR.

Requirements

  • Python 3.8+
  • Google Cloud Account (for YouTube channels)
  • pandoc, for .ODT document processing (on macOS: brew install pandoc)

Setup

This example uses python3.9, but any version 3.8+ will work. Tested on macOS; untested on Windows. The full command sequence is shown after the list below.

  • install virtualenv for python3.8+ before any other steps: python3.9 -m pip install virtualenv
  • cd collector from root directory
  • python3.9 -m virtualenv v-env
  • source v-env/bin/activate
  • pip install -r requirements.txt
  • cp .env.example .env
  • python main.py for interactive collection or python watch.py to process local documents.
  • Select the option you want and follow the prompts - Done!
  • run deactivate to get back to your regular shell
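
For reference, here is the full sequence as a minimal sketch, assuming python3.9 on macOS/Linux and that you start from the repository root:

    # install virtualenv once, before any other steps
    python3.9 -m pip install virtualenv

    cd collector
    python3.9 -m virtualenv v-env
    source v-env/bin/activate
    pip install -r requirements.txt
    cp .env.example .env

    # interactive collection (or: python watch.py to process local documents)
    python main.py

    # return to your regular shell when done
    deactivate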

Outputs

All JSON file data is cached in the output/ folder. This prevents redundant API calls to services which may have rate limits or quota caps. Clearing out the output/ folder will make the script run as if there were no cache.
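
For example, to force a full re-run you can clear the cache from the collector directory (a minimal sketch; this assumes you are willing to re-spend any API quota the next run requires):

    rm -rf output/*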

As files are processed you will see data being written to both the collector/outputs folder as well as the server/documents folder. Later in this process, once you boot up the server, you can bulk vectorize this content from a simple UI!

If collection fails at any point in the process, it will pick up where it last bailed out, so you are not wasting API credits.

Running the document processing API locally

From the collector directory, with the v-env active, run flask run --host '0.0.0.0' --port 8888. Uploads from the frontend will now be processed as if you had run the watch.py script manually.
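
A minimal sketch of that, assuming the virtual environment from the Setup section already exists and you start from the repository root:

    cd collector
    source v-env/bin/activate
    flask run --host '0.0.0.0' --port 8888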

Docker: If you run this application via Docker, the API is already started for you and no additional action is needed.

How to get a Google Cloud API Key (YouTube data collection only)

required to fetch YouTube transcripts and data

  • Have a Google account
  • Visit the GCP Cloud Console
  • Click on the dropdown in the top right > Create new project. Name it whatever you like
  • Enable the YouTube Data API v3
  • Once enabled, generate a credential key for this API
  • Paste your key after GOOGLE_APIS_KEY= in your collector/.env file, as in the example below.
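
The relevant line in collector/.env would then look like the following (the value shown is a placeholder, not a real key):

    GOOGLE_APIS_KEY=your-google-api-key-here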

Using the Twitter API

required to get data from Twitter with tweepy

  • Go to https://developer.twitter.com/en/portal/dashboard with your Twitter account
  • Create a new Project and App
    • Get your 4 keys and populate your collector/.env file with their values (see the example below):
    • TW_CONSUMER_KEY
    • TW_CONSUMER_SECRET
    • TW_ACCESS_TOKEN
    • TW_ACCESS_TOKEN_SECRET
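
A sketch of the corresponding collector/.env entries (the values are placeholders, not real credentials):

    TW_CONSUMER_KEY=your-consumer-key
    TW_CONSUMER_SECRET=your-consumer-secret
    TW_ACCESS_TOKEN=your-access-token
    TW_ACCESS_TOKEN_SECRET=your-access-token-secret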