* implement custom PDFLoader to remove LC dep
* remove unneeded comment
* remove pdfjs as dep and fix page splitting using pdf-parse
* linting + export rename for desktop compat
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add more plaintext document types
org-mode, asciidoc, and reStructuredText are all text formats
Signed-off-by: Christian Romney <christian.a.romney@gmail.com>
* lint
---------
Signed-off-by: Christian Romney <christian.a.romney@gmail.com>
Co-authored-by: Christian Romney <christian.a.romney@gmail.com>
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329
* additional logging
* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
run model loading and file conversion concurrently
* update README
* update model size
* update supported filetypes
* wip: init refactor of document processor to JS
* add Node.js PDF support
* wip: parity with python processor
feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip: update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
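
The hotdir/tmp cleanup on collector boot mentioned above could be sketched roughly as follows. This is an illustration only, not the code from these commits: the helper names (`purgeFolder`, `wipeCollectorStorage`), the `hotdir` and `storage/tmp` layout, and the `.placeholder` filename are all assumptions.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: empty a directory but keep the directory itself and
// any placeholder file (assumed here to be named ".placeholder").
function purgeFolder(dir, keep = [".placeholder"]) {
  if (!fs.existsSync(dir)) return [];
  const removed = [];
  for (const entry of fs.readdirSync(dir)) {
    if (keep.includes(entry)) continue;
    // recursive + force so leftover sub-folders and read-only files go too
    fs.rmSync(path.join(dir, entry), { recursive: true, force: true });
    removed.push(entry);
  }
  return removed.sort();
}

// Hypothetical boot hook: clear files left behind by a previous run.
// The folder layout under `root` is an assumption for this sketch.
function wipeCollectorStorage(root) {
  return {
    hotdir: purgeFolder(path.join(root, "hotdir")),
    tmp: purgeFolder(path.join(root, "storage", "tmp")),
  };
}
```

Calling `wipeCollectorStorage(root)` once before the server starts accepting uploads would return the names of the stale files it removed, which is convenient for the additional logging.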