14 Commits

Author SHA1 Message Date
Sean Hatfield
b658f5012d
Support XLSX files (#2403)
* support xlsx files

* lint

* create separate docs for each xlsx sheet

* lint

* use node-xlsx pkg for parsing xlsx files

* lint

* update error handling

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-03 13:45:23 -07:00
Sean Hatfield
9b86bbd2b8
[FIX] PDFLoader module bug fix (#1879)
use pdf.js by importing it from pdf-parse and fix custom PDFLoader module
2024-07-16 13:09:43 -07:00
Sean Hatfield
79656718b2
[FEAT] Create custom pdfloader (#1852)
* implement custom PDFLoader to remove LC dep

* remove unneeded comment

* remove pdfjs as dep and fix page splitting using pdf-parse

* linting + export rename for desktop compat

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-11 12:26:11 -07:00
Sean Hatfield
a87014822a
[REFACTOR] Improve asPDF collector processor with pdfjs (#1791)
* WIP replace langchain pdfloader with pdfjs and add more context to each page

* remove extras from pdfjs and just replace langchain library

* remove unneeded dep

* fix console log in docs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-03 14:26:48 -07:00
Ken Kuang
a3b7239d05
Fix Cannot read properties of undefined (reading 'length') (#1145)
Fix upload failed
2024-04-20 12:28:19 -07:00
timothycarambat
117c3b2bfb
forgot epub file!
2024-04-02 14:30:20 -07:00
Sean Hatfield
45f50ce13c
[FIX] Update metadata tags in PDF collector script (#925)
update title in pdf collector script to be the filename instead of metadata title
2024-03-19 18:14:34 -07:00
Timothy Carambat
0ada882991
Support external transcription providers (#909)
* Support External Transcription providers

* patch files

* update docs

* fix return data
2024-03-14 15:43:26 -07:00
Timothy Carambat
d52f8aafd4
689 links in citation (#715)
* Include links in citations
force ChunkSource key to retain this information
old links will be unsupported

* show special icons depending on source

* remove console log

* reset server documents writeTo
2024-02-13 14:11:57 -08:00
Timothy Carambat
b35feede87
570 document api return object (#608)
* Add support for fetching single document in documents folder

* Add document object to upload + support link scraping via API

* hotfixes for documentation

* update api docs
2024-01-16 16:04:22 -08:00
timothycarambat
26549df6a9
touchup linting
2023-12-27 13:28:37 -08:00
timothycarambat
daadad3859
hoist var in extensions
2023-12-20 19:41:16 -08:00
Timothy Carambat
61db981017
feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449)
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329

* additional logging

* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
split loading of model and file conversion into concurrency

* update README

* update model size

* update supported filetypes
2023-12-15 11:20:13 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip: update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00