Commit Graph

20 Commits

Author SHA1 Message Date
Sean Hatfield
7edfccaf9a
Adding url uploads to document picker (#375)
* WIP adding url uploads to document picker

* fix manual script for uploading url to custom-documents

* fix metadata for url scraping

* wip url parsing

* update how async link scraping works

* docker-compose defaults added
no autocomplete on URLs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2023-11-16 17:15:01 -08:00
Sean Hatfield
f40309cfdb
Add id to all metadata to prevent errors in frontend document picker (#378)
add id to all metadata to prevent errors in frontend document picker

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2023-11-16 14:36:26 -08:00
timothycarambat
1e3d82e184 patch collector script 2023-11-16 10:25:23 -08:00
timothycarambat
c5dc68633b patch link scrape tool schema 2023-11-14 16:41:39 -08:00
Timothy Carambat
5441717294
normalize parser struct for all file types (#321) 2023-11-01 16:44:02 -07:00
Francisco Bischoff
26dba59249
mbox parsing improvements v1 (#308)
* mbox parsing improvements v1

* autobots roll out!
2023-10-30 11:57:33 -07:00
Timothy Carambat
a505928934
Display better error messages from document processor (#243)
pass messages to frontend on success/failure
resolves #242
2023-09-18 16:50:20 -07:00
Timothy Carambat
3e78476739
Franzbischoff document improvements (#241)
* cosmetic changes to be compatible to hadolint

* common configuration for most editors until better plugins come up

* Changes on PDF metadata, using PyMuPDF (faster and more compatible)

* small changes on other file ingestions in order to try to keep the fields equal

* Lint, review, and review

* fixed unknown chars

* Use PyMuPDF for pdf loading for 200% speed increase
linting

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2023-09-18 16:21:37 -07:00
Timothy Carambat
b42493c6de
Split large PDFs into subfolder in documents (#176)
append time value to folder name to prevent duplicate uploads
2023-08-03 18:57:50 -07:00
AntonioCiolino
31e5db7490
Twitter Feature (#134)
* .

* twitter feature update

* Key validation and operation
2023-07-06 14:05:50 -07:00
Timothy Carambat
d7315b0e53
be able to parse relative and FQDN links from root reliably (#138) 2023-07-05 14:40:54 -07:00
mplawner
3efe55a720
Added mbox support (#106)
* Update filetypes.py

Added mbox format

* Created new file

Added support for mbox files as used by many email services, including Google Takeout's Gmail archive.

* Update filetypes.py

* Update as_mbox.py
2023-06-25 18:11:05 -07:00
AntonioCiolino
a52b0ae655
Updated Link scraper to avoid NoneType error. (#90)
* Enable web scraping based on a URL and a simple filter.

* ignore yarn

* Updated Link scraper to avoid NoneType error.
2023-06-19 12:07:26 -07:00
frasergr
4079020de0
dockerfile cleanup; enforce text LF line endings (#81) 2023-06-17 20:18:01 -07:00
AntonioCiolino
e7ba028497
Enable web scraping based on a URL and a simple filter. (#73) 2023-06-16 17:29:11 -07:00
Timothy Carambat
c4eb46ca19
Upload and process documents via UI + document processor in docker image (#65)
* implement dnd uploader
show file upload progress
write files to hot directory
build simple flaskAPI to process files one off

* move document processor calls to util
build out dockerfile to run both procs at the same time
update UI to check for document processor before upload
* disable pragma update on boot
* dockerfile changes

* add filetype restrictions based on python app support response and show rejected files in the UI

* cleanup

* stub migrations on boot to prevent exit condition

* update CF template for AWS deploy
2023-06-16 16:01:27 -07:00
Skid Vis
4118c9dcf3
Blocks images in sitemaps from being parsed. (#56)
* Adds ability to import sitemaps to include a website

* adds example sitemap url

* adds filter to bypass common image formats

* moves filetype ignoring to sitemap script
2023-06-14 23:00:03 -07:00
Skid Vis
bd32f97a21
Adds ability to import sitemaps to include a website (#51)
* Adds ability to import sitemaps to include a website

* adds example sitemap url
2023-06-14 11:04:17 -07:00
frasergr
9f33b3dfcb
Docker support (#34)
* Updates for Linux for frontend/server

* frontend/server docker

* updated Dockerfile for deps related to node vectordb

* updates for collector in docker

* docker deps for ODT processing

* ignore another collector dir

* storage mount improvements; run as UID

* fix pypandoc version typo

* permissions fixes
2023-06-13 11:26:11 -07:00
timothycarambat
27c58541bd initial commit 2023-06-03 19:28:07 -07:00