Commit Graph

9 Commits

Author SHA1 Message Date
timothycarambat
1e3d82e184 patch collector script 2023-11-16 10:25:23 -08:00
timothycarambat
c5dc68633b patch link scrape tool schema 2023-11-14 16:41:39 -08:00
Timothy Carambat
d7315b0e53
be able to parse relative and FQDN links from root reliably (#138) 2023-07-05 14:40:54 -07:00
AntonioCiolino
a52b0ae655
Updated Link scraper to avoid NoneType error. (#90)
* Enable web scraping based on a URL and a simple filter.

* ignore yarn

* Updated Link scraper to avoid NoneType error.
2023-06-19 12:07:26 -07:00
AntonioCiolino
e7ba028497
Enable web scraping based on a URL and a simple filter. (#73) 2023-06-16 17:29:11 -07:00
Skid Vis
4118c9dcf3
Blocks images in sitemaps from being parsed. (#56)
* Adds ability to import sitemaps to include a website

* adds example sitemap url

* adds filter to bypass common image formats

* moves filetype ignoring to sitemap script
2023-06-14 23:00:03 -07:00
Skid Vis
bd32f97a21
Adds ability to import sitemaps to include a website (#51)
* Adds ability to import sitemaps to include a website

* adds example sitemap url
2023-06-14 11:04:17 -07:00
frasergr
9f33b3dfcb
Docker support (#34)
* Updates for Linux for frontend/server

* frontend/server docker

* updated Dockerfile for deps related to node vectordb

* updates for collector in docker

* docker deps for ODT processing

* ignore another collector dir

* storage mount improvements; run as UID

* fix pypandoc version typo

* permissions fixes
2023-06-13 11:26:11 -07:00
timothycarambat
27c58541bd initial commit 2023-06-03 19:28:07 -07:00