* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones
* chore: formatting as per yarn lint
* chore: fixing the human readable confluence url fetch baseUrl
* refactor implementation of various types of Confluence URL patterns
---------
Co-authored-by: Predrag Stojadinovic <predrag@stojadinovic.net>
Co-authored-by: Predrag Stojadinović <cope@users.noreply.github.com>
Co-authored-by: Predrag Stojadinovic <predrags@nvidia.com>
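A rough sketch of the URL handling described above, written in Python for illustration only. The function name `parse_confluence_url`, the returned fields, and the exact path patterns are assumptions; the connector's real matching rules may differ. It accepts both `{subdomain}.atlassian.net` hosts and custom hosts, and recognizes `/wiki/spaces/`, `/spaces/`, and `/display/` style paths.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns; the real connector may match differently.
_SPACE_PATTERNS = [
    re.compile(r"^/wiki/spaces/(?P<space>[^/]+)"),  # cloud-style path
    re.compile(r"^/spaces/(?P<space>[^/]+)"),       # same layout without /wiki
    re.compile(r"^/display/(?P<space>[^/]+)"),      # human-readable style
]


def parse_confluence_url(url):
    """Return base URL and space key for a Confluence space URL, or None."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None
    for pattern in _SPACE_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            return {
                "base_url": f"{parsed.scheme}://{parsed.netloc}",
                "space_key": match.group("space"),
                # True only for the default {subdomain}.atlassian.net hosts
                "is_cloud": parsed.hostname.endswith(".atlassian.net"),
            }
    return None
```

Keeping the patterns in one list means a new URL shape can be supported by adding a single entry rather than another branch.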
* Updated apt-packages source for devcontainer
Switched the devcontainer's package source to a different repository to
align with updated dependencies and package availability. The previous
source from 'rocker-org' is replaced with 'devcontainers-contrib', which
may offer more recent or relevant development tools.
* Centralize prettier ignores and refine config
Centralized all prettier ignore rules by removing individual
`.prettierignore` files in subprojects and updating the root
`.prettierignore` to include previously ignored patterns, ensuring
consistency across the workspace. Additionally, the prettier
configuration was refined by making the file pattern for `.config.js`
files consistent and adjusting quote styles for better readability. All
lint scripts across the project were updated to respect the centralized
ignore path, enhancing maintainability.
The consolidation simplifies the process of managing ignore rules as the
project scales, ensuring developers can focus on writing code without
worrying about divergent formatting standards. These changes also align
with introducing comprehensive linting across multiple environments to
keep the codebase clean and consistent.
This adjustment is a foundational step towards a more streamlined and
unified code base, making it easier for new contributors to adhere to
established coding standards and reducing the cognitive load associated
with managing multiple configuration files across the project.
* unset package json changes
---------
Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones
* chore: formatting as per yarn lint
* chore: adding /display/ url matching to confluence data connector
* WIP data connector redesign
* new UI for data connectors complete
* remove old data connector page/cleanup imports
* cleanup of UI and imports
* Remove YouTube Transcript dep and move in-house
* lang pref default to en
---------
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
* Add more plaintext document types
org-mode, asciidoc, and reStructuredText are all text formats
Signed-off-by: Christian Romney <christian.a.romney@gmail.com>
* lint
---------
Signed-off-by: Christian Romney <christian.a.romney@gmail.com>
Co-authored-by: Christian Romney <christian.a.romney@gmail.com>
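As an illustration of treating these formats as plain text, here is a minimal Python sketch. The set name `PLAINTEXT_EXTENSIONS`, the helper `is_plaintext_document`, and the exact extension list are assumptions rather than the project's actual table; `.org`, `.adoc`, and `.rst` are the conventional extensions for org-mode, asciidoc, and reStructuredText.

```python
from pathlib import Path

# Hypothetical extension table; the project's real list may be longer.
PLAINTEXT_EXTENSIONS = {".txt", ".md", ".org", ".adoc", ".rst"}


def is_plaintext_document(filename):
    """Return True when the file can be read as plain text without conversion."""
    return Path(filename).suffix.lower() in PLAINTEXT_EXTENSIONS
```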
* Include links in citations
force ChunkSource key to retain this information
old links will be unsupported
* show special icons depending on source
* remove console log
* reset server documents writeTo
* Add support for fetching single document in documents folder
* Add document object to upload + support link scraping via API
* hotfixes for documentation
* update api docs
* Update build process to support multi-platform builds
Bump @lancedb/vectordb to 0.1.19 for ARM&AMD compatibility
Patch puppeteer on ARM builds because of broken chromium
resolves #539, resolves #548
---------
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Implement support for GitHub codespaces and VSCode devcontainers
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
* feat: implement github repo loading
fix: purge of folders
fix: rendering of sub-files
* do not show delete on custom-documents
* Add API key support because of rate limits
* WIP for frontend of data connectors
* wip
* Add frontend form for GitHub repo data connector
* remove console.logs
block custom-documents from being deleted
* remove _meta unused arg
* Add support for ignore pathing in request
Ignore path input via tagging
* Update hint
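The ignore-path support mentioned above could look roughly like the following Python sketch. The helper name `filter_ignored` and the use of shell-style globs are assumptions; the connector may interpret ignore entries differently.

```python
from fnmatch import fnmatchcase


def filter_ignored(paths, ignore_patterns):
    """Drop every path that matches at least one ignore pattern.

    Patterns are shell-style globs (an assumption); note that `*` in
    fnmatch also matches across `/` separators.
    """
    return [
        path
        for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in ignore_patterns)
    ]
```

For example, passing `["node_modules/*", "*.md"]` as the patterns would skip dependency folders and Markdown files while keeping the remaining source files.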
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329
* additional logging
* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
split loading of model and file conversion into concurrency
* update README
* update model size
* update supported filetypes
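The boot-time cleanup of the hotdir and tmp folders described above can be sketched as follows, in Python for illustration. The function name `purge_directory` and the placeholder filename `.placeholder` are hypothetical; the idea is only that everything except the placeholder is removed so no files are left hanging between runs.

```python
import shutil
from pathlib import Path


def purge_directory(directory, keep=(".placeholder",)):
    """Delete everything in `directory` except the names listed in `keep`.

    Returns the sorted names of the removed entries. A missing directory
    is treated as already clean.
    """
    root = Path(directory)
    removed = []
    if not root.is_dir():
        return removed
    for entry in root.iterdir():
        if entry.name in keep:
            continue  # placeholder keeps the folder tracked in git
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```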
* wip: init refactor of document processor to JS
* add NodeJs PDF support
* wip: parity with python processor
feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip: update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
* cosmetic changes to be compatible with hadolint
* common configuration for most editors until better plugins come up
* Changes on PDF metadata, using PyMuPDF (faster and more compatible)
* small changes on other file ingestions in order to try to keep the fields equal
* Lint, review, and review
* fixed unknown chars
* Use PyMuPDF for pdf loading for 200% speed increase
linting
---------
Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
* Update filetypes.py
Added mbox format
* Created new file
Added support for mbox files as used by many email services, including Google Takeout's Gmail archive.
* Update filetypes.py
* Update as_mbox.py
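A minimal sketch of reading an mbox archive with Python's standard-library `mailbox` module. The function name `read_mbox` and the returned fields are hypothetical and are not taken from `as_mbox.py`; only the first `text/plain` part of each message is kept.

```python
import mailbox


def read_mbox(path):
    """Yield one dict per message with sender, subject, and plain-text body."""
    box = mailbox.mbox(path, create=False)  # do not create a missing file
    try:
        for message in box:
            body = ""
            # walk() also visits the message itself when it is not multipart
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                raw = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    body = raw.decode(charset, errors="replace")
                except LookupError:
                    # unknown charset label; fall back to UTF-8
                    body = raw.decode("utf-8", errors="replace")
                break
            yield {
                "from": message["From"],
                "subject": message["Subject"],
                "body": body,
            }
    finally:
        box.close()
```

Because the function is a generator, a large archive such as a Google Takeout export can be processed one message at a time instead of being loaded into memory at once.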