Commit Graph

159 Commits

Author SHA1 Message Date
Timothy Carambat
a598c8e04c
1347 human readable confluence url (#1706)
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones

* chore: formatting as per yarn lint

* chore: fixing the human readable confluence url fetch baseUrl

* refactor implementation of various types of Confluence URL patterns

---------

Co-authored-by: Predrag Stojadinovic <predrag@stojadinovic.net>
Co-authored-by: Predrag Stojadinović <cope@users.noreply.github.com>
Co-authored-by: Predrag Stojadinovic <predrags@nvidia.com>
2024-06-17 16:04:20 -07:00
timothycarambat
393772c4a5 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-06-12 09:05:57 -07:00
Timothy Carambat
98cef508a6
Feature/devcontv2 (#1622)
* Updated apt-packages source for devcontainer

Switched the devcontainer's package source to a different repository to
align with updated dependencies and package availability. The previous
source from 'rocker-org' is replaced with 'devcontainers-contrib', which
may offer more recent or relevant development tools.

* Centralize prettier ignores and refine config

Centralized all prettier ignore rules by removing individual
`.prettierignore` files in subprojects and updating the root
`.prettierignore` to include previously ignored patterns, ensuring
consistency across the workspace. Additionally, the prettier
configuration was refined by making the file pattern for `.config.js`
files consistent and adjusting quote styles for better readability. All
lint scripts across the project were updated to respect the centralized
ignore path, enhancing maintainability.

The consolidation simplifies the process of managing ignore rules as the
project scales, ensuring developers can focus on writing code without
worrying about divergent formatting standards. These changes also align
with introducing comprehensive linting across multiple environments to
keep the codebase clean and consistent.

This adjustment is a foundational step towards a more streamlined and
unified code base, making it easier for new contributors to adhere to
established coding standards and reducing the cognitive load associated
with managing multiple configuration files across the project.

* unset package json changes

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2024-06-06 12:50:42 -07:00
Chris Daniel
8a4dd2bdf5
[FEAT] add support for TSX files to be parsed as text (#1597)
add support for TSX files to be parsed as text
2024-06-03 17:01:41 +08:00
Sean Hatfield
9a38b32c74
[FEAT] Add support for R files to be parsed as text (#1577)
add support for R files to be parsed as text
2024-05-31 13:52:00 +08:00
Sean Hatfield
4324a8bb4f
[FEAT] Github repo loader bug fix (#1558)
* fix project names with special characters for github repo data connector

* linting
2024-05-29 17:01:29 +08:00
timothycarambat
6e8a327d98 merge with master 2024-05-23 12:58:36 -07:00
Timothy Carambat
a89812703b
repatch path normalization (#1516) 2024-05-23 12:52:04 -07:00
timothycarambat
05488c81e0 undo path norm whitespace fix 2024-05-23 12:04:00 -07:00
timothycarambat
c6ad94d81a Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-05-22 13:43:09 -05:00
timothycarambat
e208074ef4 patch path normalization 2024-05-22 11:50:01 -05:00
timothycarambat
c65ab6d863 merge with master 2024-05-21 14:48:16 -05:00
Timothy Carambat
1a5aacb001
Support multi-model whispers (#1444) 2024-05-17 21:31:29 -07:00
Timothy Carambat
7e0b638a2c
Patch confluence URL patterns (#1426)
* patch confluence patterns

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-05-16 14:15:59 -07:00
timothycarambat
87b41a60e9 refactor spaceKey url pattern for custom domains 2024-05-16 11:01:34 -07:00
Predrag Stojadinović
cf969adf37
1362 custom display confluence url (#1423)
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones

* chore: formatting as per yarn lint

* chore: adding /display/ url matching to confluence data connector
2024-05-16 10:46:18 -07:00
timothycarambat
d603d0fd51 patch:update storage for bulk-website scraper for render 2024-05-14 12:59:14 -07:00
timothycarambat
c8dac6177a Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-05-14 12:57:44 -07:00
timothycarambat
b5ac944475 patch: bulk-scraper, update when folder is made and path creation params 2024-05-14 12:57:23 -07:00
timothycarambat
72c9fda6c9 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-05-14 12:50:17 -07:00
Sean Hatfield
612a7e1662
[FEAT] Website depth scraping data connector (#1191)
* WIP website depth scraping, (sort of works)

* website depth data connector stable + add maxLinks option

* linting + loading small ui tweak

* refactor website depth data connector for stability, speed, & readability

* patch: remove console log
Guard clause on URL validity check
reasonable overrides

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-05-14 12:49:14 -07:00
jazelly
d71db22799
fix: skip undefined confluence pageContent (#1383)
Refs: https://github.com/Mintplex-Labs/anything-llm/issues/1381

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-05-14 10:22:13 -07:00
Predrag Stojadinović
78e3e35d27
[FEAT] Confluence Data Connector handles custom Confluence urls (#1362)
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones

* chore: formatting as per yarn lint
2024-05-14 10:21:04 -07:00
timothycarambat
c60077a078 merge with master 2024-05-03 10:02:53 -07:00
timothycarambat
2d215acb75 patch storage dirs for extensions 2024-05-02 14:03:10 -07:00
timothycarambat
1aa8e5766f duplicate key (no impact) 2024-05-02 13:05:20 -07:00
timothycarambat
6150ff41ea Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-05-01 13:33:07 -07:00
Timothy Carambat
547d4859ef
Bump openai package to latest (#1234)
* Bump `openai` package to latest
Tested all except localai

* bump LocalAI support with latest image

* add deprecation notice

* linting
2024-04-30 12:33:42 -07:00
Timothy Carambat
94017e2b51
bump langchain deps (#1231)
* bump langchain deps

* patch native and ollama providers remove deprecated deps

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-30 12:04:24 -07:00
Sean Hatfield
348b36bf85
[FEAT] Confluence data connector (#1181)
* WIP Confluence data connector backend

* confluence data connector complete

* confluence citations

* fix citation for confluence

* Patch confluence integration

* fix Citation Icon for confluence

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-04-25 17:53:38 -07:00
timothycarambat
e1372a81d4 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-04-20 18:22:41 -07:00
Ken Kuang
a3b7239d05
Fix Cannot read properties of undefined (reading 'length') (#1145)
Fix upload failed
2024-04-20 12:28:19 -07:00
timothycarambat
45505630a6 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-04-17 11:55:57 -07:00
Timothy Carambat
a5bb77f97a
Agent support for @agent default agent inside workspace chat (#1093)
V1 of agent support via built-in `@agent` that can be invoked alongside normal workspace RAG chat.
2024-04-16 10:50:10 -07:00
timothycarambat
fde4e5400f Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-04-12 14:57:46 -07:00
Sean Hatfield
af84b01482
[FIX] GitHub repo with periods in link fix (#1084)
fix periods in github repo links bug
2024-04-12 14:56:59 -07:00
Timothy Carambat
2c6135aa54
patch file types as plaintext (#1095)
resolves #1089
2024-04-12 14:54:33 -07:00
timothycarambat
75ced7e65a merge with master
Patch LLM selection for native to be disabled
2024-04-07 14:55:18 -07:00
Timothy Carambat
1f8ab0d245
Remove YoutubeLoader dependency (#1050)
* WIP data connector redesign

* new UI for data connectors complete

* remove old data connector page/cleanup imports

* cleanup of UI and imports

* Remove Youtube Transcript dep and move in-house

* lang pref default to en

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-05 16:33:01 -07:00
timothycarambat
2638098d49 patch with master 2024-04-05 09:45:28 -07:00
timothycarambat
0b454016cf patch comkey path to fallback 2024-04-04 10:47:26 -07:00
timothycarambat
a4c1d42e41 merge with master 2024-04-02 14:33:32 -07:00
timothycarambat
e524afae9e Merge branch 'master' of github.com:Mintplex-Labs/anything-llm 2024-04-02 14:30:27 -07:00
timothycarambat
117c3b2bfb forgot epub file! 2024-04-02 14:30:20 -07:00
Timothy Carambat
4fb4aa2041
Add epub support for parsing (#1017) 2024-04-02 14:25:52 -07:00
Timothy Carambat
752e3e22ed
Add more text file forced extensions (#1016) 2024-04-02 14:13:11 -07:00
Timothy Carambat
f4088d9348
RSA-Signing on server<->collector communication via API (#1005)
* WIP integrity check between processes

* Implement integrity checking on document processor payloads
2024-04-01 13:56:35 -07:00
timothycarambat
971c54e2c8 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-03-26 14:12:09 -07:00
Sean Hatfield
45f50ce13c
[FIX] Update metadata tags in PDF collector script (#925)
update title in pdf collector script to be the filename instead of metadata title
2024-03-19 18:14:34 -07:00
timothycarambat
540d18ec84 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-03-18 09:52:11 -07:00
Timothy Carambat
0ada882991
Support external transcription providers (#909)
* Support External Transcription providers

* patch files

* update docs

* fix return data
2024-03-14 15:43:26 -07:00
timothycarambat
429ea0c805 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-03-12 12:29:57 -07:00
Timothy Carambat
0f31e43fd4
bump YT metadata lib for YT api fix rot (#888) 2024-03-11 10:57:53 -07:00
timothycarambat
65f8a01505 merge with master 2024-03-06 16:43:36 -08:00
Timothy Carambat
ec90060d36
Re-map some file mimes to support text (#842)
re-map some file mimes to support text
2024-02-29 10:05:03 -08:00
timothycarambat
2b6e1db79b merge with master 2024-02-27 23:12:09 -08:00
Timothy Carambat
6d18d79bb7
Generic upload fallback as text file. (#808)
* Do not block any file upload
fallback unknown/unsupported types to text if possible

* reduce call for frontend

* patch
2024-02-26 13:43:54 -08:00
timothycarambat
ae01785220 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-02-21 15:11:45 -08:00
Timothy Carambat
d89610586a
improve error messages from YT scraping (#768)
parse & enforce URL to allow multiple URL schemas
2024-02-21 10:47:10 -08:00
Timothy Carambat
49fbd09af4
Support more plaintext filetypes (#757)
* Add more plaintext document types

org-mode, asciidoc, and reStructuredText are all text formats

Signed-off-by: Christian Romney <christian.a.romney@gmail.com>

* lint

---------

Signed-off-by: Christian Romney <christian.a.romney@gmail.com>
Co-authored-by: Christian Romney <christian.a.romney@gmail.com>
2024-02-19 10:44:01 -08:00
Timothy Carambat
d52f8aafd4
689 links in citation (#715)
* Include links in citations
force ChunkSource key to retain this information
old links will be unsupported

* show special icons depending on source

* remove console log

* reset server documents writeTo
2024-02-13 14:11:57 -08:00
Timothy Carambat
48cb8f2897
Add support to upload rawText document via api (#692)
* Add support to upload rawText document via api

* update API doc endpoint with correct textContent key

* update response swagger doc
2024-02-07 15:17:32 -08:00
Sean Hatfield
288ff0d18c
fix vector cache not deleting cache after unembedding items with folders (#630) 2024-01-22 13:03:05 -08:00
Timothy Carambat
0db6c3b2aa
Prevent private octets from link collection for self-hosted (#626) 2024-01-19 10:49:40 -08:00
timothycarambat
addb3d0c3e Update Render.com image for AnythingLLM to latest 2024-01-17 18:12:25 -08:00
Timothy Carambat
b35feede87
570 document api return object (#608)
* Add support for fetching single document in documents folder

* Add document object to upload + support link scraping via API

* hotfixes for documentation

* update api docs
2024-01-16 16:04:22 -08:00
Timothy Carambat
1563a1b20f
Strict link protocol validation (#577) 2024-01-11 12:29:00 -08:00
timothycarambat
a48a5ad6ad Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2024-01-08 17:01:23 -08:00
Timothy Carambat
58971e8b30
Build & Publish AnythingLLM for ARM64 and x86 (#549)
* Update build process to support multi-platform builds
Bump @lancedb/vectordb to 0.1.19 for ARM&AMD compatibility
Patch puppeteer on ARM builds because of broken chromium
resolves #539
resolves #548

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-01-08 16:15:01 -08:00
Francisco Bischoff
990a2e85bf
devcontainer v1 (#297)
Implement support for GitHub codespaces and VSCode devcontainers
---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
2024-01-08 15:31:06 -08:00
timothycarambat
26549df6a9 touchup linting 2023-12-27 13:28:37 -08:00
timothycarambat
daadad3859 hoist var in extensions 2023-12-20 19:41:16 -08:00
timothycarambat
1ca06cc3e1 Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into render 2023-12-19 16:23:19 -08:00
Timothy Carambat
f2fadd6d2e
Add placeholder collector ENV file (#476)
resolves #474
2023-12-19 13:27:09 -08:00
timothycarambat
0eb2fe7248 Map .env to storage .env file
map writeToServerDocuments to resolve to fixed storage mount for Render
2023-12-19 11:35:20 -08:00
Timothy Carambat
ecf4295537
Add ability to grab youtube transcripts via doc processor (#470)
* Add ability to grab youtube transcripts via doc processor

* dynamic imports
swap out Github for Youtube in placeholder text
2023-12-18 17:17:26 -08:00
Timothy Carambat
452582489e
GitHub loader extension + extension support v1 (#469)
* feat: implement github repo loading
fix: purge of folders
fix: rendering of sub-files

* noshow delete on custom-documents

* Add API key support because of rate limits

* WIP for frontend of data connectors

* wip

* Add frontend form for GitHub repo data connector

* remove console.logs
block custom-documents from being deleted

* remove _meta unused arg

* Add support for ignore pathing in request
Ignore path input via tagging

* Update hint
2023-12-18 15:48:02 -08:00
timothycarambat
d2e3506bb9 fix: transition on LLM and embedding screen
linting
2023-12-15 12:40:11 -08:00
Timothy Carambat
61db981017
feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449)
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329

* additional logging

* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
split loading of model and file conversion into concurrency

* update README

* update model size

* update supported filetypes
2023-12-15 11:20:13 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip:update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00
Timothy Carambat
da0cec7aa2
patch: remove unidecode as it was transliterating non-latin chars (#434)
resolves #298
2023-12-13 11:54:55 -08:00
Timothy Carambat
ce9233c258
feat: enable HTML uploads from UI (#422)
resolves #418
2023-12-11 14:40:33 -08:00
timothycarambat
b583aa74fd remove prints 2023-11-16 17:17:52 -08:00
Sean Hatfield
7edfccaf9a
Adding url uploads to document picker (#375)
* WIP adding url uploads to document picker

* fix manual script for uploading url to custom-documents

* fix metadata for url scraping

* wip url parsing

* update how async link scraping works

* docker-compose defaults added
no autocomplete on URLs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2023-11-16 17:15:01 -08:00
Sean Hatfield
f40309cfdb
Add id to all metadata to prevent errors in frontend document picker (#378)
add id to all metadata to prevent errors in frontend document picker

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2023-11-16 14:36:26 -08:00
timothycarambat
1e3d82e184 patch collector script 2023-11-16 10:25:23 -08:00
timothycarambat
c5dc68633b patch link scrape tool schema 2023-11-14 16:41:39 -08:00
Timothy Carambat
5441717294
normalize parser struct for all file types (#321) 2023-11-01 16:44:02 -07:00
Francisco Bischoff
26dba59249
mbox parsing improvements v1 (#308)
* mbox parsing improvements v1

* autobots roll out!
2023-10-30 11:57:33 -07:00
Timothy Carambat
18798c5b64
prevent deletion of documents not in hotdir via directory traversal (#258)
resolves #257
2023-09-29 11:04:47 -07:00
Timothy Carambat
a505928934
Display better error messages from document processor (#243)
pass messages to frontend on success/failure
resolves #242
2023-09-18 16:50:20 -07:00
Timothy Carambat
3e78476739
Franzbischoff document improvements (#241)
* cosmetic changes to be compatible to hadolint

* common configuration for most editors until better plugins come up

* Changes on PDF metadata, using PyMuPDF (faster and more compatible)

* small changes on other file ingestions in order to try to keep the fields equal

* Lint, review, and review

* fixed unknown chars

* Use PyMuPDF for pdf loading for 200% speed increase
linting

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2023-09-18 16:21:37 -07:00
Melroy van den Berg
16b8330fbf
Update requirements.txt (#185)
Upgrade fake-useragent to latest version (v1.2.1). Disclaimer: I'm the package maintainer.
2023-08-14 14:38:14 -07:00
Timothy Carambat
b42493c6de
Split large PDFs into subfolder in documents (#176)
append time value to folder name to prevent duplicate uploads
2023-08-03 18:57:50 -07:00
AntonioCiolino
31e5db7490
Twitter Feature (#134)
* .

* twitter feature update

* Key validation and operation
2023-07-06 14:05:50 -07:00
Timothy Carambat
d7315b0e53
be able to parse relative and FQDN links from root reliably (#138) 2023-07-05 14:40:54 -07:00
mplawner
3efe55a720
Added mbox support (#106)
* Update filetypes.py

Added mbox format

* Created new file

Added support for mbox files as used by many email services, including Google Takeout's Gmail archive.

* Update filetypes.py

* Update as_mbox.py
2023-06-25 18:11:05 -07:00
AntonioCiolino
a52b0ae655
Updated Link scraper to avoid NoneType error. (#90)
* Enable web scraping based on a url and a simple filter.

* ignore yarn

* Updated Link scraper to avoid NoneType error.
2023-06-19 12:07:26 -07:00
frasergr
4079020de0
dockerfile cleanup; enforce text LF line endings (#81) 2023-06-17 20:18:01 -07:00
AntonioCiolino
e7ba028497
Enable web scraping based on a url and a simple filter. (#73) 2023-06-16 17:29:11 -07:00