anything-llm

mirror of https://github.com/Mintplex-Labs/anything-llm.git synced 2024-11-19 04:30:10 +01:00

Author	SHA1	Message	Date
Sean Hatfield	612a7e1662	[FEAT] Website depth scraping data connector (#1191 ) * WIP website depth scraping, (sort of works) * website depth data connector stable + add maxLinks option * linting + loading small ui tweak * refactor website depth data connector for stability, speed, & readability * patch: remove console log Guard clause on URL validitiy check reasonable overrides --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2024-05-14 12:49:14 -07:00
jazelly	d71db22799	fix: skip undefined confluence pageContent (#1383 ) Refs: https://github.com/Mintplex-Labs/anything-llm/issues/1381 Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2024-05-14 10:22:13 -07:00
Predrag Stojadinović	78e3e35d27	[FEAT] Confluence Data Connector handles custom Confluence urls (#1362 ) * chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones * chore: formatting as per yarn lint	2024-05-14 10:21:04 -07:00
timothycarambat	2d215acb75	patch storage dirs for extensions	2024-05-02 14:03:10 -07:00
timothycarambat	1aa8e5766f	duplicate key (no impact)	2024-05-02 13:05:20 -07:00
Timothy Carambat	547d4859ef	Bump `openai` package to latest (#1234 ) * Bump `openai` package to latest Tested all except localai * bump LocalAI support with latest image * add deprecation notice * linting	2024-04-30 12:33:42 -07:00
Timothy Carambat	94017e2b51	bump langchain deps (#1231 ) * bump langchain deps * patch native and ollama providers remove deprecated deps --------- Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2024-04-30 12:04:24 -07:00
Sean Hatfield	348b36bf85	[FEAT] Confluence data connector (#1181 ) * WIP Confluence data connector backend * confluence data connector complete * confluence citations * fix citation for confluence * Patch confulence integration * fix Citation Icon for confluence --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-04-25 17:53:38 -07:00
Ken Kuang	a3b7239d05	Fix Cannot read properties of undefined (reading 'length') (#1145 ) Fix upload failed	2024-04-20 12:28:19 -07:00
Timothy Carambat	a5bb77f97a	Agent support for `@agent` default agent inside workspace chat (#1093 ) V1 of agent support via built-in `@agent` that can be invoked alongside normal workspace RAG chat.	2024-04-16 10:50:10 -07:00
Sean Hatfield	af84b01482	[FIX] GitHub repo with periods in link fix (#1084 ) fix periods in github repo links bug	2024-04-12 14:56:59 -07:00
Timothy Carambat	2c6135aa54	patch file types as plaintext (#1095 ) resolves #1089	2024-04-12 14:54:33 -07:00
Timothy Carambat	1f8ab0d245	Remove YoutubeLoader dependency (#1050 ) * WIP data connector redesign * new UI for data connectors complete * remove old data connector page/cleanup imports * cleanup of UI and imports * Remove Youtube Transcript dep and move in-house * lang pref default to en --------- Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2024-04-05 16:33:01 -07:00
timothycarambat	0b454016cf	patch comkey path to fallback	2024-04-04 10:47:26 -07:00
timothycarambat	e524afae9e	Merge branch 'master' of github.com:Mintplex-Labs/anything-llm	2024-04-02 14:30:27 -07:00
timothycarambat	117c3b2bfb	forgot epub file!	2024-04-02 14:30:20 -07:00
Timothy Carambat	4fb4aa2041	Add epub support for parsing (#1017 )	2024-04-02 14:25:52 -07:00
Timothy Carambat	752e3e22ed	Add more text file forced extensions (#1016 )	2024-04-02 14:13:11 -07:00
Timothy Carambat	f4088d9348	RSA-Signing on server<->collector communication via API (#1005 ) * WIP integrity check between processes * Implement integrity checking on document processor payloads	2024-04-01 13:56:35 -07:00
Sean Hatfield	45f50ce13c	[FIX] Update metadata tags in PDF collector script (#925 ) update title in pdf collector script to be the filename instead of metadata title	2024-03-19 18:14:34 -07:00
Timothy Carambat	0ada882991	Support external transcription providers (#909 ) * Support External Transcription providers * patch files * update docs * fix return data	2024-03-14 15:43:26 -07:00
Timothy Carambat	0f31e43fd4	bump YT metadata lib for YT api fix rot (#888 )	2024-03-11 10:57:53 -07:00
Timothy Carambat	ec90060d36	Re-map some file mimes to support text (#842 ) re-map some file mimes to support text	2024-02-29 10:05:03 -08:00
Timothy Carambat	6d18d79bb7	Generic upload fallback as text file. (#808 ) * Do not block any file upload fallback unknown/unsupported types to text if possible * reduce call for frontend * patch	2024-02-26 13:43:54 -08:00
Timothy Carambat	d89610586a	improve error messages from YT scraping (#768 ) parse & enforce URL to allow multiple URL schemas	2024-02-21 10:47:10 -08:00
Timothy Carambat	49fbd09af4	Support more plaintext filetypes (#757 ) * Add more plaintext document types org-mode, asciidoc, and reStructuredText are all text formats Signed-off-by: Christian Romney <christian.a.romney@gmail.com> * lint --------- Signed-off-by: Christian Romney <christian.a.romney@gmail.com> Co-authored-by: Christian Romney <christian.a.romney@gmail.com>	2024-02-19 10:44:01 -08:00
Timothy Carambat	d52f8aafd4	689 links in citation (#715 ) * Include links in citations force ChunkSource key to retain this information old links will be unsupported * show special icons depending on source * remove console log * reset server documents writeTo	2024-02-13 14:11:57 -08:00
Timothy Carambat	48cb8f2897	Add support to upload rawText document via api (#692 ) * Add support to upload rawText document via api * update API doc endpoint with correct textContent key * update response swagger doc	2024-02-07 15:17:32 -08:00
Sean Hatfield	288ff0d18c	fix vector cache not deleting cache after unembedding items with folders (#630 )	2024-01-22 13:03:05 -08:00
Timothy Carambat	0db6c3b2aa	Prevent private octets from link collection for self-hosted (#626 )	2024-01-19 10:49:40 -08:00
Timothy Carambat	b35feede87	570 document api return object (#608 ) * Add support for fetching single document in documents folder * Add document object to upload + support link scraping via API * hotfixes for documentation * update api docs	2024-01-16 16:04:22 -08:00
Timothy Carambat	1563a1b20f	Strict link protocol validation (#577 )	2024-01-11 12:29:00 -08:00
Timothy Carambat	58971e8b30	Build & Publish AnythingLLM for ARM64 and x86 (#549 ) * Update build process to support multi-platform builds Bump @lancedb/vectordb to 0.1.19 for ARM&AMD compatibility Patch puppeteer on ARM builds because of broken chromium resolves #539 resolves #548 --------- Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2024-01-08 16:15:01 -08:00
Francisco Bischoff	990a2e85bf	devcontainer v1 (#297 ) Implement support for GitHub codespaces and VSCode devcontainers --------- Co-authored-by: timothycarambat <rambat1010@gmail.com> Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>	2024-01-08 15:31:06 -08:00
timothycarambat	26549df6a9	touchup linting	2023-12-27 13:28:37 -08:00
timothycarambat	daadad3859	hoist var in extensions	2023-12-20 19:41:16 -08:00
Timothy Carambat	f2fadd6d2e	Add placeholder collector ENV file (#476 ) resolves #474	2023-12-19 13:27:09 -08:00
Timothy Carambat	ecf4295537	Add ability to grab youtube transcripts via doc processor (#470 ) * Add ability to grab youtube transcripts via doc processor * dynamic imports swap out Github for Youtube in placeholder text	2023-12-18 17:17:26 -08:00
Timothy Carambat	452582489e	GitHub loader extension + extension support v1 (#469 ) * feat: implement github repo loading fix: purge of folders fix: rendering of sub-files * noshow delete on custom-documents * Add API key support because of rate limits * WIP for frontend of data connectors * wip * Add frontend form for GitHub repo data connector * remove console.logs block custom-documents from being deleted * remove _meta unused arg * Add support for ignore pathing in request Ignore path input via tagging * Update hint	2023-12-18 15:48:02 -08:00
timothycarambat	d2e3506bb9	fix: transition on LLM and embedding screen linting	2023-12-15 12:40:11 -08:00
Timothy Carambat	61db981017	feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449 ) * feat: Embed on-instance Whisper model for audio/mp4 transcribing resolves #329 * additional logging * add placeholder for tmp folder in collector storage Add cleanup of hotdir and tmp on collector boot to prevent hanging files split loading of model and file conversion into concurrency * update README * update model size * update supported filetypes	2023-12-15 11:20:13 -08:00
Timothy Carambat	719521c307	Document Processor v2 (#442 ) * wip: init refactor of document processor to JS * add NodeJs PDF support * wip: partity with python processor feat: add pptx support * fix: forgot files * Remove python scripts totally * wip:update docker to boot new collector * add package.json support * update dockerfile for new build * update gitignore and linting * add more protections on file lookup * update package.json * test build * update docker commands to use cap-add=SYS_ADMIN so web scraper can run update all scripts to reflect this remove docker build for branch	2023-12-14 15:14:56 -08:00
Timothy Carambat	da0cec7aa2	patch: remove unidecode as it was transliterating non-latin chars (#434 ) resolves #298	2023-12-13 11:54:55 -08:00
Timothy Carambat	ce9233c258	feat: enable HTML uploads from UI (#422 ) resolves #418	2023-12-11 14:40:33 -08:00
timothycarambat	b583aa74fd	remove prints	2023-11-16 17:17:52 -08:00
Sean Hatfield	7edfccaf9a	Adding url uploads to document picker (#375 ) * WIP adding url uploads to document picker * fix manual script for uploading url to custom-documents * fix metadata for url scraping * wip url parsing * update how async link scraping works * docker-compose defaults added no autocomplete on URLs --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2023-11-16 17:15:01 -08:00
Sean Hatfield	f40309cfdb	Add id to all metadata to prevent errors in frontend document picker (#378 ) add id to all metadata to prevent errors in frontend docuemnt picker Co-authored-by: timothycarambat <rambat1010@gmail.com>	2023-11-16 14:36:26 -08:00
timothycarambat	1e3d82e184	patch collector script	2023-11-16 10:25:23 -08:00
timothycarambat	c5dc68633b	patch link scrape tool schema	2023-11-14 16:41:39 -08:00
Timothy Carambat	5441717294	normalize parser struct for all file types (#321 )	2023-11-01 16:44:02 -07:00
Francisco Bischoff	26dba59249	mbox parsing improvements v1 (#308 ) * mbox parsing improvements v1 * autobots roll out!	2023-10-30 11:57:33 -07:00
Timothy Carambat	18798c5b64	prevent deletion of documents not in hotdir via director traversal (#258 ) resolves #257	2023-09-29 11:04:47 -07:00
Timothy Carambat	a505928934	Display better error messages from document processor (#243 ) pass messages to frontend on success/failure resolves #242	2023-09-18 16:50:20 -07:00
Timothy Carambat	3e78476739	Franzbischoff document improvements (#241 ) * cosmetic changes to be compatible to hadolint * common configuration for most editors until better plugins comes up * Changes on PDF metadata, using PyMuPDF (faster and more compatible) * small changes on other file ingestions in order to try to keep the fields equal * Lint, review, and review * fixed unknown chars * Use PyMuPDF for pdf loading for 200% speed increase linting --------- Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com> Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>	2023-09-18 16:21:37 -07:00
Melroy van den Berg	16b8330fbf	Update requirements.txt (#185 ) Upgrade fake-useragent to latest version (v1.2.1). Disclaimer: I'm the package maintainer.	2023-08-14 14:38:14 -07:00
Timothy Carambat	b42493c6de	Split large PDFS into subfolder in documents (#176 ) append time value to folder name to prevent duplicate uploads	2023-08-03 18:57:50 -07:00
AntonioCiolino	31e5db7490	Twitter Feature (#134 ) * . * twitter feature update * Key validation and operation	2023-07-06 14:05:50 -07:00
Timothy Carambat	d7315b0e53	be able to parse relative and FQDN links from root reliabily (#138 )	2023-07-05 14:40:54 -07:00
mplawner	3efe55a720	Added mbox support (#106 ) * Update filetypes.py Added mbox format * Created new file Added support for mbox files as used by many email services, including Google Takeout's Gmail archive. * Update filetypes.py * Update as_mbox.py	2023-06-25 18:11:05 -07:00
AntonioCiolino	a52b0ae655	Updated Link scraper to avoid NoneType error. (#90 ) * Enable web scraping based on a urtl and a simple filter. * ignore yarn * Updated Link scraper to avoid NoneType error.	2023-06-19 12:07:26 -07:00
frasergr	4079020de0	dockerfile cleanup; enforce text LF line endings (#81 )	2023-06-17 20:18:01 -07:00
AntonioCiolino	e7ba028497	Enable web scraping based on a urtl and a simple filter. (#73 )	2023-06-16 17:29:11 -07:00
timothycarambat	81b2159329	reorder docs	2023-06-16 17:26:42 -07:00
Timothy Carambat	c4eb46ca19	Upload and process documents via UI + document processor in docker image (#65 ) * implement dnd uploader show file upload progress write files to hotdirector build simple flaskAPI to process files one off * move document processor calls to util build out dockerfile to run both procs at the same time update UI to check for document processor before upload * disable pragma update on boot * dockerfile changes * add filetype restrictions based on python app support response and show rejected files in the UI * cleanup * stub migrations on boot to prevent exit condition * update CF template for AWS deploy	2023-06-16 16:01:27 -07:00
AntonioCiolino	537a6a91d2	Update __HOTDIR__.md (#70 ) fixed typo for text.	2023-06-16 11:17:18 -07:00
Skid Vis	4118c9dcf3	Blocks images in sitemaps from being parsed. (#56 ) * Adds ability to import sitemaps to include a website * adds example sitemap url * adds filter to bypass common image formats * moves filetype ignoring to sitemap script	2023-06-14 23:00:03 -07:00
Skid Vis	bd32f97a21	Adds ability to import sitemaps to include a website (#51 ) * Adds ability to import sitemaps to include a website * adds example sitemap url	2023-06-14 11:04:17 -07:00
frasergr	9f33b3dfcb	Docker support (#34 ) * Updates for Linux for frontend/server * frontend/server docker * updated Dockerfile for deps related to node vectordb * updates for collector in docker * docker deps for ODT processing * ignore another collector dir * storage mount improvements; run as UID * fix pypandoc version typo * permissions fixes	2023-06-13 11:26:11 -07:00
Fabio	d954d7a3d5	Fix pypandoc issue in requirements.txt (#23 ) Co-authored-by: Carvalho, Fabio <Fabio_Carvalho@comcast.com>	2023-06-12 11:21:11 -07:00
timothycarambat	728eaff773	fix typo	2023-06-09 11:23:53 -07:00
timothycarambat	27c58541bd	inital commit ⚡	2023-06-03 19:28:07 -07:00

121 Commits