Commit Graph

20 Commits

Author SHA1 Message Date
Sean Hatfield
b658f5012d
Support XLSX files (#2403)
* support xlsx files

* lint

* create separate docs for each xlsx sheet

* lint

* use node-xlsx pkg for parsing xlsx files

* lint

* update error handling

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-03 13:45:23 -07:00
Timothy Carambat
30645831a1
1959 filetype filters (#2378)
* Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly.

* add @langchain/community to collector package.json

* fix: Improve handling of complex ignore patterns in GitLabRepoLoader

* refactor: use ignore package for simplified ignore logic

* run yarn lint

* add @langchain/community@^0.2.23

* remove unused dep
lint

---------

Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>
2024-09-26 12:50:35 -07:00
Timothy Carambat
04a0fc4ec9
Remove unused deps (#1938)
* Remove unused deps

* improve dependency
2024-07-25 10:21:03 -07:00
Timothy Carambat
42235fcd8a
GitLab Hosted and Local Connector (#1932)
* Add support for GitLab repo collection as well as GitHub repo collection
* Refactor for repo collectors to be more compact

---------

Co-authored-by: Emil Rofors <emirof@gmail.com>
2024-07-23 12:23:51 -07:00
Sean Hatfield
79656718b2
[FEAT] Create custom pdfloader (#1852)
* implement custom PDFLoader to remove LC dep

* remove unneeded comment

* remove pdfjs as dep and fix page splitting using pdf-parse

* linting + export rename for desktop compat

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-11 12:26:11 -07:00
Timothy Carambat
29c9eeaa5c
Add winston logging for production (#1811)
2024-07-03 16:39:33 -07:00
Sean Hatfield
a87014822a
[REFACTOR] Improve asPDF collector processor with pdfjs (#1791)
* WIP replace langchain pdfloader with pdfjs and add more context to each page

* remove extras from pdfjs and just replace langchain library

* remove unneeded dep

* fix console log in docs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-03 14:26:48 -07:00
Timothy Carambat
98cef508a6
Feature/devcontv2 (#1622)
* Updated apt-packages source for devcontainer

Switched the devcontainer's package source to a different repository to
align with updated dependencies and package availability. The previous
source from 'rocker-org' is replaced with 'devcontainers-contrib', which
may offer more recent or relevant development tools.

* Centralize prettier ignores and refine config

Centralized all prettier ignore rules by removing individual
`.prettierignore` files in subprojects and updating the root
`.prettierignore` to include previously ignored patterns, ensuring
consistency across the workspace. Additionally, the prettier
configuration was refined by making the file pattern for `.config.js`
files consistent and adjusting quote styles for better readability. All
lint scripts across the project were updated to respect the centralized
ignore path, enhancing maintainability.

The consolidation simplifies the process of managing ignore rules as the
project scales, ensuring developers can focus on writing code without
worrying about divergent formatting standards. These changes also align
with introducing comprehensive linting across multiple environments to
keep the codebase clean and consistent.

This adjustment is a foundational step towards a more streamlined and
unified code base, making it easier for new contributors to adhere to
established coding standards and reducing the cognitive load associated
with managing multiple configuration files across the project.

* unset package json changes

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2024-06-06 12:50:42 -07:00
Timothy Carambat
547d4859ef
Bump openai package to latest (#1234)
* Bump `openai` package to latest
Tested all except localai

* bump LocalAI support with latest image

* add deprecation notice

* linting
2024-04-30 12:33:42 -07:00
Timothy Carambat
94017e2b51
bump langchain deps (#1231)
* bump langchain deps

* patch native and ollama providers remove deprecated deps

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-30 12:04:24 -07:00
Sean Hatfield
348b36bf85
[FEAT] Confluence data connector (#1181)
* WIP Confluence data connector backend

* confluence data connector complete

* confluence citations

* fix citation for confluence

* Patch Confluence integration

* fix Citation Icon for confluence

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-04-25 17:53:38 -07:00
Timothy Carambat
1f8ab0d245
Remove YoutubeLoader dependency (#1050)
* WIP data connector redesign

* new UI for data connectors complete

* remove old data connector page/cleanup imports

* cleanup of UI and imports

* Remove Youtube Transcript dep and move in-house

* lang pref default to en

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-05 16:33:01 -07:00
Timothy Carambat
4fb4aa2041
Add epub support for parsing (#1017)
2024-04-02 14:25:52 -07:00
Timothy Carambat
0ada882991
Support external transcription providers (#909)
* Support External Transcription providers

* patch files

* update docs

* fix return data
2024-03-14 15:43:26 -07:00
Timothy Carambat
0f31e43fd4
bump YT metadata lib for YT api fix rot (#888)
2024-03-11 10:57:53 -07:00
Timothy Carambat
58971e8b30
Build & Publish AnythingLLM for ARM64 and x86 (#549)
* Update build process to support multi-platform builds
Bump @lancedb/vectordb to 0.1.19 for ARM&AMD compatibility
Patch puppeteer on ARM builds because of broken chromium
resolves #539
resolves #548

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-01-08 16:15:01 -08:00
Timothy Carambat
ecf4295537
Add ability to grab youtube transcripts via doc processor (#470)
* Add ability to grab youtube transcripts via doc processor

* dynamic imports
swap out Github for Youtube in placeholder text
2023-12-18 17:17:26 -08:00
Timothy Carambat
452582489e
GitHub loader extension + extension support v1 (#469)
* feat: implement github repo loading
fix: purge of folders
fix: rendering of sub-files

* noshow delete on custom-documents

* Add API key support because of rate limits

* WIP for frontend of data connectors

* wip

* Add frontend form for GitHub repo data connector

* remove console.logs
block custom-documents from being deleted

* remove _meta unused arg

* Add support for ignore pathing in request
Ignore path input via tagging

* Update hint
2023-12-18 15:48:02 -08:00
Timothy Carambat
61db981017
feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449)
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329

* additional logging

* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
split loading of model and file conversion into concurrency

* update README

* update model size

* update supported filetypes
2023-12-15 11:20:13 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip:update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00