anything-llm/collector/scripts/watch/convert/as_text.py

import os
from slugify import slugify
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
from ...utils import tokenize

# Process all text-related documents.
def as_text(**kwargs):
  parent_dir = kwargs.get('directory', 'hotdir')
  filename = kwargs.get('filename')
  ext = kwargs.get('ext', '.txt')
  remove = kwargs.get('remove_on_complete', False)
  fullpath = f"{parent_dir}/{filename}{ext}"
  # Read inside a context manager so the file handle is always closed.
  with open(fullpath) as f:
    content = f.read()

  print(f"-- Working {fullpath} --")
  data = {
    'id': guid(),
    'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
    'title': f"{filename}{ext}",
    'docAuthor': 'Unknown', # TODO: Find a better author
    'description': 'Unknown', # TODO: Find a better description
    'chunkSource': f"{filename}{ext}",
    'published': file_creation_time(fullpath),
    'wordCount': len(content.split()), # whitespace-delimited words; len(content) would count characters
    'pageContent': content,
    'token_count_estimate': len(tokenize(content))
  }

  write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
  move_source(parent_dir, f"{filename}{ext}", remove=remove)
  print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
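
A minimal, self-contained sketch of the metadata step above, runnable outside the collector package. It is an illustration under stated assumptions, not the project's implementation: `build_text_metadata` is a hypothetical name, `uuid.uuid4()` stands in for `guid()`, and the `published` value is only a placeholder because the output format of `file_creation_time` is not shown in this file. The token estimate and the write/move steps are left out since they depend on the package's own helpers.

```python
import os
import tempfile
import uuid
from datetime import datetime


def build_text_metadata(directory, filename, ext=".txt"):
    """Hypothetical stand-in for the dict assembled inside as_text."""
    fullpath = os.path.join(directory, f"{filename}{ext}")
    with open(fullpath, encoding="utf-8") as f:
        content = f.read()

    processed_path = os.path.join(directory, "processed", f"{filename}{ext}")
    return {
        "id": str(uuid.uuid4()),  # assumption: stand-in for guid()
        "url": "file://" + os.path.abspath(processed_path),
        "title": f"{filename}{ext}",
        "chunkSource": f"{filename}{ext}",
        # assumption: placeholder for file_creation_time(fullpath)
        "published": datetime.fromtimestamp(os.path.getctime(fullpath)).isoformat(),
        "charCount": len(content),          # number of characters
        "wordCount": len(content.split()),  # number of whitespace-delimited words
        "pageContent": content,
    }


# Usage: write a small file into a temporary "hot directory" and inspect the result.
with tempfile.TemporaryDirectory() as hotdir:
    with open(os.path.join(hotdir, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("hello collector world")
    meta = build_text_metadata(hotdir, "notes")
    print(meta["title"], meta["charCount"], meta["wordCount"])
```

Keeping the character count and the word count as separate fields makes the difference between `len(content)` and `len(content.split())` explicit.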
# Commit history for this file (from the blame view):
# - 2023-06-04 04:28:07 +02:00: "inital commit ⚡" (imports, argument parsing, file read, most of the metadata dict, write_to_server_documents call)
# - 2023-06-17 01:01:27 +02:00: "Upload and process documents via UI + document processor in docker image (#65)" (the remove_on_complete option and the move_source call)
# - 2023-09-19 01:21:37 +02:00: "Franzbischoff document improvements (#241)", co-authored by Francisco Bischoff (the 'id', 'docAuthor', 'description', and 'chunkSource' fields and the success message)