mirror of
https://github.com/Stirling-Tools/Stirling-PDF.git
synced 2024-11-05 15:30:11 +01:00
52 lines
2.0 KiB
Markdown
52 lines
2.0 KiB
Markdown
# OCR Language Packs and Setup
|
|
|
|
This document provides instructions on how to add additional language packs for the OCR tab in Stirling-PDF, both inside and outside of Docker.
|
|
|
|
## How does the OCR Work
|
|
Stirling-PDF uses OCRmyPDF which in turn uses tesseract for its text recognition.
|
|
All credit goes to them for this awesome work!
|
|
|
|
## Language Packs
|
|
|
|
Tesseract OCR supports a variety of languages. You can find additional language packs in the Tesseract GitHub repositories:
|
|
|
|
- [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast): These language packs are smaller and faster to load, but may provide lower recognition accuracy.
|
|
- [tessdata](https://github.com/tesseract-ocr/tessdata): These language packs are larger and provide better recognition accuracy, but may take longer to load.
|
|
|
|
Depending on your requirements, you can choose the appropriate language pack for your use case. By default Stirling-PDF uses the tessdata_fast eng but this can be replaced.
|
|
|
|
### Installing Language Packs
|
|
|
|
1. Download the desired language pack(s) by selecting the `.traineddata` file(s) for the language(s) you need.
|
|
2. Place the `.traineddata` files in the Tesseract tessdata directory: `/usr/share/tesseract-ocr/4.00/tessdata`
|
|
|
|
# DO NOT REMOVE EXISTING ENG.TRAINEDDATA, ITS REQUIRED.
|
|
|
|
#### Docker
|
|
|
|
If you are using Docker, you need to expose the Tesseract tessdata directory as a volume in order to use the additional language packs.
|
|
#### Docker Compose
|
|
Modify your `docker-compose.yml` file to include the following volume configuration:
|
|
|
|
|
|
```yaml
|
|
services:
|
|
your_service_name:
|
|
image: your_docker_image_name
|
|
volumes:
|
|
- /location/of/trainingData:/usr/share/tesseract-ocr/4.00/tessdata
|
|
```
|
|
|
|
|
|
#### Docker run
|
|
Add the following to your existing docker run command
|
|
```bash
|
|
-v /location/of/trainingData:/usr/share/tesseract-ocr/4.00/tessdata
|
|
```
|
|
|
|
#### Non-Docker
|
|
If you are not using Docker, you need to install the OCR components, including the ocrmypdf app.
|
|
You can see [OCRmyPDF install guide](https://ocrmypdf.readthedocs.io/en/latest/installation.html)
|
|
|
|
|