searxng/searx/engines/recoll.py

# SPDX-License-Identifier: AGPL-3.0-or-later
""".. sidebar:: info

   - `Recoll <https://www.lesbonscomptes.com/recoll/>`_
   - `recoll-webui <https://framagit.org/medoc92/recollwebui.git>`_
   - :origin:`searx/engines/recoll.py`

Recoll_ is a desktop full-text search tool based on Xapian.  By itself Recoll_
does not offer web or API access; this can be achieved using recoll-webui_.

Configuration
=============

You must configure the following settings:

``base_url``:
  Location where recoll-webui can be reached.

``mount_prefix``:
  Location where the file hierarchy is mounted on your *local* filesystem.

``dl_prefix``:
  Location where the file hierarchy as indexed by recoll can be reached.

``search_dir``:
  Part of the indexed file hierarchy to be searched; if empty, the full domain
  is searched.
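
A minimal sketch of how ``base_url`` and ``search_dir`` end up in the request
sent to recoll-webui, mirroring the query string assembled in ``request()``
below.  The helper name ``build_search_url`` and the sample values are
illustrative assumptions, not part of the engine:

.. code:: python

   from urllib.parse import urlencode


   def build_search_url(base_url, query, pageno=1, after='', search_dir=''):
       # hypothetical helper: same parameters as the engine's request()
       args = urlencode({'query': query, 'page': pageno, 'after': after, 'dir': search_dir})
       # base_url is expected to end with a slash, e.g. https://recoll.example.org/
       return base_url + 'json?' + args + '&highlight=0'


   print(build_search_url('https://recoll.example.org/', 'invoice 2023'))
   # https://recoll.example.org/json?query=invoice+2023&page=1&after=&dir=&highlight=0

Because the path is appended by plain string concatenation, a ``base_url``
without a trailing slash would yield a malformed URL.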

Example
=======

Scenario:

#. Recoll indexes a local filesystem mounted in ``/export/documents/reference``,
#. the Recoll search interface can be reached at https://recoll.example.org/ and
#. the contents of this filesystem can be reached through https://download.example.org/reference

.. code:: yaml

   base_url: https://recoll.example.org/
   mount_prefix: /export/documents
   dl_prefix: https://download.example.org
   search_dir: ''
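
With these settings, a ``file://`` URL reported by recoll-webui is rewritten
into a download URL by the prefix substitution used in ``response()``.  The
sketch below repeats that substitution in isolation; the helper name
``to_download_url`` and the file ``manual.pdf`` are hypothetical:

.. code:: python

   def to_download_url(file_url, mount_prefix, dl_prefix):
       # replace the local 'file://<mount_prefix>' part by the download prefix
       return file_url.replace('file://' + mount_prefix, dl_prefix)


   url = to_download_url(
       'file:///export/documents/reference/manual.pdf',
       '/export/documents',
       'https://download.example.org',
   )
   print(url)
   # https://download.example.org/reference/manual.pdf

Note that ``mount_prefix`` is the part of the path that is *not* repeated in
the download URL, which is why it is ``/export/documents`` rather than
``/export/documents/reference`` in this scenario.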

Implementations
===============

"""

from datetime import date, timedelta
from json import loads
from urllib.parse import urlencode, quote

# about
about = {
    "website": None,
    "wikidata_id": 'Q15735774',
    "official_api_documentation": 'https://www.lesbonscomptes.com/recoll/',
    "use_official_api": True,
    "require_api_key": False,
    "results": 'JSON',
}

# engine dependent config
paging = True
time_range_support = True

# parameters from settings.yml
base_url = None
search_dir = ''
mount_prefix = None
dl_prefix = None

# embedded
embedded_url = '<{ttype} controls height="166px" src="{url}" type="{mtype}"></{ttype}>'


# helper functions
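def get_time_range(time_range):
    """Return the ISO date lying ``time_range`` (day, week, month, year) before
    today, or an empty string if no known time range is selected."""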
    sw = {'day': 1, 'week': 7, 'month': 30, 'year': 365}  # pylint: disable=invalid-name

    offset = sw.get(time_range, 0)
    if not offset:
        return ''

    return (date.today() - timedelta(days=offset)).isoformat()


# do search-request
def request(query, params):
    search_after = get_time_range(params['time_range'])
    search_url = base_url + 'json?{query}&highlight=0'
    params['url'] = search_url.format(
        query=urlencode({'query': query, 'page': params['pageno'], 'after': search_after, 'dir': search_dir})
    )

    return params


# get response from search-request
def response(resp):
    results = []

    response_json = loads(resp.text)

    if not response_json:
        return []

    for result in response_json.get('results', []):
        title = result['label']
        url = result['url'].replace('file://' + mount_prefix, dl_prefix)
        content = '{}'.format(result['snippet'])

        # append result
        item = {'url': url, 'title': title, 'content': content, 'template': 'files.html'}

        # fields may be missing or empty in the JSON, avoid a KeyError
        if result.get('size'):
            item['size'] = int(result['size'])

        for parameter in ['filename', 'abstract', 'author', 'mtype', 'time']:
            if result.get(parameter):
                item[parameter] = result[parameter]

        # facilitate preview support for known mime types
        if '/' in (result.get('mtype') or ''):
            (mtype, subtype) = result['mtype'].split('/', 1)
            item['mtype'] = mtype
            item['subtype'] = subtype

            if mtype in ['audio', 'video']:
                item['embedded'] = embedded_url.format(
                    ttype=mtype, url=quote(url.encode('utf8'), '/:'), mtype=result['mtype']
                )

            if mtype in ['image'] and subtype in ['bmp', 'gif', 'jpeg', 'png']:
                item['img_src'] = url

        results.append(item)

    if 'nres' in response_json:
        results.append({'number_of_results': response_json['nres']})

    return results