Using repoze.catalog

Goals

  • Index and search content using repoze.catalog, a Python indexing and search tool based on ZODB.

Warning

This will likely only work on systems that have C compilation tools installed (for example Xcode on macOS, or a typical Linux setup) or on Windows systems. If you can’t get repoze.catalog installed properly, you may need to pair up with someone who can.

Objectives

  • Install repoze.catalog.
  • Index the title and content attributes of content we add to the system into fulltext indices.
  • Search for, and find, content we’ve added to the system using fulltext queries.

Steps

  1. easy_install repoze.catalog

  2. mkdir catalog

  3. Copy the following into zodb/application.py:

    from wsgiref.simple_server import make_server
    
    from pyramid.config import Configurator
    from pyramid_zodbconn import get_connection
    
    from resources import bootstrap
    
    
    def root_factory(request):
        conn = get_connection(request)
        return bootstrap(conn.root())
    
    
    def main():
        settings = {"zodbconn.uri": "file://Data.fs"}
        config = Configurator(root_factory=root_factory, settings=settings)
        config.include("pyramid_zodbconn")
        config.include("pyramid_tm")
        config.add_static_view('static', 'deform:static')
        config.scan("views")
        app = config.make_wsgi_app()
        return app
    
    
    if __name__ == '__main__':
        app = main()
        server = make_server(host='0.0.0.0', port=8080, app=app)
        server.serve_forever()
    
  4. Copy the following into zodb/views.py:

    from pyramid.httpexceptions import HTTPFound
    from pyramid.view import view_config
    from pyramid.traversal import resource_path
    from pyramid.traversal import find_resource
    from pyramid.renderers import render_to_response
    
    import colander
    from deform import Form
    from deform.widget import TextAreaWidget
    
    from resources import Folder
    from resources import Document
    
    
    class FolderSchema(colander.Schema):
        title = colander.SchemaNode(colander.String())
    
    
    class DocumentSchema(colander.Schema):
        title = colander.SchemaNode(colander.String())
        content = colander.SchemaNode(colander.String(), widget=TextAreaWidget())
    
    class SearchSchema(colander.Schema):
        term = colander.SchemaNode(colander.String())
    
    class ProjectorViews(object):
        def __init__(self, context, request):
            self.context = context
            self.request = request
            self.root = self.request.root
            self.catalog = self.root.catalog
            self.document_map = self.root.document_map
    
        @view_config(renderer="templates/folder_view.pt")
        def folder_view(self):
            schema = SearchSchema()
            form = Form(schema, buttons=('submit',))
            if 'submit' in self.request.POST:
                term = self.request.POST['term']
                # Build a CQE string; a term containing a quote character
                # would produce a malformed query here.
                query = "'%s' in title or '%s' in content" % (term, term)
                num, results = self.catalog.query(query)
                results = [self.document_map.address_for_docid(result)
                           for result in results]
                results = [find_resource(self.root, result)
                          for result in results]
                values = {'num': num,
                          'results':results,
                          'request':self.request,
                          'context':self.context,
                          'term':term}
                return render_to_response('templates/search.pt', values)
            return {"search_form": form.render()}
    
        @view_config(name="add_folder", context=Folder, renderer="templates/form.pt")
        def add_folder(self):
            schema = FolderSchema()
            form = Form(schema, buttons=('submit',))
            if 'submit' in self.request.POST:
                # Make a new Folder
                title = self.request.POST['title']
                doc_id = self.document_map.new_docid()
                name = "folder%s" % doc_id
                new_folder = Folder(title)
                new_folder.__name__ = name
                new_folder.__parent__ = self.context
                self.context[name] = new_folder
                # map object path to catalog id
                path = resource_path(new_folder)
                self.document_map.add(path, doc_id) 
                # index new folder
                self.catalog.index_doc(doc_id, new_folder)
                # Redirect to the new folder
                url = self.request.resource_url(new_folder)
                return HTTPFound(location=url)
            return {"form": form.render()}
    
        @view_config(name="add_document", context=Folder, renderer="templates/form.pt")
        def add_document(self):
            schema = DocumentSchema()
            form = Form(schema, buttons=('submit',))
            if 'submit' in self.request.POST:
                # Make a new Document
                title = self.request.POST['title']
                content = self.request.POST['content']
                doc_id = self.document_map.new_docid()
                name = "document%s" % doc_id
                new_document = Document(title, content)
                new_document.__name__ = name
                new_document.__parent__ = self.context
                self.context[name] = new_document
                # map object path to catalog id
                path = resource_path(new_document)
                self.document_map.add(path, doc_id)
                # index new document
                self.catalog.index_doc(doc_id, new_document)
                # Redirect to the new document
                url = self.request.resource_url(new_document)
                return HTTPFound(location=url)
            return {"form": form.render()}
    
        @view_config(renderer="templates/document_view.pt",
                     context=Document)
        def document_view(self):
            return {}
    
  5. Copy the following into zodb/resources.py:

    from persistent import Persistent
    from persistent.mapping import PersistentMapping
    
    from repoze.catalog.indexes.text import CatalogTextIndex
    from repoze.catalog.catalog import Catalog
    from repoze.catalog.document import DocumentMap
    
    
    class Folder(PersistentMapping):
        def __init__(self, title):
            super(Folder, self).__init__()
            self.title = title
    
    
    class SiteFolder(Folder):
        __name__ = None
        __parent__ = None
    
    
    class Document(Persistent):
        def __init__(self, title, content):
            self.title = title
            self.content = content
    
    
    def bootstrap(zodb_root):
        if 'projector' not in zodb_root:
            # add site folder
            root = SiteFolder('Projector Site')
            zodb_root['projector'] = root
            # add catalog and document map
            catalog = Catalog()
            catalog['title'] = CatalogTextIndex('title')
            catalog['content'] = CatalogTextIndex('content')
            root.catalog = catalog
            document_map = DocumentMap()
            root.document_map = document_map
        return zodb_root['projector']
    
  6. $ python application.py

  7. Open http://127.0.0.1:8080/ in your browser.

  8. Add folders and documents.

Extra Credit

  • Add another attribute to documents and folders named age (an integer) and use a field index (repoze.catalog.indexes.field.CatalogFieldIndex) to index and search the age of new documents. See http://docs.repoze.org/catalog/
  • Change the query = line in folder_view so that it ignores what’s in title and only cares about what’s in content.
  • Unindex a document.

Analysis

We made no changes to application.py.

resources.py

Note the imports of Catalog, CatalogTextIndex and DocumentMap from repoze.catalog.

We create a catalog and two text indexes for title and content attributes.

We add the catalog inside the site folder. We also add a document map, which helps us map actual content to catalog ids.
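To make the role of the document map concrete, here is a minimal plain-Python sketch of the idea: a two-way mapping between integer docids and content addresses (paths). ToyDocumentMap is a hypothetical stand-in written for this illustration; it only mirrors the method names used in this step and is not how repoze.catalog implements DocumentMap (for instance, the real class may choose docids differently).

```python
class ToyDocumentMap:
    """Simplified, hypothetical stand-in for a document map (illustration only)."""

    def __init__(self):
        self._next_docid = 1
        self.docid_to_address = {}
        self.address_to_docid = {}

    def new_docid(self):
        # Hand out a fresh integer id. A simple counter is an assumption
        # made for this sketch; the real DocumentMap may pick ids differently.
        docid = self._next_docid
        self._next_docid += 1
        return docid

    def add(self, address, docid):
        # Record the mapping in both directions.
        self.docid_to_address[docid] = address
        self.address_to_docid[address] = docid

    def address_for_docid(self, docid):
        return self.docid_to_address.get(docid)

    def docid_for_address(self, address):
        return self.address_to_docid.get(address)


document_map = ToyDocumentMap()
doc_id = document_map.new_docid()
document_map.add("/folder1/document2", doc_id)
print(document_map.address_for_docid(doc_id))
```

The catalog itself only ever deals in integer docids; the map is what lets us get from a search hit back to a location in the site.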

views.py

On the add_folder and add_document views, we now index the new object and add it to the document map. We use the content item’s path on the site as its address in the map.

To obtain a new docid for the catalog, we use document_map.new_docid().

The path is obtained using the pyramid.traversal.resource_path() call.

After that we can index with catalog.index_doc().
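As a rough mental model of what indexing a document means, the following sketch builds a tiny inverted index over a single attribute in plain Python. ToyTextIndex and Doc are hypothetical names invented for this example; a real text index also handles things this sketch ignores, such as relevance ranking. Skipping objects that lack the attribute is an assumption of the sketch, made because the tutorial's folders have no content attribute.

```python
import re
from collections import defaultdict


class ToyTextIndex:
    """Illustrative inverted index over one attribute (not repoze.catalog code)."""

    def __init__(self, attr_name):
        self.attr_name = attr_name
        self.postings = defaultdict(set)  # word -> set of docids

    def index_doc(self, docid, obj):
        text = getattr(obj, self.attr_name, None)
        if text is None:
            return  # objects without the attribute are skipped in this sketch
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(docid)

    def unindex_doc(self, docid):
        # Forget the docid everywhere it was recorded.
        for docids in self.postings.values():
            docids.discard(docid)

    def search(self, word):
        return set(self.postings.get(word.lower(), ()))


class Doc:
    def __init__(self, title, content):
        self.title = title
        self.content = content


title_index = ToyTextIndex("title")
content_index = ToyTextIndex("content")
docs = {
    1: Doc("Pyramid notes", "Traversal and views"),
    2: Doc("Shopping", "Buy a pyramid poster"),
}
for docid, doc in docs.items():
    title_index.index_doc(docid, doc)
    content_index.index_doc(docid, doc)

# "in title or in content" becomes a set union of the two lookups.
hits = title_index.search("pyramid") | content_index.search("pyramid")
print(sorted(hits))
```

Note that the index stores only docids, never the objects themselves, which is why the views also need the document map.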

The search view makes a query for all content with the search term either in the title or the content of all catalogued items.

The two list comprehensions over results that follow turn each docid into a path via the document map, and then each path into the actual object via find_resource.

Note the use of render_to_response to render the search template instead of the one configured for this view.
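The two-step lookup can be sketched without Pyramid or ZODB. In the example below, find_by_path is a hypothetical helper that plays the role of find_resource for a tree of plain nested dicts, and docid_to_path stands in for the document map; both are assumptions made for illustration.

```python
def find_by_path(root, path):
    """Walk a tree of nested mappings along a slash-separated path."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # an empty name means the path was just "/"
            node = node[name]
    return node


# A toy site: one folder containing one document.
root = {"folder1": {"document2": "doc object"}}

# Stand-in for the document map: docid -> path.
docid_to_path = {1: "/folder1", 2: "/folder1/document2"}

docids = [2]  # pretend the catalog query returned this
paths = [docid_to_path[docid] for docid in docids]
objects = [find_by_path(root, path) for path in paths]
print(objects)
```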

Discussion

  • repoze.catalog uses ZODB under the hood but isn’t only for applications that use ZODB for business data storage. It can be used like Lucene or Xapian.
  • The query value 'foo' in title or 'foo' in content is a “CQE” (catalog query expression). This is a declarative query system, not unlike SQL (but less expressive).
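Because the query is assembled by string interpolation, a search term containing a quote character yields a malformed expression. The sketch below shows one possible way to build the string more carefully, under the assumption that CQE string literals follow Python's quoting rules. The function names are hypothetical, and ast.parse is used here only as a stand-in syntax check, not as the catalog's own parser.

```python
import ast


def naive_query(term):
    # Same interpolation as in folder_view.
    return "'%s' in title or '%s' in content" % (term, term)


def quoted_query(term):
    # repr() produces a properly quoted and escaped Python string literal.
    literal = repr(term)
    return "%s in title or %s in content" % (literal, literal)


def is_valid_expression(expr):
    """Return True if expr parses as a single Python expression."""
    try:
        ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    return True


print(is_valid_expression(naive_query("it's")))
print(is_valid_expression(quoted_query("it's")))
```

For a plain term such as foo, both helpers produce the same string, so the simple version only misbehaves on terms that contain quotes.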
