<img src="https://codimd.web.cern.ch/uploads/upload_f7c3b7e8c07d5e5907f1bcf36360335d.png" style="background:none; border:none; box-shadow:none;">

<p style="font-size: 60%; color: white; margin-top: 0px;">www.cern.ch</p>

---

### CERN Search as a Service

&nbsp;

#### Pablo Panero

---

### Search as a Service

1. Current search
2. Requirements of the solution
3. Proposed solution / Current status
4. How it is built
    - 4.1. Elasticsearch
    - 4.2. PostgreSQL
    - 4.3. OpenShift
    - 4.4. Changes to Invenio modules
5. Crawler/Indexer

---

### Current search

- MS Search
    - Black box
    - Does not cover all our use cases
    - Maintenance / Operational costs
    - Vendor lock-in
- Need a more ad hoc solution

---

### Requirements of the solution: The "Why"

- Customize user documents' schemas
- Customize user search patterns
- Improve:
    - Search and indexing performance/time (~real-time)
    - User-centered BI capabilities (reporting, statistics...)
    - Maintenance, disaster recovery, etc.
    - Access control granularity

---

### Proposed solution / Current status: The "What"

- Multiple instances: Indico, Web services and TWiki. EDMS and Drupal coming soon.
- One global "CERN Search" instance that will gather all the public/allowed content

---

### Proposed solution / Current status: The "What"

* Two different RESTful APIs
    - Search API: CRUD operations on documents
    - Binary API: Parsing binary documents (PDF, Word, etc.)
* [Service documentation](http://cernsearchdocs.web.cern.ch/cernsearchdocs/)

---

### Search as a Service: The "How"

* Build a platform to offer search as a **service**
    - Elasticsearch
    - PostgreSQL
    - Redis
    - RabbitMQ
    - OpenShift
    - Python (Invenio)

[GitLab repo link](https://gitlab.cern.ch/webservices/cern-search/cern-search-rest-api)

---

### Elasticsearch

- Central Service
- Index prefixes handled via folders ([PR invenio-search](https://github.com/inveniosoftware/invenio-search/pull/145))
    - e.g. ``cernsearch-test-index_v1.0.0.json`` in the ``cernsearch-test`` folder
- Cluster is shared by all instances
    - Admin user ``cernsearch-*``
    - User with CRUD per alias (e.g. ``cernsearch-test-*``)
    - Some instances share a set of indexes (EDMS)
- Creation of users done via SNOW ticket (not automated)

---

### PostgreSQL

- DBoD database. Currently measuring load.
- Future plans:
    - Get DBoD instance
    - Automate DB creation via APB
        - Can be bound to OpenShift templates

---

### OpenShift

- Following ``invenio/template-openshift``
- Master project for image push from GitLab
- Child projects per instance
- Route is tweaked:
    - Reencrypt
    - Service self-signed cert
    - For OAuth
- Future: Use cern-sso-proxy

[GitLab repo link](https://gitlab.cern.ch/webservices/cern-search/cern-search-rest-api-openshift)

---

### Changes to Invenio modules

- Invenio-oauthclient: Override ``cern_authorized_singup_handler`` to add authorization via egroups.
- Invenio-accounts: Tried to hide some components from the user ([Issue](https://github.com/inveniosoftware/invenio-accounts/issues/261))
    - Redirect to ``/account/settings/applications`` via Nginx
    - Disable ``ACCOUNTS_SESSION_ACTIVITY_ENABLED``. Otherwise ``sessions:login_listener`` crashes.

---

### Changes to Invenio modules

- Invenio-records-rest: Override permission factory, PR for ``list_permission``.
    - Authenticates the service (client)
    - Egroup based
    - Admin user
- CRUD: Create is authorized at index level; Read, Update and Delete at document level.
- List is done via ``ES filter`` (subclass of ``RecordsSearch``)

---

### Crawler/Indexer

- For Web Services websites
- Using Scrapy
- Communication using ActiveMQ (Central Service)

[Crawler GitLab repo](https://gitlab.cern.ch/webservices/cern-search/web-crawler)

[Indexer GitLab repo](https://gitlab.cern.ch/webservices/cern-search/web-indexer)

---

## Thanks to all for the help!

---

# Q&A?

---
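### Backup: access control sketch

A minimal sketch of the egroup-based access model described in the Invenio-records-rest slide: create is checked per index, read/update/delete per document, and listing is restricted with an Elasticsearch filter. This is not the service's code. The ``_access`` field layout, the ``INDEX_CREATORS`` mapping and all function names are assumptions made for illustration.

```python
# Hypothetical sketch: the "_access" field names (owner, read, update, delete),
# INDEX_CREATORS and every function name here are illustrative assumptions.

INDEX_CREATORS = {"cernsearch-test": {"search-test-admins"}}


def can_create(index, user_egroups, index_creators):
    """Create is decided at index level: the user needs an allowed egroup."""
    return bool(index_creators.get(index, set()) & set(user_egroups))


def can_access(record, action, user_egroups):
    """Read/update/delete are decided per document; owners may do everything."""
    access = record.get("_access", {})
    allowed = set(access.get(action, [])) | set(access.get("owner", []))
    return bool(allowed & set(user_egroups))


def list_filter(user_egroups):
    """Elasticsearch bool filter that keeps only documents the user may read."""
    groups = sorted(user_egroups)
    return {
        "bool": {
            "filter": [
                {
                    "bool": {
                        "should": [
                            {"terms": {"_access.read": groups}},
                            {"terms": {"_access.owner": groups}},
                        ],
                        "minimum_should_match": 1,
                    }
                }
            ]
        }
    }


if __name__ == "__main__":
    doc = {"_access": {"owner": ["admins"], "read": ["readers"]}}
    print(can_access(doc, "read", ["readers"]))
    print(list_filter(["readers"]))
```

In a setup like the one on the slide, a query of this shape could be applied by a ``RecordsSearch`` subclass so that list results never include documents the caller cannot read.

---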
{"title":"Invenio Dev Forum - CERN Search as a Service","slideOptions":{"theme":"moon","transition":"slide"}}