<img src="https://codimd.web.cern.ch/uploads/upload_f7c3b7e8c07d5e5907f1bcf36360335d.png" style="background:none; border:none; box-shadow:none;">
<p style="font-size: 60%; color: white; margin-top: 0px;">www.cern.ch</p>
---
### CERN Search as a Service
#### Pablo Panero
---
### Search as a Service
1. Current search
2. Requirements of the solution
3. Proposed solution / Current status
4. How is it built
4.1. Elasticsearch
4.2. PostgreSQL
4.3. OpenShift
4.4. Changes to Invenio modules
5. Crawler/Indexer
---
### Current search
- MS Search
- Black box
- Does not cover all our use cases
- Maintenance / Operational costs
- Vendor lock-in
- Need a more ad-hoc solution
---
### Requirements of the solution: The "Why"
- Customize user documents' schemas
- Customize user search patterns
- Improve:
  - Search and indexing performance/time (~real-time)
  - User-centered BI capabilities (reporting, statistics...)
  - Maintenance, disaster recovery, etc.
  - Access control granularity
---
### Proposed solution / Current status: The "What"
- Multiple instances: Indico, Web Services and TWiki. EDMS and Drupal coming soon.
- One global "CERN Search" instance that will gather all the public/allowed content
---
### Proposed solution / Current status: The "What"
* Two RESTful APIs
  - Search API: CRUD documents
  - Binary API: Parsing binary documents (PDF, Word, etc.)
* [Service documentation](http://cernsearchdocs.web.cern.ch/cernsearchdocs/)
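
A minimal sketch of what creating a document through the Search API could look like. The host, the `/api/records/` path, the token and the document fields are illustrative assumptions, not taken from the service documentation; the request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical values: use the ones given in the service documentation instead.
BASE_URL = "https://example-search.cern.ch"
TOKEN = "my-access-token"


def build_create_request(document):
    """Build (but do not send) a POST request that would create one document."""
    return urllib.request.Request(
        url=BASE_URL + "/api/records/",  # assumed endpoint path
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + TOKEN,
        },
        method="POST",
    )


# Example document with made-up fields.
request = build_create_request({"title": "Hello", "content": "First document"})
print(request.get_method(), request.full_url)
```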
---
### Search as a Service: The "How"
* Build a platform to offer search as a **service**
  - Elasticsearch
  - PostgreSQL
  - Redis
  - RabbitMQ
  - OpenShift
  - Python (Invenio)
[Gitlab repo link](https://gitlab.cern.ch/webservices/cern-search/cern-search-rest-api)
---
### Elasticsearch
- Central Service
- Index prefixes are handled via folders ([PR invenio-search](https://github.com/inveniosoftware/invenio-search/pull/145))
  - e.g. cernsearch-test-index_v1.0.0.json in ``cernsearch-test`` folder
- Cluster is shared for all instances
- Admin user ``cernsearch-*``
- User with CRUD per alias (e.g. ``cernsearch-test-*``)
- Some instances share a set of indexes (EDMS)
- Creation of users done via SNOW Ticket (not automated)
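
A small sketch of how the wildcard patterns above partition a shared cluster between instances. The index names are hypothetical, and `fnmatchcase` only approximates the way Elasticsearch evaluates index patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical index names living on the shared cluster.
INDEXES = [
    "cernsearch-test-index_v1.0.0",
    "cernsearch-indico-events_v1.0.0",
    "other-service-logs",
]


def visible_indexes(pattern, indexes=INDEXES):
    """Return the index names covered by a wildcard pattern."""
    return [name for name in indexes if fnmatchcase(name, pattern)]


print(visible_indexes("cernsearch-*"))       # admin user pattern
print(visible_indexes("cernsearch-test-*"))  # per-instance user pattern
```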
---
### PostgreSQL
- DBoD database. Currently measuring load.
- Future plans:
  - Get DBoD instance
  - Automate DB creation via APB
    - Can be bound to OpenShift templates
---
### OpenShift
- Following ``invenio/template-openshift``
- Master project for image push from Gitlab
- Children projects per instance
- Route is tweaked:
  - Reencrypt
  - Service self-signed cert
  - For OAuth
- Future: Use cern-sso-proxy
[Gitlab repo link](https://gitlab.cern.ch/webservices/cern-search/cern-search-rest-api-openshift)
---
### Changes to invenio modules
- Invenio-oauthclient: Override ``cern_authorized_signup_handler`` to add authorization via e-groups.
- Invenio-accounts: Tried to hide some components from the user ([Issue](https://github.com/inveniosoftware/invenio-accounts/issues/261))
- Redirect to ``/account/settings/applications`` via Nginx
- Disable ``ACCOUNTS_SESSION_ACTIVITY_ENABLED``; otherwise ``sessions:login_listener`` crashes.
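
A rough sketch of the two ideas above: the configuration switch named on the slide, and an e-group check of the kind an overridden sign-up handler might perform. The group names and the `is_authorized` helper are hypothetical, not the actual handler code.

```python
# Config fragment: the variable named on the slide, set in the application config.
ACCOUNTS_SESSION_ACTIVITY_ENABLED = False

# Hypothetical e-group names allowed to use this instance.
ALLOWED_EGROUPS = {"search-instance-admins", "search-instance-users"}


def is_authorized(user_egroups, allowed=ALLOWED_EGROUPS):
    """Accept the login only if the user belongs to at least one allowed e-group."""
    return not allowed.isdisjoint(user_egroups)


print(is_authorized(["search-instance-users", "some-other-group"]))
print(is_authorized(["some-other-group"]))
```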
---
### Changes to invenio modules
- Invenio-records-rest: Override permission factory, PR for list_permission.
- Authenticates the service (client)
- Egroup based
- Admin user
- CRUD: C at index level, RUD at document level.
- List is done via ``ES filter`` (Subclass ``RecordsSearch``)
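
A minimal sketch of the document-level checks and the listing filter, using plain dictionaries. The `_access` field layout and the e-group names are assumptions for illustration; the real service applies this through a permission factory and a ``RecordsSearch`` subclass.

```python
def can_access(document, action, user_egroups):
    """Document-level check for 'read', 'update' or 'delete'."""
    allowed = document.get("_access", {}).get(action, [])
    return any(group in allowed for group in user_egroups)


def list_filter(user_egroups):
    """Elasticsearch bool filter restricting a listing to readable documents."""
    return {
        "bool": {
            "filter": [{"terms": {"_access.read": list(user_egroups)}}]
        }
    }


# Hypothetical document: readers and editors are separate e-groups.
doc = {"_access": {"read": ["readers"], "update": ["editors"], "delete": ["editors"]}}
print(can_access(doc, "read", ["readers"]))
print(can_access(doc, "delete", ["readers"]))
print(list_filter(["readers"]))
```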
---
### Crawler/Indexer
- For Web Services websites
- Using Scrapy
- Communication using ActiveMQ (Central Service)
[Crawler Gitlab repo](https://gitlab.cern.ch/webservices/cern-search/web-crawler)
[Indexer Gitlab repo](https://gitlab.cern.ch/webservices/cern-search/web-indexer)
---
## Thanks to all for the help!
---
# Q&A?
---
{"title":"Invenio Dev Forum - CERN Search as a Service","slideOptions":{"theme":"moon","transition":"slide"}}