Evaluation of an abstraction layer for next-generation analytic engines and scalable databases
===
__Description:__
`The interest in using Big Data solutions based on the Hadoop ecosystem is constantly growing in the HEP community, in particular for use cases related to data analytics and data warehousing. At the same time, OLTP databases continue to be critical parts of many systems, notably online acquisition systems that need to store and retrieve metadata and configurations in real time (e.g. accelerator controls). Many of these users currently rely on a traditional database (e.g. Oracle); for most of them the goal is to move these parts to next-generation systems that can scale out, so that they can profit more from their data. The objective is to evaluate existing abstraction engines that would allow us to provide a long-term service while reducing the risk posed by the very fast-changing pace of these technologies.`
__Justification:__ `Long-term support for analytic services and scalable databases: many of the open-source projects in the "big data" space change rapidly; we wish to provide an intermediate layer to give stability to the developments.`
---
## Plan of work
Note: This plan may change depending on the evolution of the project
* **Exploration (April)**
* Explore the industry offering (mainly cloud) of SQL-on-anything solutions
* who is using what - we need to understand the trends
* **Installation and evaluation of candidate technologies (May - August 2019)**
* Candidate technologies to be tested
* Presto
* Apache Hive (with LLAP)
* Apache Drill
* Spark-SQL thrift server?
* Apache Druid?
* Dremio?
* Iceberg?
* HAWQ?
* Apache Trafodion
* Apache Kylin
* Tajo?
* others from the 'Exploration' phase
* Current backends of interest:
* HDFS (JSON, Parquet, ORC)
* HBase (probably via Apache Phoenix)
* Kafka (JSON, Avro serialization)
* RDBMS (Oracle, PostgreSQL, MySQL)
* Key aspects to test (a minimal JDBC smoke test covering these checks is sketched after this plan)
* JDBC support
* Evaluate feasibility of write and read paths
* typically we can read; can we also write via SQL at scale?
* Support of complex structures
* can we have nested types and easily access them?
* Performance
* run typical benchmarks (e.g. TPC-DS)
* real queries on top of imported production data
* Multitenancy
* can it be easily used in a multi-user environment?
* do access control lists exist, and can they be granted per user/e-group?
* is there a way to adjust the available resources per user/e-group?
* Community
* is the project backed by a large community of contributors? Is it actively developed?
* Deployment complexity
* is it easy to deploy within our infrastructure (Hadoop + Puppet + e-groups)?
* The tests will be performed on the Hadalytic cluster
* ***Present the results (internally and at the ITDB tech forum, September - October 2019)***
* **Implement a prototype service of the chosen technology stack on our infrastructure (October 2019 - March 2020), this includes**
* puppetization, integration with current production Hadoop installations, implementation of ACLs
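A minimal sketch of the kind of engine-agnostic JDBC smoke test foreseen above, covering the read and write path checks. The JDBC URL, credentials, and `test_table` are hypothetical placeholders; each candidate engine ships its own driver and URL scheme (e.g. `jdbc:presto://host:8080/catalog/schema` or `jdbc:drill:zk=host:2181`):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Generic JDBC smoke test, run unchanged against each candidate engine to
// compare them on the same checks: connectivity, read path, write path.
public class SqlLayerSmokeTest {
    public static void main(String[] args) throws Exception {
        String url = args[0]; // engine-specific JDBC URL (placeholder)
        try (Connection conn = DriverManager.getConnection(url, "testuser", null);
             Statement stmt = conn.createStatement()) {
            // Read path: a trivial scan on a hypothetical test table
            try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM test_table")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
            // Write path: not every engine supports INSERT (see the overview
            // table), so a failure here is itself a useful data point
            try {
                stmt.executeUpdate("INSERT INTO test_table VALUES (1, 'smoke')");
            } catch (Exception e) {
                System.out.println("write path unsupported: " + e.getMessage());
            }
        }
    }
}
```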
---
## 1. Exploration phase
* Selected technologies overview:
| | Presto | Hive | Impala | Spark SQL | Drill | HAWQ | Dremio | Phoenix |
| -------- | -------- | -------- | ------ | --------- | ------ | ------ |----- | ----|
| Supported backends | multiple | HDFS, HBase, S3, Alluxio | HDFS, HBase | multiple | HDFS, S3, HBase, and many others | HDFS, HBase | NoSQL, RDBMS, HDFS, S3, HBase (deprecated), etc. | HBase |
| Open source | yes | yes | yes | yes | yes | yes | yes | yes |
| JDBC | yes | yes | yes | yes | yes | yes | yes | yes |
| Multitenancy | yes, with query priorities | yes, priorities can be handled by YARN | yes, but no priorities without CM | yes, on YARN queues | yes | yes, but max 1280 concurrent connections on the master; priorities via YARN resources | yes, with priorities based on CPU/user/group/query type | yes |
| Community | Facebook, 300 contributors | Hortonworks, widely adopted, 200 contributors | Cloudera, 100 contributors | Databricks, Cloudera, 1400 contributors | 120 contributors | Pivotal, 70 contributors | 120 contributors | 80 contributors |
| Read and write | not for all backends | yes, but limited (ORC only) | yes, but write support is weak | yes | doesn't support insert/update | supported | supported | supported |
| Nested structures | yes | yes | limited | yes | yes | yes/limited | yes | yes |
| Performance | ? | ? | ? | ? | ? | ? | ? | ? |
| Deployment complexity | relatively simple | easy, integrated with YARN | easy | easy integration with Spark | straightforward | relatively simple | medium complexity | simple |
| Used in the cloud? | Facebook, Airbnb, Netflix, etc. | Hortonworks, Facebook, etc. | JPMorgan Chase, Cloudera, etc. | UC Berkeley, Amazon, etc. | Ericsson, Intuit, etc. | VMware | Microsoft, UBS, TransUnion | Dell, Huawei, eBay, Intel, etc. |
| Fault tolerance | no, but failure rate < 0.2% | yes, mid-query fault tolerance | no | yes | yes | yes | ? | yes |
| Data formats supported | many, including ORC, Parquet | Avro, ORC, Parquet, and others | Parquet, Avro, and others | many, since it is compatible with Hive | JSON, Hive formats, Parquet, Kafka streams | Parquet, Avro, HBase formats | Parquet, Hive formats | ORC, Parquet (?) |
* Foreseen stack (one possible instantiation is sketched after the table below):
* Storage backend - storage systems such as HDFS, Kafka, HBase, and RDBMS
* Data access layer - software able to interact with the storage using SQL (Hive, Phoenix, SparkSQL, etc.)
* Top-level abstraction - software able to talk to multiple data access layers, or to storage backends directly, using SQL (note: some data access layers can fit in this category as well)
| Top Level Abstraction | Data access layer | Storage backend |
| --------------------- | ---------------- | -------------------------------|
| Presto | Apache Phoenix | HDFS |
| SparkSQL | Hive | HBase |
| Apache Calcite | Impala | Phoenix |
| Apache Drill | JDBC | RDBMS (Oracle, MySQL, Postgres)|
| Dremio? | Druid | Kafka |
| | Apache HAWQ | Cassandra? |
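As one possible (hypothetical) instantiation of this stack: SparkSQL as the top-level abstraction, Hive as the data access layer, and HDFS as the storage backend. The `logs.events` table is a placeholder:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// One possible instantiation of the foreseen stack:
// top-level abstraction = SparkSQL, data access layer = Hive (metastore),
// storage backend = HDFS. Database/table names are placeholders.
public class StackDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("abstraction-stack-demo")
                .enableHiveSupport() // resolve tables through the Hive metastore
                .getOrCreate();
        // The query is written once against the abstraction; Hive resolves
        // the table metadata, and the data itself is read from HDFS files.
        Dataset<Row> df = spark.sql(
                "SELECT device, count(*) FROM logs.events GROUP BY device");
        df.show();
        spark.stop();
    }
}
```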
* Technologies discarded because they are closed-source, cloud-native, or commercially licensed
* Google BigQuery/Dremel - interactive, multi-user, columnar storage format, query execution trees, nested data, linear scalability, fault-tolerant, data on Google Cloud or the Google File System, web UI for querying --- paid per query or at a flat rate
* Amazon Redshift Spectrum - fast performance, quick scaling, low deployment complexity, multiple data formats supported, S3-compatible --- paid per terabyte per year
* Oracle Big Data SQL - data analysis across Hadoop, Kafka, RDBMS, and NoSQL, Storage Index technology, multiple data formats supported --- paid per service
* IBM Big SQL - multiple concurrent queries with good performance, interactive queries, supports Hive, HBase, Spark --- paid per month
* Jethro - performance driven by its own indexing technology, high availability, scalability, easy integration, HDFS and S3 support, and BI tools such as Tableau and Qlik
---
* Interesting technologies to be evaluated as data access layers?
* Apache Ignite - an in-memory computing platform that is durable, strongly consistent, and highly available with powerful SQL, key-value, and processing APIs.
* Apache Kylin - OLAP
* Apache Calcite - Dynamic data management framework
* Apache Tajo - Robust big data relational and distributed data warehouse system for Apache Hadoop, distributed SQL query processing engine, scalable ad-hoc queries, online aggregation, ETL, low latency, fault-tolerant, dynamic scheduling, Hive metastore support, JDBC, user-defined functions, JSON, ORC, Parquet, Nested structures.
---
* Similar abstraction layers from enterprises
* Janus - Uber's Unified Query Execution Service [Link](https://cwiki.apache.org/confluence/display/Hive/Presentations?preview=/27362054/61329033/Hive%20at%20Uber.pptx)
* BigSQL Architecture - Similar to our desired architecture
<img src="https://media.springernature.com/lw785/springer-static/image/art%3A10.1186%2Fs40537-018-0146-3/MediaObjects/40537_2018_146_Fig4_HTML.png" width="350" height="400" align="middle" >
---
### TODO
* Make a short list of the candidate technologies for the next phase
* for the data access layer
* for the generic SQL layer
### Short list of technologies
##### Generic SQL
* Presto
* SQL on many things, the most complete solution
* to be checked: usage of JDBC, connectors for Oracle and Phoenix (not yet in master)
* Apache Calcite
* SQL parser and optimizer used by Hive, Phoenix, Drill, Druid, etc.; good candidate for a JDBC proxy front-end (see the sketch after this list)
* https://www.slideshare.net/JordanHalterman/introduction-to-apache-calcite
* Dremio
* BI solution on top of anything; uses Drill
* SparkSQL
* SQL engine; interactivity over JDBC (Thrift server) needs to be checked
* Apache Drill
* omni-SQL solution, pluggable
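To illustrate the JDBC proxy idea mentioned for Calcite above, a minimal sketch; the model file path, schema, and table names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Calcite as a thin JDBC front-end: clients connect to Calcite, which parses
// and optimizes the SQL and pushes it down to the configured backend. The
// model file (placeholder) declares the schemas Calcite exposes, e.g. an
// RDBMS schema wrapped via org.apache.calcite.adapter.jdbc.JdbcSchema.
public class CalciteProxyDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.calcite.jdbc.Driver");
        String url = "jdbc:calcite:model=/path/to/model.json"; // placeholder path
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT * FROM backend_schema.some_table")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```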
##### Data access layer
* Phoenix (HBase)
* performance, transactions, secondary indexes (see the sketch after this list)
* Hive (LLAP)
* performance, transactions, ORC
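A minimal sketch of the Phoenix access path over JDBC, assuming a hypothetical ZooKeeper quorum and table; note that Phoenix uses `UPSERT` rather than `INSERT`, and that secondary indexes are created with plain DDL:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Phoenix exposes HBase tables through standard JDBC/SQL. The ZooKeeper
// quorum, table, and column names below are placeholders.
public class PhoenixDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS metrics "
                    + "(id BIGINT NOT NULL PRIMARY KEY, val DOUBLE)");
            // Secondary index, one of the Phoenix features of interest
            stmt.execute("CREATE INDEX IF NOT EXISTS metrics_val_idx ON metrics (val)");
            // Phoenix uses UPSERT (insert-or-update) instead of plain INSERT
            stmt.execute("UPSERT INTO metrics VALUES (1, 0.5)");
            conn.commit(); // Phoenix connections default to autocommit off
        }
    }
}
```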
##### Anything else???
## 2. Evaluation
### Goals
* Drill vs Presto
### DRILL
* tested features
* supports HDFS impersonation
* supports PAM-based authentication (via JPam), quite a nice feature
* supports Kerberos-based authentication
* extracts NXCALS Parquet and JSON with nested records (see the sketch after this section)
* exposes system state information and statistics as built-in tables
* HBase connector
* Phoenix plugin using JDBC
* Hive connector - does support ORC via Hive
* performance of analytics on Parquet (in contrast to Presto)
* works with predicates
* a column family is seen as a column in Drill; its columns are displayed as JSON
* no impersonation support
* ACLs on storage plugins do not seem to exist
* LIMITATIONS:
* doesn't support analytic SQL constructs such as GROUPING SETS, INTERSECT, cartesian joins, ROLLUP, EXCEPT
* unlike Parquet, it cannot read ORC files directly from HDFS
* it cannot create Hive/HBase tables mapped to ORC files
* although it supports reading Hive tables mapped to ORC files, there is a bug leading to malformation of ORC files if they are read multiple times
* the Phoenix plugin for Drill cannot run a query without specifying a LIMIT, due to an underlying size estimation of the query output
* to be tested
* Oracle connector
* Kafka connector
* MySQL connector
* tables and partitions (HDFS)
* joins across multiple storages
* SSL on the web UI
* Drill on YARN
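A minimal sketch of the nested-record access tested above, reading a JSON file through Drill's `dfs` plugin over JDBC; the drillbit host, file path, and field names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Drill addresses nested JSON/Parquet fields with dotted paths and explodes
// arrays with FLATTEN. The file path and field names are placeholders.
public class DrillNestedDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:drill:drillbit=drill-host:31010");
             Statement stmt = conn.createStatement();
             // dotted path into a nested record; Drill uses backticks
             // around file paths in the dfs plugin
             ResultSet rs = stmt.executeQuery(
                 "SELECT t.payload.device AS device, FLATTEN(t.payload.samples) AS sample "
                 + "FROM dfs.`/data/nxcals/sample.json` t LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("device") + " " + rs.getString("sample"));
            }
        }
    }
}
```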
### PRESTO
* tested features:
* Phoenix connector
* performance of Parquet vs ORC
* performance in contrast to Drill
* Hive connector
* SSL on the web UI
* supports kerberos-based authentication
* to be tested
* Oracle connector
* Kafka connector
* MySQL connector
* tables and partitions (HDFS)
* joins across multiple storages (see the sketch after this section)
* JDBC driver
* LIMITATIONS:
* Doesn't support updates/deletes on Phoenix tables through the corresponding connector
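A minimal sketch of the cross-storage joins still to be tested, assuming hypothetical `hive` and `phoenix` catalogs configured on the Presto coordinator (a Phoenix connector was not yet in the Presto master at the time of writing, and per the limitation above such access would be read-only):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Presto addresses each backend as catalog.schema.table, so a single query
// can join HDFS data (via the hive catalog) with HBase data (via a phoenix
// catalog). Host, catalog, schema, and table names are placeholders.
public class PrestoFederationDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:presto://presto-host:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "testuser", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT h.device, p.status "
                 + "FROM hive.logs.events h "
                 + "JOIN phoenix.meta.devices p ON h.device = p.device "
                 + "LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("device") + " " + rs.getString("status"));
            }
        }
    }
}
```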
## 3. Implementation