Sung-Kyeong Kim, Hyungwoo Park, Jong-Bae Kim

BI/DW (Business Intelligence/Data Warehouse) systems are commonly used to integrate and analyze various data and provide the results to users. Such BI/DW systems have been built on traditional RDBMSs. However, as attempts to analyze ever larger volumes of data increase, traditional RDBMS-based data processing has become more expensive or even impossible. As an alternative, Hadoop can be considered as a framework for distributing and processing data, because it is capable of processing large amounts of data; however, it is difficult to obtain fast, high-performance results with it. Fortunately, in recent years, technologies have been developed to overcome this problem, including Apache Tez, Stinger, Presto, Impala, and Drill. These technologies are in-memory distributed data processing engines that build on and extend basic Hadoop. However, the performance of these distributed data processing engines has not yet been studied. Therefore, in this study, we examine the characteristics of distributed data processing engines and experiment with engines that can obtain appropriate results. We measure processing time with an emphasis on query performance and examine the advantages and disadvantages of the engines in direct comparison with existing commercial databases. The objective of this direct performance comparison is to identify the best alternative.

Volume 11 | 05-Special Issue

Pages: 273-280