MapReduce (MR) in Big Data
Evaluation of high-level query languages based on MapReduce in Big Data
MapReduce (MR) is a standard Big Data processing model for parallel and distributed computation over large datasets. The model suffers from well-known difficulties stemming from the low-level and batch nature of MR, which gave rise to abstraction layers built on top of MR. Several High-Level MapReduce Query Languages built on top of MR therefore provide more abstract query languages and extend the MR programming model. These High-Level MapReduce Query Languages remove the burden of MR programming from developers and enable a smooth migration of existing SQL skills to Big Data.
This paper investigates the most commonly used High-Level MapReduce Query Languages built directly on top of MR, which translate queries into executable native MR jobs. It evaluates the performance of the four presented High-Level MapReduce Query Languages: JAQL, Hive, Big SQL and Pig, with regard to their distinctive features and ease of programming.
The baseline metrics reported are increasing input size, scaling out the number of nodes, and controlling the number of reducers. The experimental results highlight the technical advantages and limitations of each High-Level MapReduce Query Language. Finally, the paper provides a summary to help developers choose the High-Level MapReduce Query Language that fulfills their needs and interests.
- High Level MapReduce Query Languages
- Big SQL
- Big Data
- Performance comparison
Since it was presented by Google in 2004, MapReduce (MR) has emerged as a popular framework for Big Data processing in cluster environments and cloud computing. It has become a key to success in processing, analyzing and managing large datasets, with a number of implementations including the open-source Hadoop framework [3, 4]. MR has many attractive qualities, most notably its simple design and the plainness of writing programs. It has only two functions, known as Map and Reduce, written by the developer to process key-value data pairs. Although the principle and basic concepts of MR are easy to understand, its functions are hard to develop, optimize, and maintain, especially in large-scale projects.
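To make the two-function model concrete, the classic word count can be expressed with exactly a Map and a Reduce function over key-value pairs. The following is a minimal local Python sketch of the model (all names are illustrative; real Hadoop jobs are written against the Hadoop API, typically in Java, and run distributed):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: emit a (word, 1) pair for each word in the input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input (key, value) record.
    pairs = [kv for key, value in records for kv in map_fn(key, value)]
    # Shuffle phase: group the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0)))
    # Reduce phase: apply reduce_fn to each key group.
    return dict(kv for k, vs in grouped for kv in reduce_fn(k, vs))

counts = run_mapreduce([(0, "big data big"), (1, "data model")], map_fn, reduce_fn)
print(counts)  # {'big': 2, 'data': 2, 'model': 1}
```

Each intermediate pair can be computed independently, which is what lets the framework parallelize the Map and Reduce phases across a cluster.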
MR requires approaching any problem in terms of key-value pairs, where each pair can be computed independently. Moreover, coding efficient MapReduce programs, mainly in Java, is non-trivial for those who want to build large-scale projects, regardless of their programming level. Many operations, both simple and complex, need multiple inputs or outputs and are very hard to achieve without a great deal of programming effort and time. Furthermore, MR has several limitations stemming from its batch nature when handling streaming data [6, 7]. In addition, it is ill-suited to operations with multiple inputs, such as the Join operation [6, 8]. The difficulty of low-level MR programming therefore gave rise to high-level query languages (HLQL) based on MR: an abstraction layer constructed directly on top of MR, designed to translate queries and scripts into executable native MR jobs.
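The Join limitation is easy to see in the key-value model: a reduce-side join must be hand-coded by tagging each input with its source relation so the reducer can reassemble matching records. The sketch below is a simplified local Python analogue (assuming two relations joined on their first field; real MR joins also require custom partitioning and secondary sorting):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Simulate an MR reduce-side join of two relations on their first field."""
    # Map phase: tag each tuple with its source relation ("L" or "R")
    # so the reducer can tell the two inputs apart.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    # Shuffle phase: group tagged values by join key.
    groups = defaultdict(list)
    for key, tv in tagged:
        groups[key].append(tv)
    # Reduce phase: cross the left and right values for each key.
    out = []
    for key, tvs in groups.items():
        lvals = [v for tag, v in tvs if tag == "L"]
        rvals = [v for tag, v in tvs if tag == "R"]
        out.extend((key, lv, rv) for lv in lvals for rv in rvals)
    return out

rows = reduce_side_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y")])
print(rows)  # [(1, 'a', 'x')]
```

In an HLQL the same operation is a single JOIN clause; here it already takes three explicit stages for the simplest possible case.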
High-Level MapReduce Query Languages (MapReduce-based HLQL) are built directly on top of Hadoop MR to make MR easier to use as the low-level programming model. They address the lack of support that MR provides for complex dataflows and offer explicit support for multiple data sources. Moreover, the choice of Big Data processing model plays a crucial role in the implementation of an HLQL. Several Big Data processing models have been proposed in combination with MR, such as Resilient Distributed Dataset (RDD) and Bulk Synchronous Processing (BSP) [10, 11]. These two models can be combined with Hadoop but do not depend on MR.
Other HLQL are based on RDD or BSP as the Big Data programming model, which offer an abstract model of parallel architectures and algorithms based on in-memory or processor computing. RDD relies on lazy evaluation of its data transformations. RDD data can be recomputed if lost in a failure, because RDD avoids data replication. HLQL based on RDD perform many iterative computations on the same data, which requires a lot of memory to keep the data resident.
HLQL based on RDD
The use of main memory by HLQL based on RDD raises the cost of the infrastructure, because memory resources are far more expensive than disk. As data continue to grow, HLQL based on RDD need ever more expensive infrastructure compared with MapReduce-based HLQL. The BSP model shows the same issues as RDD, but in terms of processor-memory pairs. In addition, the BSP model is a sequence of super-steps, and each super-step proceeds in three phases. In each phase, data are computed by processors from memory.
Therefore, BSP needs to minimize the computation time of each phase's partitions. Many interactions are required between processors, and the synchronization between processors is also a potential source of errors in HLQL based on BSP.
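The super-step structure described above can be sketched schematically. The following local Python sketch is illustrative only (all names are assumptions; real BSP systems run each worker on its own processor, with real message delivery and a hardware or runtime barrier):

```python
def bsp_superstep(states, inboxes, compute):
    """Run one BSP super-step over all workers; return new states and inboxes."""
    outboxes = [[] for _ in states]
    new_states = []
    # Phase 1: local computation - each worker reads its inbox and
    # produces a new state plus outgoing (destination, message) pairs.
    for wid, (state, inbox) in enumerate(zip(states, inboxes)):
        new_state, messages = compute(wid, state, inbox)
        new_states.append(new_state)
        for dest, msg in messages:
            outboxes[dest].append(msg)
    # Phase 2: communication - messages are delivered to their
    # destination workers' inboxes for the next super-step.
    # Phase 3: barrier synchronization - no worker proceeds until all
    # have finished (implicit here, since the loop runs sequentially).
    return new_states, outboxes

# Example: each worker adds received values, then reports its state to worker 0.
def compute(wid, state, inbox):
    state += sum(inbox)
    return state, [(0, state)]

states, inboxes = bsp_superstep([1, 2, 3], [[], [], []], compute)
print(states)  # [1, 2, 3]; worker 0's next inbox now holds [1, 2, 3]
```

The barrier at the end of every super-step is exactly the synchronization point that the text identifies as a cost and a potential source of errors.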
MR Processing Model
In light of the limitations of these processing models, in terms of their exhaustive need for expensive memory and for successive data transformations, this paper focuses on the MR processing model. This model has several advantages: it processes data on disk without requiring many data transformations, it rests on a simple programming model that makes programs plain to write, and it is a cost-effective and flexible Big Data processing model with access to various new application sources [15, 17]. We therefore investigate MapReduce-based HLQL built directly on top of MR, which translate all queries and scripts into native MR jobs for execution.
Their queries are automatically compiled into equivalent native MR jobs for execution, whereas other HLQL do not perform this native translation into jobs. These MapReduce-based HLQL provide more abstract query languages that extend the MR programming model. They remove the burden of MR programming from developers and enable a smooth migration of existing SQL skills to the Big Data environment. JAQL, Big SQL, Hive and Pig are the most widely used systems built on top of MR that translate their queries into native MR jobs; their query languages are, respectively, JAQL, ANSI SQL, HiveQL and Pig Latin.
Four MapReduce-based HLQL
The four MapReduce-based HLQL presented in this paper have built-in support for data partitioning, parallel execution and random access to data. We chose these four because they let developers write small, easy programs that are equivalent to MR programs. All of these languages are built on top of MR and compile and execute their queries as native MR jobs [5, 19]. The developer can write high-level programs instead of low-level MR programs.
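To give a flavor of that abstraction gain, the word count that requires explicit Map and Reduce functions and a shuffle stage in native MR collapses to a single declarative expression in a high-level language. The fragment below is only an illustrative Python analogue (actual Pig Latin or HiveQL syntax differs, and there `Counter` would be a compiled, distributed MR job):

```python
from collections import Counter

lines = ["big data big", "data model"]

# Declarative style in the spirit of an HLQL query: one expression,
# with no explicit map/shuffle/reduce plumbing. A MapReduce-based
# HLQL compiles a similarly short script into native MR jobs.
counts = Counter(word for line in lines for word in line.split())
print(counts["big"])  # 2
```

This gap between a one-line query and a hand-written multi-stage job is what the conciseness comparison in this paper measures.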
Furthermore, our study is of direct use to the MapReduce-based HLQL development communities: it describes technical limitations and advantages, presents results, and provides recommendations for each MapReduce-based HLQL regarding its features, ease of programming and performance. The essential metrics used in our work are increasing input size, scaling out the number of nodes, and controlling the number of Reducer tasks. These are the appropriate, canonical benchmarks used in the MR and HLQL literature [5, 22, 23, 24].
Moreover, to demonstrate how much shorter the queries of each MapReduce-based HLQL are, we specifically investigate language conciseness to give insight into programming-language perspectives such as ease of programming (via the average ratio compared with MR), ease of configuration to link with Hadoop, and execution environment [5, 9]. We also provide a comparison summary to help developers choose the MapReduce-based HLQL that fulfills their needs and interests.
The rest of the paper is organized as follows. "Related work" section presents the literature review and related works. "Background and languages" section gives a synthetic study of these MapReduce-based HLQL in order to present their advantages, perspectives and architecture, and compares the conciseness of each MapReduce-based HLQL using source-lines-of-code and instruction-count metrics. "High-level MapReduce query languages comparison" section further compares how expressive the MapReduce-based HLQL are, to determine whether they pay a performance penalty for providing more abstract languages. "Results and discussion" section takes, for each benchmark metric, a direct MR implementation as the performance baseline, which allows us to assess the overhead of each MapReduce-based HLQL. Finally, "Conclusion" section concludes this paper.
Article source: Springer Open