MR Processing Model: Related Work

Since its original publication in [1], the MR processing model has become a standard for certain analytical tasks. The same authors report a substantial increase in the number of MR jobs run at some very large companies during its first years [25]. Since then, MR has become a popular Big Data processing model for large-scale data sets running on clusters in a fault-tolerant manner [1, 3]. In addition to the MR processing model, several other Big Data processing models have been proposed, such as RDD [11] and BSP [10], which serve as alternative Big Data processing models underlying some HLQL. Fegaras et al. [26] proposed MRQL, an SQL-like query language for large-scale data analysis. MRQL is an HLQL based on BSP that can evaluate its queries in four modes [27]: MR mode using Apache Hadoop [1], BSP mode using Apache Hama [28], Spark mode using Apache Spark [24], and Flink mode using Apache Flink [29].

BSP Programming Model

In its MR mode, MRQL translates MRQL queries into algebraic operators; it then optimizes this algebraic form and translates it into a physical plan consisting of physical MR operators rather than native MR jobs [26]. In [28], the authors presented Apache Hama, a BSP-based processing framework that not only provides a pure BSP programming model but is also built on top of HDFS [30]. Further, the work in [31] presents Apache Drill, an engine for interactive ad-hoc analysis created as an Apache Incubator project [32]. Drill offers a low-latency SQL engine, but its application tooling and visualization allow very limited customization.
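
To make the BSP programming model concrete, the following minimal sketch illustrates the superstep structure that Hama-style programs follow: local computation, message exchange, and a global barrier, with messages delivered only at the next superstep. It is plain Java built around a CyclicBarrier, not the actual Apache Hama API, and the number of peers, the number of supersteps, and the message being passed are illustrative assumptions.

import java.util.concurrent.CyclicBarrier;

public class BspSketch {
    static final int PEERS = 4;        // illustrative number of BSP peers
    static final int SUPERSTEPS = 3;   // illustrative number of supersteps

    // Double-buffered "mailboxes": current holds the messages delivered at
    // the start of this superstep, next collects messages for the next one.
    static int[] current = new int[PEERS];
    static int[] next = new int[PEERS];

    // The barrier action runs once per superstep and swaps the buffers,
    // mimicking BSP message delivery at the synchronization point.
    static final CyclicBarrier barrier = new CyclicBarrier(PEERS, () -> {
        int[] tmp = current;
        current = next;
        next = tmp;
    });

    public static void main(String[] args) {
        for (int p = 0; p < PEERS; p++) {
            final int id = p;
            new Thread(() -> runPeer(id)).start();
        }
    }

    static void runPeer(int id) {
        try {
            for (int step = 0; step < SUPERSTEPS; step++) {
                int received = current[id];        // message from the last superstep
                int result = received + 1;         // local computation
                next[(id + 1) % PEERS] = result;   // send to the next peer
                barrier.await();                   // barrier ends the superstep
            }
            System.out.println("peer " + id + " finished with " + current[id]);
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}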

Moreover, Drill is not at all dependent on Hadoop [31]. Apache Phoenix is used only for HBase data and compiles queries into native HBase calls, without any interaction with MR [33]. Chang et al. [34] proposed HAWQ, another HLQL framework, which breaks complex queries into small tasks and distributes them to massively parallel processing (MPP) query-processing units for execution [23].

Hadoop framework

In [35], the authors presented Impala, an open-source MPP SQL query engine designed specifically to leverage the flexibility and scalability of the Hadoop framework. However, Impala is memory-intensive and not well suited to heavy data operations such as joins, because massive data sets cannot be held entirely in memory. Furthermore, Impala does not support fault tolerance. The authors proposed Llama as an intermediate component for integrated resource management within YARN [36]; consequently, Impala does not translate its queries into native MR jobs. Moreover, the number of Hadoop-based systems that use another framework to process SQL-like queries has increased significantly. Spark Core is the execution engine that processes SQL-like queries in the Spark framework, and every other functionality is built on top of Spark Core.

Armbrust et al. [37] presented Spark SQL, an SQL-like data processing layer built on top of Spark Core. Spark SQL processes data on the RDD abstraction, which has several limitations related to its computation model and in-memory nature [12, 13]. Spark SQL's use of main memory makes the infrastructure less cost-effective, because memory is much more expensive than disk, whereas Hadoop processes data on disk [38].
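
As an illustration of how such SQL-like processing sits on top of Spark Core, the following minimal sketch uses the public Spark Java API; the input file people.json and the column names name and age are hypothetical and are not taken from the paper.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        // Spark SQL runs on top of Spark Core: the SparkSession wraps the
        // underlying context that schedules work over RDD partitions.
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlSketch")
                .getOrCreate();

        // Load a (hypothetical) JSON data set and register it as a view.
        Dataset<Row> people = spark.read().json("people.json");
        people.createOrReplaceTempView("people");

        // The SQL query is optimized and executed in memory as operations
        // over the RDD abstraction.
        Dataset<Row> adults =
                spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}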

HLQL based on BSP or RDD

Spark requires a larger and more expensive infrastructure than Hadoop [14]. Furthermore, developers are constrained when using Spark SQL because it is supported only within Spark [19, 37]. In practice, however, companies need a generic solution that works with many native SQL or SQL-like languages and scales easily to future systems [19, 39]. As a result, HLQL based on BSP or RDD show many limitations that stem from the nature of their processing models and the manner in which they are implemented: they require more expensive memory and many complex computations on the data, their configuration for linking with other open-source frameworks is complex, and they do not provide easy programming.

In contrast, HLQL based on MR have built-in support for data partitioning, parallel execution, and random access to data. They offer the opportunity to write small, simple programs that are equivalent to MR programs. These MR-based HLQL compile their queries and scripts into executable native MR jobs and are built directly on top of MR, without many transformations on the data. The developer can thus write programs at a high level instead of writing low-level MR programs.
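
To illustrate the difference in programming level, the sketch below shows the classic word-count job written directly against the native Hadoop MapReduce API. An MR-based HLQL lets the developer replace such a program with a few lines of script; for example, assuming a table words with one word per row, roughly "SELECT word, COUNT(*) FROM words GROUP BY word" in HiveQL. Both the Java listing and the HiveQL line are illustrative examples, not taken from the paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}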

SQL-like high-level languages

In fact, many works have addressed MR query optimization. Pig Latin [21] and HiveQL [20] are HLQL developed on top of MR; they define SQL-like high-level languages for processing large-scale analytic workloads. A performance comparison between three HLQL was conducted in [5]: the authors show that Hive is more efficient than Pig when scaling the input data size, the number of processing units, and the execution time [40]. However, they used only two metrics to evaluate the performance of their three HLQL: JAQL, HiveQL, and Pig Latin. In [21], the authors describe the challenges they faced in developing Pig and report performance comparisons between Pig execution and raw MR execution.

In [41], the authors chose two specific languages, Pig and JAQL, and compared them with respect to two metrics. Another performance analysis of high-level query languages is presented in [22], where the authors analyzed the performance of three high-level query languages, Pig, Hive, and JAQL, based on processing time. However, not all existing MR-based HLQL are included in their work, and they used only one metric.

JAQL, Hive and Pig

Our work differs from previous studies in the literature by focusing on MR as the Big Data processing model and by adding the recent MapReduce-based HLQL Big SQL, a native SQL query language that translates its queries into native MR jobs. We compare Big SQL with the most widely used MapReduce-based HLQL in their recent stable versions: JAQL, Hive, and Pig. Accordingly, we use more than two metrics to evaluate the performance of the four MapReduce-based HLQL (JAQL, Hive, Pig, and Big SQL): increasing the input size, scaling out the number of nodes, and controlling the number of Reducers.
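
As an example of the third metric, the number of Reducers can be set directly in a native MR job through the standard Hadoop API, and through configuration statements in the HLQL themselves. The snippet below is a minimal sketch: the value 8 is arbitrary, the job is only configured (input/output paths and submission are omitted), and the Hive and Pig statements are shown as comments for comparison.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerControl {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer control");

        // Native MR: fix the number of reduce tasks for this job.
        job.setNumReduceTasks(8);

        // HiveQL:    SET mapreduce.job.reduces=8;
        // Pig Latin: SET default_parallel 8;
    }
}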

Besides the previous three metrics, we add another valuable one, language conciseness, which gives insight into programming aspects such as ease of programming (expressed as an average ratio compared with MR) and ease of configuration for linking with Hadoop, the execution environment. Finally, we offer a summary to help developers choose the MapReduce-based HLQL that best fulfills their needs and interests.
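
One simple way to make the conciseness metric concrete (the exact definition used in the study may differ) is to average, over a set of benchmark queries Q, the ratio between the size of the native MR program and the size of the equivalent HLQL script: conciseness(L) = (1/|Q|) * sum over q in Q of LOC_MR(q) / LOC_L(q). A larger value then means that the language expresses the same job in fewer lines of code than native MR.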

Article Source: Springer Open
