Hadoop Spark Training Chennai

High-level MapReduce query languages comparison

The four MapReduce-based HLQLs presented in this paper provide an abstraction layer over the MR programming model. We have chosen to compare JAQL, Big SQL, Hive and Pig with regard to their features, ease of programming and speed of programming. To bring out the differences between the MapReduce-based HLQLs, we implemented the WordCount benchmark in each language; WordCount is the canonical benchmark used in the literature on MR and MapReduce-based HLQLs [5, 22, 23, 24]. For each language we specifically examine its conciseness. This gives insight into each language's perspective and supports a comparison that, on the one hand, examines how expressive the MapReduce-based HLQLs are and, on the other hand, determines whether they pay a performance penalty for providing a more abstract language.
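To make the MR model behind the benchmark concrete, the following is a minimal, single-process Python sketch of WordCount written as explicit map, shuffle and reduce phases. It is an illustration only, not the Java MR implementation measured in this paper: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect all emitted values under their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the values collected for each word."""
    return {word: sum(values) for word, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases into a complete word count."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["to be or not to be"]))
```

Each HLQL discussed below expresses the same three phases declaratively, leaving the grouping and summation to the query compiler.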

Programming languages with MapReduce-based HLQL

The Hive query language (HiveQL) comprises a subset of SQL plus some useful extensions. Traditional SQL features such as sub-queries, group by and aggregations, various types of joins, union all and other functions make the language very SQL-like. HiveQL structures data using the familiar database notions of tables, columns and partitions. It supports all the important primitive types, such as doubles, integers and strings, as well as collection types such as lists, structs and maps. Hive includes a query compiler that is responsible for compiling HiveQL into a directed acyclic graph of MR jobs. The Hive word count query is given in Listing 1.1.

Listing 1.1. HiveQL Word Count query

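-- Load the raw text files into the single-column table myinput
load data inpath '/user/wcdirectory/' into table myinput;
-- Strip punctuation and control characters, lowercase, split into words, then count
create table wordcount as
select word, count(1) as count
from (select explode(split(lcase(regexp_replace(line, '[\\p{Punct},\\p{Cntrl}]', '')), ' ')) as word from myinput) words
group by word;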
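The nested expression in the `from` clause can be read inside-out. The Python sketch below mirrors those steps (strip punctuation and control characters, lowercase, split on a single space, flatten, then group and count) to show the data flow; it does not reproduce Hive's execution. It rests on stated assumptions: ASCII punctuation and the ASCII control range stand in for the punctuation and control-character classes used in the query, and the helper names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable, List

# Assumption: ASCII punctuation plus ASCII control characters approximate
# the punctuation and control-character classes used in the query.
_STRIP = re.compile("[" + re.escape(string.punctuation) + r"\x00-\x1f\x7f]")


def line_to_words(line: str) -> List[str]:
    """Mirror the regexp_replace and lcase calls, then split on one space."""
    cleaned = _STRIP.sub("", line).lower()
    return cleaned.split(" ")


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """Mirror explode(...) followed by `group by word` with count(1)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line_to_words(line))  # explode: one row per array element
    return dict(counts)
```

Because the split is on a single space, consecutive spaces yield empty-string tokens in this sketch, and those are counted like any other word unless they are filtered out.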

A principal characteristic of JAQL is its capability to read and write data from different storage types, mainly HDFS, the local file system, HTTP and HBase. JAQL is extensible with user-defined functions (UDFs), written either in one of many popular programming languages or in JAQL itself. Its values (arguments, variables, return values and named keys) are JSON objects or JAQL functions. The JAQL word count is given in Listing 1.2.

Listing 1.2. JAQL Word Count query

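// Read the input directory as lines of text
$InputData = read(lines("/user/Jaql/wcdirectory/"));
// Split each line into words, group identical words, count them and write the result
$InputData -> expand strSplit($, " ")
-> group by $w = ($) into [$w, count($)]
-> write(seq("/user/jaql/outputJaqlWC/JAQLOutput"));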

Big SQL requires tables to be created and populated with data. It supports a CREATE TABLE statement and a LOAD command for moving data. Although the fundamental syntax of a Big SQL query is similar to SQL, some aspects of Big SQL, namely table creation and data loading, will probably not be familiar. The LOAD command reads and moves data simply and directly from several relational DBMSs as well as from files stored locally or in HDFS. The Big SQL word count query is given in Listing 1.3.

Listing 1.3. Big SQL WordCount query

-- Create a single-column table backed by ';'-delimited text
create table "wctable" ("word" varchar(32768))
row format delimited fields terminated by ';'
lines terminated by '\n';
-- Load the local CSV file, replacing any existing rows
load hive data local inpath '/tmp/wctable.csv'
overwrite into table wordcount.wctable;
-- Count the occurrences of each word
select word, count(word) as wordcount from wctable group by word;

Pig, with its language Pig Latin, was developed around three key aspects [35]. The first is ease of programming: developers can create programs that are easy to write, understand and maintain. The second is optimization: tasks are encoded in a way that allows the system to optimize their execution automatically. The third is extensibility: developers can create their own functions as UDFs. The Pig Latin word count is shown in Listing 1.4.

Listing 1.4. Pig Latin Word Count query

-- Load each input line as a single chararray field
myinput = Load '/user/wcdirectory/' Using TextLoader() AS (line:CHARARRAY);
-- Normalize each line and emit one tuple per word
words = Foreach myinput Generate Flatten(TOKENIZE(REPLACE(LOWER(TRIM(line)), '[\\p{Punct},\\p{Cntrl}]', '')));
grpd = Group words By $0;
cntd = Foreach grpd Generate $0, COUNT($1);
-- Sort by count (descending), then by word (ascending)
unmix = Order cntd BY $1 Desc, $0 ASC;
Store unmix Into '/user/unmixPIG.dat' Using PigStorage();
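Unlike the other listings, the Pig Latin script also orders its output: by count descending, with ties broken by word ascending. The short Python sketch below illustrates the grouping, counting and two-key ordering on an in-memory list. The function name and sample words are hypothetical, and the sketch says nothing about how Pig executes the sort.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ordered_word_counts(words: Iterable[str]) -> List[Tuple[str, int]]:
    """Group and count words, then order by count DESC, word ASC."""
    counts = Counter(words)  # grouping and counting in one step
    # Negating the count gives descending order on the first key while
    # keeping ascending (alphabetical) order on the second key.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = ["pig", "hive", "pig", "jaql", "hive", "pig"]
    for word, count in ordered_word_counts(sample):
        print(word, count)
```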

Comparative analysis of MapReduce-based HLQL

This subsection examines the conciseness of the four MapReduce-based HLQLs and gives a general view of each language in terms of its average size ratio relative to MR. For each benchmark, we compare the conciseness of the four MapReduce-based HLQLs with MR using the source lines of code metric (Fig. 7). We further compare the four MapReduce-based HLQLs in terms of their computational power (Fig. 6).

Fig. 6

Computational power comparison

Figure 6 shows the computational power of JAQL, Big SQL, Hive, Pig, MR and the MR-Merge model [46]. The principal idea of MR-Merge is to bring relational operations into parallel data processing; it can be used to implement derived relational operators. Thus MR-Merge is relationally complete, whereas MR is not. Neither Pig Latin nor HiveQL provides the loop structures required for a Turing-complete language, although both can be extended with UDFs. JAQL not only provides recursive functions but can also be considered Turing complete. Big SQL can be extended with UDFs that are used in the select list of a query or in a where clause to filter data.
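The idea of adding a merge step to MR can be sketched in a few lines of Python. The example below is a simplified, single-process illustration in which two datasets are each map-reduced by key and the two reduced outputs are then merged with an inner equi-join. The dataset contents, the function names and the choice of an inner join as the merge operation are all assumptions made for illustration; this is not the MR-Merge implementation cited in [46].

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, Any]


def map_reduce(records: Iterable[Any],
               mapper: Callable[[Any], Iterable[Pair]],
               reducer: Callable[[str, List[Any]], Any]) -> Dict[str, Any]:
    """Run one map-reduce lineage: map, group by key, reduce per key."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def merge(left: Dict[str, Any], right: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Merge step: inner equi-join of two reduced outputs on their keys."""
    return [(key, left[key], right[key]) for key in sorted(left.keys() & right.keys())]


# Hypothetical inputs: (department, salary) rows and (department, city) rows.
salaries = [("sales", 100), ("sales", 150), ("ops", 90)]
offices = [("sales", "Paris"), ("hr", "Rome")]

total_salary = map_reduce(salaries, lambda r: [(r[0], r[1])], lambda k, vs: sum(vs))
office_city = map_reduce(offices, lambda r: [(r[0], r[1])], lambda k, vs: vs[0])

# Only "sales" appears in both reduced outputs, so it is the only joined row.
joined = merge(total_salary, office_city)
```

Plain MR has no separate step for combining two reduced datasets, so a join has to be encoded inside the map and reduce functions themselves.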

Figure 7 compares the source lines of code of the four MapReduce-based HLQLs with a direct Java MR implementation for the WordCount, Join and web log processing benchmarks, highlighting the conciseness that the abstract nature of these languages provides.

Fig. 7

Source lines of code comparison of the four MapReduce-based HLQL

Programs in all four MapReduce-based HLQL (JAQL, Big SQL, Hive and Pig) are shorter than the equivalent Java MR:

  • WordCount: MR is 39 lines and all MapReduce-based HLQL are smaller than 6 lines.

  • Join: MR is 95 lines and all MapReduce-based HLQL are smaller than 8 lines.

  • Web log processing: MR is 140 lines and all MapReduce-based HLQL are smaller than 6 lines.
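Because the list gives upper bounds for the HLQL programs rather than exact counts, the size ratio against Java MR can only be bounded from below. The short Python calculation below derives those lower bounds from the figures quoted in the list; the variable and function names are hypothetical.

```python
# Lines of code quoted in the text:
# (Java MR lines, strict upper bound on the lines of each HLQL program).
sloc = {
    "WordCount": (39, 6),
    "Join": (95, 8),
    "Web log processing": (140, 6),
}


def min_ratio(mr_lines: int, hlql_upper_bound: int) -> float:
    """Lower bound on the MR/HLQL size ratio when the HLQL program
    has fewer lines than the given bound."""
    return mr_lines / hlql_upper_bound


ratios = {name: round(min_ratio(mr, bound), 1) for name, (mr, bound) in sloc.items()}

for name, ratio in ratios.items():
    print(f"{name}: Java MR is more than {ratio}x longer")
```

The bound grows from WordCount to web log processing, which suggests that the more involved the benchmark, the more the abstraction saves.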

Writing MR programs by hand costs developers more time, both in coding and in debugging large applications, and yields programs that are an order of magnitude larger. Figure 7 highlights the abstract nature of these four languages: for each benchmark, their source lines of code are far fewer than in the equivalent Java MR program. Pig and Hive can control the number of reducer tasks from within the language syntax, so this parameter can be tuned to improve runtime performance. JAQL proved to be the most computationally powerful language, being Turing complete. Moreover, Big SQL is the most concise of all the MapReduce-based HLQLs and MR.
