Big Data Certification Course in Chennai


MapReduce-based HLQL

An important characteristic of all four MapReduce-based HLQL presented in this paper (JAQL, Big SQL, Pig and Hive) is that they are not limited to the core functionality of each language: all of them can be extended through user-defined functions (UDFs). UDFs allow developers to provide custom data formats and functions. Moreover, MapReduce-based HLQL are important components of the Hadoop environment because they work on top of MR as the Big Data processing model. Further, MapReduce-based HLQL provide the interface for processing large datasets stored in HDFS.
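As a rough illustration of extending an HLQL with custom code, the sketch below shows a script that could be plugged into Hive through its streaming TRANSFORM clause, which passes rows to an external program as tab-separated lines. This is only one extension route and not necessarily the UDF mechanism used in the paper. The column layout (user_id, url), the script name extract_host.py and the table name visits are assumptions made up for the example.

```python
from urllib.parse import urlparse


def transform_line(line: str) -> str:
    """Turn one tab-separated input row (user_id, url) into (user_id, host)."""
    user_id, url = line.rstrip("\n").split("\t")
    # urlparse returns an empty netloc when the value is not an absolute URL.
    host = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{host}"


def run(stream_in, stream_out) -> None:
    """Apply transform_line to every non-empty row of stream_in."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")


# When deployed as a streaming script, the file would end with:
#     import sys; run(sys.stdin, sys.stdout)
# and could be invoked from HiveQL roughly as follows
# (script, table and column names are hypothetical):
#     ADD FILE extract_host.py;
#     SELECT TRANSFORM(user_id, url) USING 'python3 extract_host.py'
#            AS (user_id, host) FROM visits;
```

Keeping the row logic in transform_line, separate from the I/O loop, makes the custom function easy to test outside Hadoop before it is attached to a query.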

Table 1 lists the important functions, comments and different perspectives of the four MapReduce-based HLQL presented in this paper.

In terms of running time, JAQL is a weak high-level scripting language. It is Turing complete on its own, whereas the other MapReduce-based HLQL are not unless they are extended with UDFs (Fig. 6). Based on the results achieved in this paper, JAQL is not competitive with the runtime performance of Big SQL, Hive and Pig, as shown in the “Increasing input size metric” and “Scale-out number of nodes” sections. Its performance also changes little when nodes are added, as shown in Fig. 12. JAQL shows query runtime performance similar to that of Hive, but the flexibility is still needed.

Furthermore, JAQL is designed to let developers use the MR framework directly when it is needed, combining such low-level characteristics with declarative data flows.


Hadoop Framework

Big SQL is the most concise language, with an average source-line ratio to MR Java of just 4.9%, as shown in Table 1. It represents the new generation of SQL on the Hadoop framework. In fact, for the previously mentioned constructs, it is exactly native SQL. The Big SQL engine supports joins, grouping, unions, common table expressions, and other familiar SQL expressions. The Big SQL LOAD command reads and moves data directly from several relational DBMSs as well as from files stored locally or in HDFS. Big SQL can use Hadoop’s MR framework to process query tasks in parallel, or execute queries locally within the Big SQL server, whichever is most appropriate for the query. Big SQL poses a challenge for comparative studies like this report: these languages are comparatively new, and they are moving targets when trying to build fair and relevant performance and feature comparisons.
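To make the conciseness metric concrete, here is a minimal sketch of how a "mean ratio with MR Java" could be computed. It assumes the ratio is taken per benchmark (HLQL script lines divided by MR Java program lines) and then averaged, which the text does not spell out. The line counts are invented for illustration and are not the paper's data.

```python
from statistics import mean

# Hypothetical source-line counts per benchmark: (HLQL script, MR Java program).
# These numbers are made up for illustration only.
SLOC = {
    "wordcount": (4, 80),
    "join": (6, 120),
}


def mean_sloc_ratio(sloc):
    """Mean of per-benchmark ratios (HLQL lines / MR Java lines), in percent."""
    return 100 * mean(hlql / mr_java for hlql, mr_java in sloc.values())


print(f"{mean_sloc_ratio(SLOC):.1f}%")
```

A lower percentage means that the same task takes fewer lines in the HLQL than in hand-written MR Java.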


Table 1. Comparative study between Big SQL, Pig, Hive and JAQL

| Characteristic | MR | JAQL | Big SQL | Hive | Pig |
| --- | --- | --- | --- | --- | --- |
| Description | A programming model for parallel processing and generating large data sets | A data-flow processing and querying language | An HLQL designed to provide native SQL access for Hadoop | A data warehouse infrastructure for Hadoop | A high-level data-flow interface for Hadoop |
| Language name | MapReduce | Jaql | ANSI SQL | HiveQL | Pig Latin |
| Developed by | Google | IBM | IBM | Facebook | Yahoo |
| Type of language | Data processing paradigm | Data flow | SQL | SQL-like (presenting a declarative language) | Data flow |
| Evaluation | At runtime | At runtime | At runtime | During compilation | During compilation |
| Supported data | Structured and unstructured data (for structured data, MR may not be as efficient as Big SQL, Hive and Pig) | JSON and semi-structured | Mostly structured | Mostly structured | Complex |
| Process category | Batch processing | Dataflow for JSON/batch | Dataflow system OLAP/batch | Data warehouse OLAP/batch | Dataflow/batch |
| User-defined functions | Extendable | Extendable | Extendable | Extendable | Extendable |
| Schema optional? | Without schema | Yes | No, mandatory | No, mandatory | Yes |
| Relational complete? | No | No | Yes | Yes | Yes |
| Turing complete? | Yes | Yes | Yes, when extended with UDFs | Yes, when extended with UDFs | Yes, when extended with UDFs |
| Source lines of code (mean ratio with MR Java) | | 7.1% | 4.9% | 25.4% | 21.1% |
| Join operation | Difficult (it is quite hard to perform a join operation between data sets, and very hard with multiple data sources) | Simple | Simple | Simple | Simple |
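The "Join operation" row can be illustrated with a small sketch of a reduce-side join written in the MR style. It runs in plain Python, with no Hadoop involved, and the two datasets (users and orders) and their field layouts are assumptions chosen for the example. It shows the bookkeeping a developer has to do by hand: tagging records by source, grouping them by key, and pairing them in the reducer.

```python
from collections import defaultdict
from itertools import product


def map_users(record):
    """Emit (join key, tagged value) for a (user_id, name) record."""
    user_id, name = record
    yield user_id, ("U", name)


def map_orders(record):
    """Emit (join key, tagged value) for an (order_id, user_id, amount) record."""
    order_id, user_id, amount = record
    yield user_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Pair every user value with every order value that shares the key."""
    names = [v for tag, v in values if tag == "U"]
    orders = [v for tag, v in values if tag == "O"]
    for name, (order_id, amount) in product(names, orders):
        yield key, name, order_id, amount


def reduce_side_join(users, orders):
    """Run map, shuffle and reduce in-process and return the inner join rows."""
    pairs = []
    for user in users:
        pairs.extend(map_users(user))
    for order in orders:
        pairs.extend(map_orders(order))
    rows = []
    for key, values in sorted(shuffle(pairs).items()):
        rows.extend(reduce_join(key, values))
    return rows
```

In a SQL-style HLQL the same inner join is a single statement along the lines of SELECT ... FROM users JOIN orders ON users.user_id = orders.user_id, which is the contrast the table draws between "Difficult" and "Simple".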

Pig is a high-level data-flow language for Hadoop that lets developers write MR operations in its scripting language, Pig Latin.


High-level Dataflow

Pig has a concise language, with an average source-line ratio to MR Java of 21.1%, while Hive has 25.4%, as shown in Table 1. Figure 10 shows that Pig and JAQL have almost the same runtime performance when scaling input size for the Join benchmark. In the “Scale-out number of nodes” section, the results show that Pig benefits from added nodes, with a 66.3% decrease in runtime (Fig. 12). Hive shows very little change, as seen in the WordCount benchmark presented in Fig. 11. Beyond these observations from the Join benchmark illustrated in Fig. 12, Hive is not designed for online transaction processing and offers neither real-time queries nor row-level updates. It is best suited to batch jobs that process large sets of immutable data, such as web log processing.
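For readers unfamiliar with what a WordCount job looks like at the MR level, the following is a minimal in-process simulation of the map, shuffle and reduce phases with a configurable number of reducers. It is a sketch in plain Python, not the benchmark code from the paper. The num_reducers parameter is a stand-in for the job's reducer setting, and crc32 is used as the partitioning hash only because it is stable across runs.

```python
import zlib
from collections import Counter, defaultdict


def map_words(line):
    """Map phase: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def partition(word, num_reducers):
    """Assign a key to a reducer using a stable hash of the word."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def word_count(lines, num_reducers=2):
    """Simulate map, shuffle and reduce; return one Counter per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for word, one in map_words(line):
            buckets[partition(word, num_reducers)][word].append(one)
    # Reduce phase: each reducer sums the values of the keys it owns.
    return [Counter({w: sum(v) for w, v in b.items()}) for b in buckets]


def merged(per_reducer):
    """Combine the per-reducer outputs into a single result for inspection."""
    total = Counter()
    for counts in per_reducer:
        total.update(counts)
    return total
```

Changing num_reducers changes how the keys are spread across reducers but not the merged counts, which is why the number of reducers can be tuned as a performance parameter without affecting results.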


Conclusion

This paper is a comparative study of four MapReduce-based HLQL built on top of the MR processing model. The languages under investigation, JAQL, Big SQL, Hive and Pig, are designed to translate their queries into native MR jobs and to provide more abstract query facilities than low-level MR. The baseline numerical metrics reported in this paper are: increasing input size, scaling out the number of nodes, and controlling the number of reducers.

Moreover, the conciseness of each MapReduce-based HLQL gives insight from a programming-language perspective, such as ease of programming, configuration to link with Hadoop, and execution environment. Conciseness provides a basis for comparing how expressive the MapReduce-based HLQL are, in order to determine whether or not they pay a performance penalty for providing more abstract languages.

In fact, Big SQL has proved to be the best solution to the problem of integrating native SQL query processing on top of MR, whilst the limitation of Pig Latin lies in its expressiveness. In most benchmarks, Pig and JAQL showed almost the same performance when increasing input size. Even though Hive has the highest source-lines-of-code ratio of the four languages, it provides performance closest to Big SQL.


MapReduce

Finally, this report also highlights the concise nature of the presented MapReduce-based HLQL. These languages provide an abstraction layer that takes the burden off developers. The paper also provides a summary comparison to help developers choose the MapReduce-based HLQL that fulfils their needs and interests.

Article source: Springer Open
