Bigdata Training Center in Chennai

Background and languages

This paper presents four MapReduce-based high-level query languages (HLQL) that have emerged from the MR programming model. They provide abstractions on top of the Hadoop framework that reduce the low-level effort required for typical tasks: developers write programs or queries using these abstractions, which are then compiled into native MR jobs. The languages provide several operators and also let developers write their own functions for reading, writing, and processing data. MapReduce-based HLQL are easy to script, modify, and understand. Their relationship with Hadoop is shown in Fig. 1.
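To make the phrase "native MR jobs" more concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases. It is plain Python rather than code against the Hadoop API, and the names `map_words`, `reduce_counts`, and `run_mapreduce` are illustrative assumptions, not part of Hadoop or of any of the four languages.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all counts collected for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver: map, shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = ["big data on hadoop", "hadoop runs mr jobs", "big jobs"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

Even for a word count, the developer has to think in terms of keys, values, and grouping. This is the low-level effort that the high-level languages are meant to hide.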

Fig. 1

MapReduce-based HLQL architecture and their position on top of MR

Hive: a data warehousing over Hadoop

Hive is a data warehouse infrastructure built on top of the Hadoop framework and developed by Facebook [20]. It provides a SQL-like declarative query language, the Hive query language (HiveQL), for querying data stored in Hadoop [42]. Hive queries are compiled and translated into MR jobs that are executed on Hadoop. Having SQL-like features, HiveQL provides several functions and operations such as group by, joins, and aggregation. In other words, it enables easy data summarization, ad hoc querying, and analysis of large volumes of data.
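As a rough illustration of how a declarative aggregation could map onto the MR model, the sketch below evaluates the equivalent of `SELECT dept, AVG(salary) FROM employees GROUP BY dept` with an explicit map step and reduce step. It is plain Python, not the output of Hive's compiler; the table, the column names, and the functions are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of an `employees` table.
EMPLOYEES = [
    {"name": "asha", "dept": "eng", "salary": 100},
    {"name": "ravi", "dept": "eng", "salary": 80},
    {"name": "mala", "dept": "ops", "salary": 60},
]


def map_row(row):
    """Map: emit the GROUP BY column as key and the aggregated column as value."""
    return row["dept"], row["salary"]


def reduce_avg(dept, salaries):
    """Reduce: compute AVG(salary) for one group."""
    return dept, sum(salaries) / len(salaries)


def group_by_avg(rows):
    """Sort mapped pairs by key (standing in for the shuffle), then reduce each group."""
    pairs = sorted((map_row(r) for r in rows), key=itemgetter(0))
    result = {}
    for dept, group in groupby(pairs, key=itemgetter(0)):
        key, avg = reduce_avg(dept, [salary for _, salary in group])
        result[key] = avg
    return result
```

The GROUP BY column becomes the shuffle key and the aggregate becomes the reduce function, which is the general shape a SQL-like query takes once expressed as an MR job.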

Life Cycle of HiveQL

The Hive architecture presented in Fig. 2 is composed of four main components. The first is the external interface, which consists of several sub-components: the command line interface (CLI), the web user interface, and the application programming interface (API), exposed as either JDBC or ODBC [20]. The second is the driver, which receives the queries, creates a session handle, and manages the life cycle of HiveQL statements during compilation and execution [23]. The third is the compiler, invoked by the driver upon receiving HiveQL queries; it translates those statements into an execution plan. The fourth is the metastore, the system catalog for Hive, against which the relational schema or query is validated. All other components of Hive interact with the metastore [20].
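The interaction between the driver, the compiler, and the metastore can be sketched as a toy pipeline. Everything here is a simplified assumption for illustration: the class names, the dictionary-based catalog, and the string plan steps do not correspond to Hive's real classes or plan format.

```python
import itertools


class Metastore:
    """System catalog: maps table names to their column lists."""

    def __init__(self, tables):
        self.tables = tables

    def validate(self, table, columns):
        if table not in self.tables:
            raise ValueError(f"unknown table: {table}")
        missing = [c for c in columns if c not in self.tables[table]]
        if missing:
            raise ValueError(f"unknown columns: {missing}")


class Compiler:
    """Turns a (very restricted) parsed query into an ordered execution plan."""

    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, table, columns):
        self.metastore.validate(table, columns)  # schema check via the catalog
        return [f"scan {table}", f"project {', '.join(columns)}"]


class Driver:
    """Receives queries, opens a session handle, and invokes the compiler."""

    _ids = itertools.count(1)  # hypothetical session-handle generator

    def __init__(self, compiler):
        self.compiler = compiler

    def submit(self, table, columns):
        session = next(self._ids)
        plan = self.compiler.compile(table, columns)
        return session, plan
```

A call such as `Driver(Compiler(Metastore({"logs": ["ts", "url"]}))).submit("logs", ["url"])` returns a session handle together with a two-step plan, while a query against an unknown table fails at the metastore check before any plan is produced.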

Fig. 2

Hive system architecture

JAQL: a JSON query language to MR

JAQL [18] is a functional scripting language with a data model based on JavaScript Object Notation (JSON) [43]. JAQL facilitates parallel processing by translating high-level queries into low-level ones [44]. It is a dataflow language that manipulates semi-structured data in the form of JSON values, and it exploits massive parallelism transparently through the Hadoop framework. It started as an open-source project hosted on Google Code, and IBM adopted the later releases as a primary data processing language [5, 18].

JAQL Transforms

JAQL is extensible, with integration points for various data sources such as local and distributed file systems, NoSQL stores (HBase), and relational databases. Compared with developing dataflows directly on the MR framework, JAQL exhibits advantages similar to those found in relational database systems, providing a proven abstraction for data management systems. At run time, JAQL transforms the parsed statement into an equivalent optimized statement. The optimized query can be transformed back into JAQL code, which is useful for debugging. Moreover, JAQL provides SQL support along with native JAQL functions that can parameterize and package any JAQL logic. Furthermore, JAQL can be extended with user-defined functions (UDFs) written in JAQL itself. Although it can be used as a general-purpose programming language, JAQL is also the foundation for Big SQL, presented in the following subsection. Fig. 3 illustrates the system architecture of JAQL [18].
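The dataflow style described here, in which JSON values pass through a chain of operators, can be approximated in Python. This is neither JAQL syntax nor its API; the `pipe`, `filter_by`, and `transform` helpers and the sample records are assumptions made for illustration.

```python
import json
from functools import reduce

# Hypothetical semi-structured input: the last record has no "age" field.
RAW = '[{"name": "asha", "age": 34}, {"name": "ravi", "age": 25}, {"name": "mala"}]'


def filter_by(predicate):
    """Build a stage that keeps only the records satisfying `predicate`."""
    return lambda records: [r for r in records if predicate(r)]


def transform(func):
    """Build a stage that rewrites every record with `func`."""
    return lambda records: [func(r) for r in records]


def pipe(records, *stages):
    """Feed `records` through each stage in order, like a chain of operators."""
    return reduce(lambda data, stage: stage(data), stages, records)


def adults(raw_json):
    """Example flow: parse JSON, keep records with age >= 30, project the name."""
    return pipe(
        json.loads(raw_json),
        filter_by(lambda r: r.get("age", 0) >= 30),  # tolerate missing fields
        transform(lambda r: {"name": r["name"].upper()}),
    )
```

Because each stage is an ordinary function, user-defined logic can be packaged and reused in the same way as the built-in stages, which mirrors the extensibility the paragraph attributes to JAQL.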

Fig. 3

The architecture system of JAQL.

Big SQL: native SQL access on the top of MR

Hive provides HiveQL [20], a language that is SQL-like but not SQL. It has limitations as a query interface for Hadoop. At the level of data types, it lacks varchar, decimal, and others. Its join support is limited, and, beyond the limitations of the JDBC/ODBC drivers, all queries are executed as MR jobs. These limitations motivate native SQL access for Hadoop, namely Big SQL.

Big SQL [19, 45] is a MapReduce-based HLQL designed to provide native SQL for querying data managed by Hadoop, developed mainly by IBM [19]. Big SQL provides massively parallel SQL processing that can be deployed directly on the Hadoop Distributed File System (HDFS) [30]. It uses low-latency parallel execution that accesses Hadoop data natively for reading and writing. Big SQL runs on top of Hadoop and supports queries expressed in a native declarative SQL language, ANSI SQL, which are compiled into native MR jobs.


Figure 4 illustrates the architecture of Big SQL and how it fits into the Hadoop ecosystem. It supports JDBC/ODBC driver access from Linux and Windows platforms. Big SQL uses the Hive metastore (HCatalog) for data access and the Hive storage engines to read and write data. In addition, it can create virtual tables whose data is synthesized via JAQL scripts. Big SQL also provides its own HBase storage handler and can execute all queries in parallel through the MR processing model. Through its ANSI SQL language, Big SQL provides direct access for low-latency queries. The Big SQL engine supports joins, unions, grouping, common table expressions, and other familiar SQL expressions.
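The kind of query listed here (a common table expression combined with a join and grouping) can be tried with any standard SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in. It does not connect to Big SQL, and the `customers` and `orders` tables are invented for the example.

```python
import sqlite3

QUERY = """
WITH big_orders AS (              -- common table expression
    SELECT customer_id, amount FROM orders WHERE amount >= 100
)
SELECT c.name, SUM(b.amount) AS total
FROM customers AS c
JOIN big_orders AS b ON b.customer_id = c.id   -- join
GROUP BY c.name                                -- grouping
ORDER BY c.name
"""


def run_demo():
    """Create two small in-memory tables and run the query against them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'asha'), (2, 'ravi');
        INSERT INTO orders VALUES (1, 150), (1, 200), (2, 50), (2, 120);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

With a JDBC/ODBC connection in place of the in-memory database, the same query text is what a client of a native SQL layer would be expected to submit unchanged.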

Fig. 4

Big SQL architecture

Pig: a high-level data flow language for Hadoop

Pig [21] is a high-level data flow language for data transformation, which is used to analyze massive datasets and represent them as data flows. The language used for expressing these data flows is Pig Latin. This language is an abstraction of the MR programming model, which makes it an HLQL built on top of Hadoop. It includes many traditional data operations (sort, join, filter, etc.), as well as the ability for programmers to develop their own functions for accessing, processing, and analyzing data [23]. Pig provides an engine for executing data flows in parallel using the Hadoop framework. The architecture of Pig is shown in Fig. 5. Pig Latin scripts are first handled by the parser, which checks the syntax and instance of the script. The output of the parser is a logical plan, a collection of vertices in which each vertex executes a fragment of the script. This plan also represents the Pig Latin statements as logical operators. The logical plan is then compiled by the MR compiler, which submits the Pig Latin scripts as native MR jobs to the Hadoop job manager for execution.
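The parse-then-compile flow can be illustrated with a toy logical plan. The sketch assumes a made-up operator set (`load`, `filter`, `group`) and a simplified rule that places the map/reduce boundary at the first `group`. It is not Pig's real planner, and the input is not Pig Latin grammar.

```python
from dataclasses import dataclass


@dataclass
class Vertex:
    """One node of the logical plan: an operator plus its argument."""
    op: str
    arg: str


def parse(script):
    """Toy parser: one `op arg` statement per line, with a basic syntax check."""
    plan = []
    for line in script.strip().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2 or parts[0] not in {"load", "filter", "group"}:
            raise SyntaxError(f"cannot parse: {line!r}")
        plan.append(Vertex(parts[0], parts[1]))
    return plan


def compile_to_stages(plan):
    """Toy compiler: operators before a `group` run map-side; the `group`
    needs a shuffle, so it and everything after it run reduce-side."""
    stages = {"map": [], "reduce": []}
    side = "map"
    for vertex in plan:
        if vertex.op == "group":
            side = "reduce"
        stages[side].append(vertex.op)
    return stages
```

For the three-line script `load visits`, `filter time > 10`, `group url`, the parser yields three vertices and the compiler places `load` and `filter` in the map stage and `group` in the reduce stage.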

Article source: Springer Open


