RDDs vs DataFrames in Apache Spark

Apache Spark:

Apache Spark is a general-purpose, fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R, along with an engine for running Spark applications. Spark is commonly cited as running some workloads up to 100 times faster than Hadoop MapReduce when the data is held in memory, and up to 10 times faster when the data is read from disk.

Necessity of Apache Spark:

The industry needed a general-purpose cluster computing tool, because each of the existing tools handled only one kind of workload:

MapReduce (limited to batch processing).

Storm (limited to stream processing).

Impala (limited to interactive processing).

Neo4j (limited to graph processing).

Each of these tools handles a single type of processing. Apache Spark, by contrast, provides real-time stream processing, interactive processing, graph processing, in-memory processing and batch processing, with high speed, ease of use and a standard interface.

Components of Apache Spark:

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

RDDs – Resilient Distributed Datasets:

An RDD is the fundamental unit of data in Spark: a distributed collection of elements spread across the cluster nodes, on which operations can run in parallel.

RDDs are immutable, but a new RDD can be generated by transforming an existing RDD.
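The immutability point can be sketched in PySpark as follows. This is only a minimal illustration, assuming PySpark is installed and Spark runs in local mode; the helper names double_all and main and the application name are hypothetical and not part of the original article.

```python
def double_all(sc, values):
    """Build an RDD from `values` and derive a second RDD from it.

    `sc` is expected to be a pyspark.SparkContext.
    """
    original = sc.parallelize(values)
    # map() is a transformation: it does not modify `original`,
    # it returns a new RDD that describes the doubled elements.
    doubled = original.map(lambda x: x * 2)
    return original, doubled


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext

    # "local[*]" runs Spark on this machine using all available cores.
    sc = SparkContext("local[*]", "rdd-immutability-demo")
    try:
        original, doubled = double_all(sc, [1, 2, 3])
        print(original.collect())  # the source RDD still holds its own elements
        print(doubled.collect())   # the derived RDD holds the doubled elements
    finally:
        sc.stop()
```

Calling main() on a machine with PySpark available prints both collections, which shows that the transformation left the first RDD unchanged.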

There are two ways to create RDDs:

Parallelized Collections:

An RDD is created by invoking the parallelize method on an existing collection in the driver program.

External Datasets:

An RDD can also be created by calling the textFile method. This method takes the URI of a file and reads it as a collection of lines.
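The two creation routes described above might look like this in PySpark. This is a hedged sketch rather than a definitive recipe: the helper names create_rdds and main, the application name and the file path are placeholders chosen for illustration.

```python
def create_rdds(sc, values, path):
    """Create one RDD from an in-memory collection and one from a text file.

    `sc` is expected to be a pyspark.SparkContext.
    """
    # Parallelized collection: distribute a local Python list.
    from_collection = sc.parallelize(values)
    # External dataset: each line of the file becomes one element.
    from_file = sc.textFile(path)
    return from_collection, from_file


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-creation-demo")
    try:
        # "data.txt" is a placeholder path; point it at a real file or URI.
        numbers, lines = create_rdds(sc, [1, 2, 3, 4, 5], "data.txt")
        print(numbers.count())  # number of elements in the collection
        print(lines.first())    # first line of the file
    finally:
        sc.stop()
```

Both calls are lazy in the sense that the file is only read when an action such as count() or first() runs, so a wrong path surfaces at that point rather than when textFile is called.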


