R Integration with Hadoop

What is Hadoop?

  • Hadoop is developed and maintained by the Apache Software Foundation (ASF).
  • It is an open-source framework.
  • Hadoop is designed to store and process very large volumes of data across a cluster of machines.
  • It is written mainly in the Java programming language.
  • Hadoop is used for batch (offline) processing and not for Online Analytical Processing (OLAP).
  • Hadoop scales out: to increase capacity, nodes are added to the cluster.
  • Example: Facebook, Google, Twitter, Yahoo, and LinkedIn make use of Hadoop.

The purpose behind R and Hadoop integration:

  • Integrating R with Hadoop combines R's data analytics and visualization features with Hadoop's distributed storage and processing.
  • By integrating R with Hadoop, Hadoop can be used to execute R code.
  • By integrating R with Hadoop, R can be used to access the data stored in Hadoop.

Methods of R and Hadoop Integration:

R and Hadoop can be integrated using any of the following four methods:

  • RHadoop
  • Hadoop Streaming
  • RHIPE
  • ORCH

RHadoop:

RHadoop is a collection of R packages that includes the rmr package, the rhbase package, and the rhdfs package.

  • The rmr package:
    • Provides MapReduce functionality.
    • Allows map and reduce functions to be written and executed in R.
  • The rhbase package:
    • Integrates R with HBase.
    • Provides database management capabilities for HBase from within R.
  • The rhdfs package:
    • Integrates R with HDFS.
    • Provides file management capabilities for HDFS from within R.
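
The MapReduce model mentioned for the rmr package can be illustrated with a small in-memory sketch. It is written in Python purely for illustration and uses neither rmr nor Hadoop; the names `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical and do not belong to any RHadoop package.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map, shuffle, and reduce pass over an in-memory list of records.

    Illustration only: a real cluster spreads these phases across many nodes.
    """
    grouped = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: values that share a key are collected together.
            grouped[key].append(value)
    # Reduce phase: each key's values are collapsed into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Hypothetical reducer: add up all counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["R and Hadoop", "Hadoop stores data", "R analyses data"]
    print(map_reduce(lines, word_mapper, sum_reducer))
```

In an R and Hadoop setup the mapper and reducer would be R functions and the grouping step would be carried out by the cluster rather than by a local dictionary.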

Hadoop Streaming:

  • Hadoop Streaming is a utility that allows users to create and run jobs.
  • Jobs can be created and run with any executable or script as the mapper and/or the reducer.
  • The mapper and the reducer read their input from standard input and write their output to standard output.
  • Working Hadoop jobs can therefore be built with very little Java knowledge: two scripts, a mapper and a reducer, working in tandem are enough.
  • Because those scripts can be written in R, Hadoop Streaming is a practical way to combine R and Hadoop for statistical work on large data sets.
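
A minimal sketch of a mapper and reducer pair of the kind Hadoop Streaming can run is shown here as a word count. It is written in Python for illustration; an R script reading standard input and writing standard output would play the same role. The script layout, the `map`/`reduce` command-line argument, and the function names `run_mapper` and `run_reducer` are assumptions made for this example. It relies on the Hadoop Streaming convention of tab-separated key and value on each line, with the reducer's input arriving sorted by key.

```python
import sys
from itertools import groupby


def run_mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def run_reducer(sorted_lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()  # skip blank lines
    )
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical dispatch: 'map' or 'reduce' given as the first argument
# selects which role this script plays in the job.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = run_mapper if sys.argv[1] == "map" else run_reducer
    for out_line in step(sys.stdin):
        print(out_line)
```

Saved under a hypothetical name such as `wordcount.py`, the pair can be checked locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `sort` stands in for Hadoop's shuffle step. On a cluster, the same commands would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options together with `-input` and `-output`; the location of that jar depends on the installation.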

RHIPE:

  • RHIPE stands for R and Hadoop Integrated Programming Environment.
  • RHIPE was created to support the Divide and Recombine (D&R) approach to analysing large data sets.
  • RHIPE facilitates efficient analysis of large amounts of data.
  • Data sets in RHIPE can also be read using other programming languages, such as Python, Perl, or Java.
  • RHIPE provides several functions for interacting with HDFS.
  • The data created by RHIPE MapReduce jobs can thus be read and saved in full.
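
The Divide and Recombine idea named above can be sketched in a few lines: split the data into subsets, apply the same analysis to each subset independently, then combine the per-subset results. The example is plain Python for illustration and does not use RHIPE or Hadoop; `divide`, `analyse`, and `recombine` are hypothetical names, and the mean is chosen only because its recombination step is easy to verify.

```python
def divide(values, subset_size):
    """Split the data into consecutive subsets of at most subset_size items."""
    return [values[i:i + subset_size] for i in range(0, len(values), subset_size)]


def analyse(subset):
    """Per-subset analysis: return the subset's size and its mean."""
    return len(subset), sum(subset) / len(subset)


def recombine(results):
    """Combine per-subset means into one overall mean, weighted by subset size."""
    total = sum(size for size, _ in results)
    return sum(size * mean for size, mean in results) / total


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    subsets = divide(data, 3)
    # On a cluster, each subset would be analysed on a different node.
    results = [analyse(subset) for subset in subsets]
    print(recombine(results))
```

Weighting by subset size makes the recombined value equal to the mean of the full data set even when the last subset is smaller than the others.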

ORCH:

  • ORCH (Oracle R Connector for Hadoop) is used mainly to work with Big Data on the Oracle appliance.
  • However, it can also be used on a non-Oracle framework such as Hadoop.
  • With ORCH, the Hadoop cluster can be accessed from R.
  • It also supports writing the mapping and reducing functions in R.
  • The data stored in the Hadoop Distributed File System (HDFS) can also be manipulated with ORCH.

 
