Hadoop Ecosystem & Components

Spark, Pig and Hive are three of the best-known Apache Hadoop projects. Each is used to create applications to process Hadoop data. While there are a lot of articles and discussions about whether Spark, Hive or Pig is better, in practice many Pharma Organizations do not use only a single one because each is optimized for specific functions.

Spark in Pharma Commercial Applications

Spark is both a programming model and a computing model. It provides a gateway to in-memory computing for Hadoop, which is a big reason for its popularity and wide adoption in Pharma. Spark provides an alternative to MapReduce that enables workloads to execute in memory, instead of on disk. By using in-memory computing, Spark workloads typically run between 10 and 100 times faster compared to disk execution- an important consideration for Pharma commercial requirements.

Spark can be used independently of Hadoop. However, it is used most commonly with Hadoop as an alternative to MapReduce for data processing. Spark can easily coexist with MapReduce and with other ecosystem components that perform other tasks. Spark is also popular in pharma because it supports SQL, which helps overcome a shortcoming in core Hadoop technology. The Spark programming environment works interactively with Scala, Python, and R shells. It has been used for data extract / transform / load (ETL) operations, stream processing, machine learning development and with the Apache GraphX API for graph computation and display. Spark can run on a variety of Hadoop and non-Hadoop clusters, including Amazon S3.

Hive in Pharma Commercial Applications

Hive is data warehousing software that addresses how data is structured and queried in distributed Hadoop clusters. Hive is also a very popular development environment used in Pharma to write queries for data in the Hadoop environment. It provides tools for ETL operations and brings some SQL-like capabilities to the environment. Hive is a declarative language that is used to develop applications for the Hadoop environment, however it does not support real-time queries.This is not a limitation since many of the applications in commercial Pharma require batch operations.

Hive has several components, including:

  • HCatalog : Helps data processing tools read and write data on the grid. It supports MapReduce and Pig.

  • WebHCat : Lets you use an HTTP / REST interface to run MapReduce, Yarn, Pig, and Hive jobs.

  • HiveQL : Hive’s query language intended as a way for SQL developers to easily work in Hadoop. It is similar to SQL, which is a big plus for Pharma applications and helps both structure and query data in distributed Hadoop clusters.

However, Hive does not allow row-level updates or support real-time queries and it is not intended for OLTP workloads as mentioned earlier. Many consider Hive to be much more effective for processing structured data than unstructured data.

Pig in Pharma Commercial Applications

Pig is a procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce and automatically generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster. Pig is popular because it automates some of the complexity in MapReduce development.

Pig is commonly used for complex use cases that require multiple data operations. It is more of a processing language than a query language. Pig helps develop applications that aggregate and sort data and supports multiple inputs and exports. It is highly customizable, because users can write their own functions using their preferred scripting language. Ruby, Python and even Java are all supported. Thus, Pig has been a popular option for developers that are familiar with those languages but not with MapReduce. However, SQL developers may find Hive easier to learn.