Amazon’s big data ecosystem has many applications in pharma commercial and scientific environments. Some of the services that we actively use are:
S3 (Simple Storage Service) is an object store that plays a role similar to Hadoop’s HDFS. A big difference is that S3 does not require installing Hadoop or defining namenodes, datanodes, etc. It is quite simple: the only requirement is creating a bucket with a name, and there is no need to specify a size because capacity grows with the data. S3 has its own API for storing, modifying, and reading files. What’s more, MapReduce jobs can work directly on the data in S3.
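As a rough illustration of the S3 API described above, the sketch below stores an object and reads it back with boto3, the AWS SDK for Python. The bucket name, key layout, and helper names are hypothetical and not taken from our setup; the client is passed in so the functions make no assumption about how credentials are configured.

```python
def object_key(dataset: str, patient_id: str, filename: str) -> str:
    """Build an S3 object key; this layout is an illustrative assumption."""
    return f"{dataset}/patient={patient_id}/{filename}"


def store_and_read_back(s3_client, bucket: str, key: str, payload: bytes) -> bytes:
    """Write payload to S3, then fetch it again through the same client."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Usage (needs boto3 installed and AWS credentials configured):
#   import boto3
#   s3 = boto3.client("s3")
#   key = object_key("claims", "P001", "2020-01.json")
#   data = store_and_read_back(s3, "example-rwe-bucket", key, b'{"amount": 120}')
```

Note that nothing in the calls declares a capacity: objects are simply written under a key inside an existing bucket.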
EMR stands for Elastic MapReduce. It can read and write data from and to S3, or any of the other storage options that Amazon provides, e.g. DynamoDB. It runs on top of Amazon EC2 on demand, which saves money since there isn’t always a need for a “continuously on” cluster. An often-quoted example is The New York Times, which converted its archive of TIFF images to PDF in less than 24 hours by running Hadoop on EC2 against data in S3, the same on-demand pattern that EMR offers as a managed service.
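To make the on-demand idea concrete, here is a minimal sketch of the request for a transient EMR cluster that runs one Hadoop streaming step over S3 data and then shuts itself down. The function name, release label, instance types, and S3 URIs are illustrative assumptions; the dictionary keys follow the parameters of boto3's EMR `run_job_flow` call.

```python
def transient_cluster_request(name, mapper_uri, input_uri, output_uri, log_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, instance types, roles) are
    illustrative assumptions, not a recommended configuration.
    """
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "LogUri": log_uri,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False: the cluster terminates once its steps finish,
            # so EC2 capacity is only paid for while the job runs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-s3-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", mapper_uri,
                    "-mapper", mapper_file,
                    "-reducer", "aggregate",
                    "-input", input_uri,
                    "-output", output_uri,
                ],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (needs boto3 installed and AWS credentials configured):
#   import boto3
#   request = transient_cluster_request(
#       "nightly-job", "s3://example-bucket/code/mapper.py",
#       "s3://example-bucket/in/", "s3://example-bucket/out/",
#       "s3://example-bucket/logs/")
#   boto3.client("emr").run_job_flow(**request)
```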
Amazon Redshift is Amazon’s answer to data warehousing. With its fast query speeds and columnar architecture, it is one of the preferred solutions for the commercial data warehousing use case that the pharma industry often requires, especially given that much of the data residing in commercial warehouses tends to be structured. However, one must not make the mistake of applying traditional RDBMS data models and working methods to Redshift, whose architecture requires careful choice of distribution styles and sort keys to optimize performance.
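The point about distribution and sort keys can be sketched as a small helper that generates Redshift table DDL. The helper, table, and column names are hypothetical; the choice of `patient_id` as the distribution key and `service_date` as the sort key is only an assumption about a typical claims query pattern, not a tuning recommendation.

```python
def redshift_table_ddl(table, columns, dist_key, sort_keys):
    """Return CREATE TABLE DDL with an explicit distribution key and sort key."""
    column_names = {name for name, _ in columns}
    missing = ({dist_key} | set(sort_keys)) - column_names
    if missing:
        raise ValueError(f"key columns not in table: {sorted(missing)}")
    column_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n    {column_sql}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


ddl = redshift_table_ddl(
    "claims",
    [("claim_id", "BIGINT"), ("patient_id", "VARCHAR(32)"),
     ("service_date", "DATE"), ("amount", "DECIMAL(12,2)")],
    dist_key="patient_id",       # rows for one patient land on the same slice
    sort_keys=["service_date"],  # date-range filters can skip blocks
)
print(ddl)
```

Distributing on the column used for joins keeps matching rows together, while sorting on the column used for range filters reduces the data scanned; neither concern exists in the same form in a traditional row-store RDBMS.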
At the center of nearly every real world evidence (RWE) platform is a data lake that houses different data types. It also stores the related metadata to identify where each data set came from, who owns it, etc. Analytics engines integrate the relevant streaming (e.g., wearables), structured (e.g., claims data), and unstructured (e.g., notes in electronic health records) data.
It is common for pharma companies to work with a variety of sources, including (but not limited to) the following:
Streaming data from wearables / devices
Structured data from patient claims
Unstructured data from EHRs / EMRs
Data lake solutions typically do not require converting data to predefined schemas. They can be used for ad hoc analytics, quick exploration, and discovery of insights.
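The idea of skipping a predefined schema can be illustrated with a small schema-on-read sketch: raw records are kept exactly as they arrived, and each analysis picks the fields it needs at read time. The record contents and function name below are made up for illustration.

```python
import json

# Raw records as they might land in the lake, one JSON document per line.
# Each source carries different fields; none is forced into a shared schema.
RAW_RECORDS = [
    '{"source": "wearable", "patient_id": "P001", "heart_rate": 72}',
    '{"source": "claims", "patient_id": "P001", "amount": 120.5}',
    '{"source": "ehr", "patient_id": "P002", "note": "follow-up in 6 weeks"}',
]


def read_with_schema(raw_lines, fields):
    """Project each raw JSON record onto the fields requested at read time."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields a record does not carry come back as None instead of failing.
        yield {field: record.get(field) for field in fields}


rows = list(read_with_schema(RAW_RECORDS, ["patient_id", "heart_rate"]))
```

A different exploration can request a different field list against the same raw lines, with no reload or migration step.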
Some examples of the type of processes that can benefit from a data lake are:
Get fast access to clinical images, claims data, etc. for a given patient
Bridge / map incoming data with existing data in your Real World Evidence data lake, for any particular patient or program
Select pre-defined phases in patient longitudinal studies
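The bridging step in the list above can be sketched as a lookup against a crosswalk from source-specific identifiers to a lake-wide patient key. The crosswalk structure, field names, and identifiers here are hypothetical; a real platform would likely need more careful matching than an exact lookup.

```python
def bridge_records(crosswalk, incoming):
    """Attach a lake-wide patient key to incoming records where a mapping exists.

    crosswalk maps (source, source_id) -> lake patient id.
    Returns (matched, unmatched) so unmapped records can be reviewed separately.
    """
    matched, unmatched = [], []
    for record in incoming:
        lake_id = crosswalk.get((record["source"], record["source_id"]))
        if lake_id is None:
            unmatched.append(record)
        else:
            matched.append({**record, "lake_patient_id": lake_id})
    return matched, unmatched


crosswalk = {("claims", "C-17"): "P001", ("ehr", "MRN-9"): "P001"}
incoming = [
    {"source": "claims", "source_id": "C-17", "amount": 120.5},
    {"source": "wearable", "source_id": "W-3", "heart_rate": 72},
]
matched, unmatched = bridge_records(crosswalk, incoming)
```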
We implemented a data lake backed by Amazon S3 and found that we could scale quickly as data volumes grew. Security was addressed via IAM controls and policies, allowing us to protect the private information that is common within a Real World Evidence platform.
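As an example of the kind of IAM policy involved, the sketch below builds a policy document that grants read-only access to a single prefix of the lake bucket. The bucket name, prefix, and function name are hypothetical and do not reflect our actual policies; the JSON structure follows the standard IAM policy format.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> str:
    """Return an IAM policy (as JSON) allowing reads under one S3 prefix only."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = read_only_prefix_policy("example-rwe-bucket", "claims")
```

Because IAM denies anything not explicitly allowed, a role carrying only this policy cannot write objects or read other prefixes, such as one holding clinical images.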