I don’t know many children who dream of working for UPS delivering packages. Growing up in Ohio, I would have laughed at the idea that I would one day manage a package-delivery company. Yet, in a sense, that is what I do: my team delivers data as part of a package.
Did the founders of UPS, formerly known as the American Messenger Company, think they were building a billion-dollar business when it was founded in 1907? Did they imagine they would one day compete with the U.S. Postal Service? Many people take it for granted that packages can be delivered overnight, quickly and cheaply. UPS delivers packages small and large, hot or cold. It has shipped many unusual and extraordinary items, some so unique that it is difficult to price the service.
There are many delivery options available to move our sensitive, private, and intricate data from our electronic warehouses to our eager consumers. My current role has taught me that the problem isn’t a lack of data-storage options but choosing the right one. Can IT departments be viewed as the UPS of data? Can we create and deliver data and information we never imagined when we started?
We have discovered that data delivery is a complex issue in health care. For patient care, data must be delivered securely, accurately, and reliably. Research data demands speed and performance: large datasets must sit on high-IOPS drives, and researchers must receive their results as soon as possible. When exchanging patient data, security and reliability matter most.
If IT budgets were unlimited, all data could be delivered on the fastest network, hardware, and application components. Inova focuses first on storing, transforming, and securing patient data, which is possible because of the solid, consistent systems and architecture within our health system. This secure environment is where we collect, review, and de-identify patient information before it moves to our research environments. Security and reliability of data are our first priorities.
Tiered disk storage is a good option because it accommodates larger data volumes at lower cost. Data can be viewed along a scale of temperatures: cold, medium, and hot. Cold data is large, unprocessed, and rarely accessed; it can be stored in a cost-efficient yet durable way.
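The temperature scale above can be sketched as a simple classification rule. This is only an illustration; the access-frequency thresholds here are hypothetical assumptions, not our actual tiering criteria.

```python
# A minimal sketch of the cold/medium/hot temperature model described above.
# The thresholds are illustrative assumptions, not production values.

def classify_tier(accesses_per_month: float) -> str:
    """Map a dataset's access frequency to a storage temperature."""
    if accesses_per_month < 1:    # rarely touched: archive cheaply and durably
        return "cold"
    if accesses_per_month < 30:   # periodic use: SSD-backed middle tier
        return "medium"
    return "hot"                  # interactive use: fastest, smallest tier
```

In practice the decision would also weigh dataset size and latency requirements, but access frequency is the axis the temperature metaphor captures.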
We use Amazon Web Services (AWS) for most of our long-term storage needs. We continue to develop Amazon S3 lifecycle policies that automatically transition our files to the lower-cost Amazon Glacier storage class, which greatly reduces our spend compared with on-premises options. Glacier does charge for retrieving data, but because we rarely need to pull this data back out, those retrieval costs are minimal.
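A lifecycle rule of the kind described above can be expressed as the structure that boto3's `put_bucket_lifecycle_configuration` accepts. The bucket name, prefix, and 90-day window below are hypothetical placeholders, not our actual configuration.

```python
# Sketch of an S3 lifecycle rule that transitions objects to Glacier.
# Prefix, rule ID, and the 90-day window are illustrative assumptions.

def glacier_lifecycle_rules(prefix: str = "research/cold/", days: int = 90) -> dict:
    """Build the lifecycle configuration dict that boto3 expects."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }

# Applying the policy would look like this (not run here):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="example-research-archive",   # hypothetical bucket name
#     LifecycleConfiguration=glacier_lifecycle_rules(),
# )
```

Once the rule is in place, S3 moves matching objects to Glacier automatically; no per-file scripting is needed.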
Medium data is accessed more often, but not every day. Although smaller than the cold tier, it can still run to terabytes. This data must be accessible and movable with minimal latency; we expect faster access than the cold tier but are willing to wait seconds for it to be put into motion. Solid-state drives make this possible, at a higher cost than the cold tier. We store and deliver this data using Amazon EC2 or on-premises disks.
Hot data is smaller than medium-tier data. These data have been further refined through extract, transform, load (ETL) processes or deeper analysis by our scientific team. We expect these data to move easily and to be queried quickly; data manipulations should complete in milliseconds so that users can interact with the results.
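The ETL refinement described above can be sketched in miniature: raw records are extracted, stripped of identifiers, and loaded into a small aggregate that queries in effectively constant time. The field names are hypothetical, chosen only to illustrate the shape of the pipeline.

```python
# A minimal, hypothetical ETL sketch: extract raw records, transform by
# dropping identifying fields, and load an aggregate suitable for fast
# interactive querying. Field names ("patient_id", "diagnosis") are
# illustrative, not our actual schema.

from collections import defaultdict

def etl_summary(raw_records: list[dict]) -> dict:
    """Extract, transform (de-identify), and load diagnosis counts."""
    summary: dict[str, int] = defaultdict(int)
    for record in raw_records:                  # extract each raw record
        diagnosis = record.get("diagnosis")     # transform: keep only the
        if diagnosis:                           # non-identifying field
            summary[diagnosis] += 1             # load: aggregate counts
    return dict(summary)
```

The resulting dictionary is small enough to live on the hot tier, where lookups answer in microseconds rather than the seconds a cold-tier scan would take.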