Improve your data management with long-term archiving solutions
Peter Copley explains why long-term archiving policies are critical to companies in the oil and gas sector and how they should be implemented.

There are some very good reasons to retain data. These include increasing regulatory demands, a need to be able to access information on decision-making processes used in past projects, and the ability to protect against events that can lead to data loss. At the same time, as political and economic situations change, previously appraised projects might become more attractive. As technology changes, there may also be competitive advantages in being able to re-appraise data without having to reacquire it.

On the other hand, there are also compelling reasons to dispose of data. Some of it might be useless and some of it might have to be destroyed to meet the requirements of data protection regulations. You might have relinquished legal ownership of the data, want to decrease the amount included in regular backup cycles, or need to reduce data storage and management costs. The first step, therefore, is to define your company’s archiving policy.

Define an archiving policy

A lot of technical work within the oil and gas industry is project based, often related to a geographic area or an engineering project. Consider, for example, an exploration project. It may be worked on for three or more years and then mothballed. During the project, many different disciplines will have been both processing and interpreting data, producing results from both technical and office applications. To create a meaningful archive of such a project you need to consider how to capture all these results comprehensively, maintain the internal referential integrity of the data and add as much descriptive information as possible for subsequent querying.

Do you know which information is valuable? In deciding what you need to archive, consult each of the domain experts to understand what should be included.
These are the people who really understand the relative importance of the component data. They will tell you what data must be included and what can be safely ignored.

When should you archive? Should data be archived at the end of a project only, or also at important milestones during it? Intermediate archives can provide rollback functionality in the case of particular problems, for example project corruption.

How long should you keep each archive? You may not have a choice, as this could be governed by one of today’s many compliance regulations. Should you keep an archive forever (and how long is that)? Can you predict that data will become obsolete? Consider, for example, changes in the applications that produced the data in the first place.

How are you going to capture all the relevant data and maintain the consistency of cross-references? The project data may reside on multiple file systems and it may be referenced by application databases. There could also be data from multiple operating systems, for example technical information on Unix, Linux and Windows machines, as well as reports and spreadsheets from Windows.

To begin with, there are a number of steps that have to be taken for all applications in each department. Find out what their project data structures comprise and research how they should be archived. Then identify any application-specific utility provided by the vendor that performs an archive backup. Finally, find out what metadata sufficiently describes the project datasets and research how this could be captured, preferably automatically.

Features of an archiving solution

Try to have one central archiving solution for all the application (and unstructured) data in your organisation. Maintain a detailed online index of all archives with their associated metadata so that in ten years’ time you will be able to rapidly search the index for relevant information.
You might find that the metadata tells you everything you need to know without having to restore any of the actual project data. If you have relinquished ownership of the data, the index will identify all components that need to be deleted.

For each application, implement project-oriented methods for automatic data capture, automatic metadata capture and manual metadata capture. Some applications have their own internal utilities that are designed to capture project data in a manner that maintains consistency and ease of restore. Where possible, your archiving solution should harness these utilities to provide an archive that can be stored and indexed in the central archive.

Integrate the archiving solution with the chosen storage solution transparently, so that the complexity of the storage is irrelevant to the people doing the archiving. Such integration should be synchronised with the storage solution so that archivers will only need to be aware of any failures and can take successes for granted.

Implement rules that can automatically exclude project data that has no value to a long-term archive. This effectively allows you to clean up projects as you archive, and avoids archiving, storing and restoring useless data.

Devise standard naming conventions and pre-load these names so that they can be selected during the manual metadata capture. You will avoid the age-old problem of inconsistent free-text metadata input that makes it impossible to reliably query the index.

If possible, capture the geospatial characteristics of the project as you archive. This will allow you to visually correlate archives against live data and quickly see all the data that might be held for a specific area.

Implement verification passes within the archiving process so that you can be sure that the data in the archive is an exact copy of the data on disk. Implement security on all objects within the archive so that data is available to users on an as-needed basis only.
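
The verification pass mentioned above can be illustrated with a short Python sketch. It is only one possible approach, assuming that the archive copy is reachable as an ordinary directory tree and that comparing SHA-256 digests is an acceptable test of "exact copy". The function names (`file_digest`, `build_manifest`, `verify_copy`) are hypothetical and do not come from any particular archiving product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(source_root: Path, archive_root: Path) -> list[str]:
    """List relative paths that are missing, extra, or different in the archive."""
    source = build_manifest(source_root)
    archive = build_manifest(archive_root)
    return sorted(
        name
        for name in source.keys() | archive.keys()
        if source.get(name) != archive.get(name)
    )
```

An empty list from `verify_copy` means every file matched. A manifest of this kind could also be kept with the archive's metadata so that later restores can be checked against it, although that is a design choice rather than something the article prescribes.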
Finally, maintain auditable records that show all transactions on the archive.

Which media?

There are numerous different archive storage options available, including disk, tape, optical media, virtual tape, content-addressed storage devices, and archiving clusters. Hardware vendors will, naturally, highlight the advantages of their own solutions and the disadvantages of their competitors’. So there are a number of important considerations at this point.

For example, do you need to hedge your bets by making two copies of each archived data set, and should you write them to different media types? Another important question is that of media life versus data life. If media life is less than data life, then consider how best to migrate to the ‘next’ media. Also consider media capacity: the lower the capacity, the more fragmented datasets may become. Then there are the failover mechanisms: how do you recover from a catastrophic media failure? And do you need a write once read many (WORM) capability? Regulations may insist on this.

Do not forget the true total cost of ownership, either. This is a complex calculation, but you should consider initial purchase or lease costs, media cost, management costs, upgrade and migration costs, footprint and power consumption.

If you need to delete data, for example when ownership is relinquished, how easy is it to destroy all copies of the data, and what effect does this deletion have on data that you want to retain and that coexists on the same media? Similarly, what level of deletion is possible on the chosen media and can it be proved that deletion has actually taken place?
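
The media life versus data life question raised in this section lends itself to a back-of-the-envelope calculation. The Python sketch below is illustrative only. The function name `migrations_needed` is hypothetical, and it assumes that each generation of media is used for its full rated life before the data is copied to the next, which is a simplification introduced here rather than a recommendation.

```python
import math


def migrations_needed(data_life_years: float, media_life_years: float) -> int:
    """Estimate how many media migrations a dataset needs over its retention period.

    Assumes (hypothetically) that data is migrated only when the current
    media reaches the end of its rated life.
    """
    if data_life_years <= 0 or media_life_years <= 0:
        raise ValueError("lifetimes must be positive")
    # The first set of media covers the first media_life_years; every
    # further (possibly partial) period requires one migration.
    return max(0, math.ceil(data_life_years / media_life_years) - 1)


# Data that must be kept for 30 years on media rated for 10 years
# needs two migrations (at roughly years 10 and 20).
print(migrations_needed(30, 10))  # prints 2
```

Each migration counted here would also feed into the upgrade and migration element of the total cost of ownership.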
Make sure that you are producing a meaningful archive and that all the data and metadata is being captured as requested. Restore projects and make sure that the applications can use the data and that referential integrity has been maintained. Also, get end users to buy in to your system. If they trust the system, they are more likely to entrust 'their' data to it and to let you start taking projects offline.

What are the alternatives?

Buying more disks and keeping everything online is a very popular strategy at the moment, but over time it does tend to compound data management issues. While storage is getting cheaper all the time, the costs of managing this storage are going the other way. So it is likely that increasing data volumes will negate any advantage gained in reducing storage costs. In any case, the disk still needs to be backed up or mirrored for disaster recovery.

Another alternative is to harness the current backup solution. Such solutions tend to be file system oriented and particularly geared towards short-term data retention.

You could implement homegrown scripts and ad hoc backups using tools such as zip and tar. The problem here, however, is that each department will probably have its own methods and repositories. There will always be problems maintaining such a diverse set of scripts and methods, especially when the departmental expert decides to move on.

Then there are file system tiered storage solutions that take advantage of cheaper storage such as SATA and tape. This is not really archiving for the long term and tends to be file system based. Implementing tiered storage allows you to keep your primary storage, which is expensive to buy and manage, available for business-critical data that is of immediate value to the organisation. Data of lesser immediate value can be moved down to lower tiers of storage, yet still be available when required.

Finally there is project-oriented tiered storage.
As previously discussed, much of the data used is project oriented and it makes sense to address storage tiering at a project level. For example, you may want to consider software that can identify an application project that has not been accessed in the last six months and move the complete project dataset from its various locations on primary storage down onto a lower tier of storage. This should be done in such a way that if the project suddenly becomes active again, it becomes accessible simply by an end user accessing it.

Enigma Data Solutions Ltd is a leading provider of rapid-access, near-line information storage solutions to the petroleum industry and other industries facing intensive data storage challenges.
