Data Loss Prevention
Computers are more relied upon now than ever, or more to the point, so is the data they contain. In nearly every instance the system itself can be easily repaired or replaced, but the data, once lost, may be impossible to recreate. That is why the Data Recovery Clinic stresses the importance of regular system backups and the implementation of some preventative measures.
The chart above lists the most common reasons data recovery is needed. In all cases there are steps that you, the user, can take to minimize your risk of data loss.
1. Natural disasters
While the least likely cause of data loss, a natural disaster can have a devastating effect on the physical drive. However, Data Recovery Clinic has rescued data from fires, floods, lightning strikes, and the subsequent power surges.
In instances of severe housing damage, such as scored platters from fire, water immersion due to flood, or broken or crushed platters, the drive may become unrecoverable.
The best way to prevent data loss from a natural disaster is an off-site backup. Since it is nearly impossible to predict the arrival of such an event, more than one copy of the system backup should be kept: one on site and one off. The type of media you back up to will depend on your system, your software, and how frequently you need to back up. Can you proceed with a day's data loss? A week's? A month's? Also be sure to check your backups to be certain that the data was actually written. There's nothing worse than attempting to restore data from a blank medium.
2. Viruses
Viral infection increases at a rate of nearly 200-300 new trojans, exploits, and viruses every month. There are approximately 56,712 "wild" (risk-posing) viruses and about 105,000 total known viruses, some of which are considered non-threatening. With those numbers growing every day, you are at an ever-increasing risk of becoming infected with a virus.
There are several ways to protect yourself against a viral threat:
a. Install a firewall on your system to prevent hackers from accessing your data.
b. Install an anti-virus program on your system, use it regularly, and scan to see whether you have been infected. Many viruses will lie dormant or perform minor alterations that can cumulatively disrupt how your system works. Be sure to check for updates on a regular basis.
c. Back up, and be sure to test your backups for infection as well. There is no use in removing the virus only to restore it again from your backup.
d. Be wary of any email containing an attachment. If you don't know where it came from or what it is, then don't open it.
e. If you have contracted a "wild" virus for which there is no known cure, quarantine it to that system and contact the Data Recovery Clinic for further information and assistance.
3. Human Errors
Even in today's era of highly trained, certified, and computer-literate staff, there is always room for accidents. Sometimes referred to as the U.S.E.R. virus, human mistakes are made daily all over the world. There is not much we can do as users to prevent the intervention of Murphy's Law, except to be cautious. Here are a few things you might want to try:
a. Be aware. It sounds simple enough to say, but it is not so easy to do. When transferring data, be sure it is going to the destination you had in mind. If asked "Would you like to replace the existing file?", make sure you really want to before clicking "yes".
b. If you are even a little bit uncertain about a task you are about to carry out, make sure there is a copy of the data to restore from.
c. Take extra care when using any software that may manipulate your drive's data storage, such as partition mergers, format changes, or even disk checkers.
d. Before upgrading to a new operating system, back up your most important files or directories in case there is a problem during the install. Keep in mind that if you have a slaved data drive, it may become formatted as well.
e. Never shut the system down while programs are running. The open files will more than likely become truncated and nonfunctional.
4. Software Malfunction
Software malfunction is a necessary evil when using a computer. Even the world's top programmers cannot anticipate every error that may occur in any given program. There are still a few things you can do to lessen the risk:
a. Be sure you are using the software ONLY for its intended purpose. Misusing a program may cause it to malfunction.
b. Using pirated copies of a program may cause the software to malfunction, resulting in corruption of your data files.
c. Be sure that you have the proper amount of memory installed if you plan to run multiple programs simultaneously. If a program shuts down or freezes up, you may lose or corrupt what you were working on.
d. Back up, back up, back up. A tedious task, but you will be glad you did if the software corrupts your customer database.
5. Hardware Malfunction
The most common cause of data loss, hardware malfunction or hard drive failure, is another necessary evil inherent to computing. There is usually little to no warning that your drive will fail, but some steps can be taken to minimize the need for data recovery from a hard drive failure:
a. Do not stack drives on top of each other; leave space for ventilation. An overheated drive is likely to fail. Be sure to keep the computer away from heat sources and make sure it is well ventilated.
b. Purchase a UPS (uninterruptible power supply) to lessen malfunction caused by power surges.
c. NEVER open the casing on a hard drive. Even the smallest grain of dust settling on the platters in the interior of the drive can cause it to fail.
If you need hard drive recovery, do one of the following:
Fill out an online data recovery quote form - a representative will get back to you within an hour of submittal.
Call 212-759-0946 to speak with a representative and receive your quote over the phone. We answer our phones 24 hours a day 7 days a week.
Fill out a data recovery request form and ship us your drive. Please follow any instructions on how to package and ship a hard drive.
Overview of Backup Technology
This reliance on full-time availability of data means the time available to back up data is shrinking, while the demands for 100% availability of important data and for frequent backups are growing. These trends are placing enormous pressure on Information Technology organizations to increase the speed of backups while reducing the degree to which they intrude on day-to-day operations. Equally important is the need to recover files quickly and efficiently. Thus scheduled backups and rapid recoveries are activities that must be predictable, stable, reliable, and fast.
Basics of Backup and Recovery Technology
Physical and Logical Backups
Database Backup Technology
Because transactions must be logged during the backup process, database performance may be degraded while on-line backups are performed. One way to back up a database that must sustain high transaction rates is to mirror the database and perform a physical backup of the mirror. This requires first altering the database to begin backup, which establishes a quiescent database image. The mirror is then detached so that a static image of the database is maintained on the detached mirror. The database is then altered to end backup, which allows logged transactions to be rolled forward into the tablespaces while a raw device backup of the mirror is done. When the backup is complete, the mirror is re-attached and the mirroring mechanism synchronizes the two disk images once again.
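The sequence can be summarized in a short script. The following is only a minimal sketch: the helper functions are stubs for whatever SQL interface, volume manager, and raw-device backup commands are actually in use, the ALTER TABLESPACE statements are merely Oracle-style illustration, and the tablespace names, volume name, and device path are made up.

    # Hedged sketch of the mirror-split on-line backup sequence described above.
    # The helpers are stubs standing in for site-specific commands.

    def run_sql(statement):        # stub: would submit SQL to the DBMS
        print(f"SQL> {statement}")

    def detach_mirror(volume):     # stub: would split the mirror in the volume manager
        print(f"detach mirror of {volume}")

    def reattach_mirror(volume):   # stub: would re-attach and resynchronize the mirror
        print(f"re-attach mirror of {volume}")

    def dump_raw_device(device):   # stub: would run the raw-device backup to tape
        print(f"raw backup of {device}")

    def backup_via_detached_mirror(tablespaces, volume, raw_device):
        # 1. Quiesce the database image so the mirror captures a consistent state.
        for ts in tablespaces:
            run_sql(f"ALTER TABLESPACE {ts} BEGIN BACKUP")
        # 2. Detach the mirror; it now holds a static image of the database.
        detach_mirror(volume)
        # 3. End backup mode so logged transactions roll forward into the
        #    tablespaces while the raw-device backup of the detached mirror runs.
        for ts in tablespaces:
            run_sql(f"ALTER TABLESPACE {ts} END BACKUP")
        dump_raw_device(raw_device)
        # 4. Re-attach the mirror so the mirroring mechanism resynchronizes it.
        reattach_mirror(volume)

    backup_via_detached_mirror(["USERS", "SALES"], "db_vol", "/dev/rdsk/c1t0d0s2")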
Raw Device Backups
Advances in Backup Technology
Faster Throughput Rates
Automated Backup and Recovery Management Procedures
New Approaches to On-line Backups Using Database Technology
All of these developments in database backup technology require processing power and I/O bandwidth in order to work in concert to speed the backup process. Sun's Ultra Enterprise servers provide scalable, symmetric multi-processing, scaling from one to 64 high-performance UltraSPARC processors, up to 64 GB of memory, and supporting up to 20 TB of disk storage. The advent of scalable I/O platforms such as these allows DBMSs to be configured with the optimal balance of processing power and I/O bandwidth--enabling on-line backups to proceed without impacting database performance.
Capacity planning is part science and part art. The capacity planner must account for numerous variables and virtually unlimited configuration permutations. Systems are often underconfigured and the wrong products are often selected for the job. Because installation and configuration are complex, there is much room for error. Furthermore, because there are always interrelated bottlenecks, a major aspect of capacity planning is choosing the preferred bottleneck.
The main role of the capacity planner is to choose hardware and software for efficient backup and recovery in the Datacenter. To do this, the planner must first determine the following:
Volume of data the Datacenter will be managing
Understanding the Enterprise
Total Dataset Size
Minimum amount of storage capacity required
Number of separate files. The total volume of data may be composed of a few large files or millions of small files. Certain types of data (e.g., databases) may not reside in files at all, but be built on top of raw volumes. In filesystem backups, there is often a small fixed overhead per file: the file record needs to be added to the backup database, the directory information read, and the disk needs to perform a seek to the beginning of each file.
Knowing the number of files also helps the planner determine the size of the backup index database retained by the backup software. On average, Sun StorEdge Enterprise NetBackup software suggests planning for 150 bytes in the database per file revision retained on media. That works out to over seven million file records per gigabyte of index database.
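As a quick check of that figure, the size of the index database can be estimated directly from the file count and the number of revisions retained. A minimal sketch, using the 150-byte-per-record planning figure quoted above:

    # Estimate the backup index database size from file counts.
    BYTES_PER_FILE_RECORD = 150            # planning figure quoted above

    def index_db_size_bytes(num_files, revisions_retained):
        return num_files * revisions_retained * BYTES_PER_FILE_RECORD

    # Example: 10 million files with 4 retained revisions each.
    size = index_db_size_bytes(10_000_000, 4)
    print(f"Index database: ~{size / 2**30:.1f} GB")                   # ~5.6 GB

    # Conversely, one gigabyte of index database holds roughly:
    print(f"{2**30 // BYTES_PER_FILE_RECORD:,} file records per GB")   # just over seven million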
Average file size. By knowing the above two pieces of information, the capacity planner can calculate the average file size in the enterprise. If there is a large skew in file size distribution (e.g., many small files and a couple of very large files that throw off the average), the average may not be a good predictor of behavior. Therefore the planner must plan for slightly different performance when backing up small files versus large files.
Average directory depth. The directory structure into which the files are organized may also affect the performance of the backup system. This is partly because long directory paths result in multiple seeks to the disk. Longer paths also result in larger records, because each filepath backed up is recorded in the database as a variable-size entry. Therefore, longer paths tend to make the backup index databases grow faster.
Size of the Dataset that Changes
Frequency of dataset change. The frequency of dataset change determines how often backups need to be performed, and it can vary widely. For example, some directories never change, some change only when something is upgraded, some change only at the end of the month, and some, like user mailboxes, typically change on a minute-by-minute basis. In addition, the frequency of dataset change, in part, determines the volume of data written during incremental backups, because incremental backups only save the files that have changed.
Amount of data to be backed up. The planner needs to decide whether to back up all the data or only the changed portions. While it is usually faster to save only the changed portions, it is also usually faster to restore whole directories and filesystems from full backups than from incremental ones. This is because of the restore process: restores from incremental backups need to first restore from the full backup, and then from all the incremental backups, until the latest versions of all files have been restored. This multi-step process often results in numerous tape mount requests and multiple retrieves of the same piece of data. The choice of performing full backups or incremental ones tends to be a matter of which case is most important: a regularly scheduled backup or an emergency after data has been lost on disk. While the former is done much more frequently, the latter tends to be a more time-critical situation.
Database or higher-level application data plays a special role in effective capacity planning. Unless the enterprise has relatively simple availability requirements for their data, backup will require special modules to save the data in a consistent state for restore. These modules are available for many popular database and application environments for both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages.
The following are types of data the planner needs to consider. The various data types mentioned below include an example compression ratio for the DLT tape 7000 tape drive.
Text or natural language. Text or natural language tends to have a lot of redundancy, and can therefore be well compressed by both software and hardware. For example, in tests using sample English texts, the DLT tape 7000 hardware compressed the data at a ratio of approximately 1.4:1.
Databases and high-level applications. Many popular database packages and application environments have corresponding backup modules for Sun StorEdge Enterprise NetBackup and Solstice Backup software packages. For example, backup modules exist for Oracle, Informix, and Sybase database packages as well as for application environments like SAP. These modules enable backing up and restoring data in a consistent state, without taking the database off-line, making it unavailable to users.
Additionally, while databases and high-level applications tend to have widely varying contents and structure, they often contain text or numeric data with a lot of redundancy. This makes them more compressible. For example, in tests with sample databases from a TPC-C benchmark, the DLT tape 7000 hardware compressed the data at a ratio of approximately 1.6:1.
Graphics. Many applications require manipulating numerous large graphical objects. The fact that graphic files tend to be larger than text files does not imply the filesystem will consist of a few large files. This is because applications create composite objects from a myriad of smaller isolated objects.
In general, graphic objects tend to be previously compressed, making further compression in hardware or software unlikely. Indeed, the nature of hardware compression algorithms often inflates files that are already optimally compressed. For example, in tests with Motion JPG data, the DLT tape 7000 hardware compression showed a compression ratio of approximately 0.93:1.
Combined file types. Data residing on network file servers and internet servers, the most common server types, is usually a mix of text, graphics, and binary files. Because these datasets often consist of many small files, the capacity planner must also evaluate system performance. These mixed file types compress well. For example, in tests with files from network file servers and internet servers, the DLT tape 7000 hardware had a compression ratio of approximately 1.6:1.
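These example ratios can be combined to estimate how much tape a mixed dataset will actually consume. The sketch below is only illustrative: the ratios are the sample DLT tape 7000 figures quoted above, and the dataset sizes are made up.

    # Estimate tape consumption for a mixed dataset using the example ratios above.
    EXAMPLE_RATIOS = {
        "text": 1.4,
        "database": 1.6,
        "graphics": 0.93,          # already-compressed data may actually expand
        "mixed_fileserver": 1.6,
    }

    def tape_gb_needed(dataset_gb_by_type):
        return sum(gb / EXAMPLE_RATIOS[kind] for kind, gb in dataset_gb_by_type.items())

    # Example: 200 GB of database data, 50 GB of text, and 100 GB of graphics.
    need = tape_gb_needed({"database": 200, "text": 50, "graphics": 100})
    print(f"~{need:.0f} GB of tape capacity")    # ~268 GB for 350 GB of source data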
As mentioned previously, raw dumps copy all the bits from the storage volume to the backup media. This captures the bits for any filesystem or database metadata, as well as the actual application data written on that volume. However, the metadata may be out of sync with the data in the volume, because the metadata on the volume is not interpreted, and the volume cannot differentiate the backup from another access. To prevent this problem, the volume is typically taken off-line to prevent updates to both data and metadata. Another solution is to mark all entities on that volume read-only for the duration of the backup.
The severity of this problem varies depending on the types of filesystems and databases to be backed up. On-line filesystems maintain consistency, and do not require periods of unavailability. However, some higher-level applications may keep their data and metadata in the filesystem, and may need to be taken off-line or otherwise prevented from updating their files during the backup. Preventing file updates during backups ensures that all the application data can be saved, and later restored, in a consistent state.
Another consideration between raw volume and filesystem backup is the atomicity of the data. The raw volume is treated as one large entity, while filesystems are divided into many small logical pieces. To recover even one portion of data (e.g., a file or database row) from a raw volume dump in a consistent state, the entire dataset needs to be restored. Restoring the entire dataset not only takes longer, but it also overwrites any changes to all the other data that had been made since the dump. In addition, incremental backups are currently impossible with raw volumes, because an update to any part of the volume compromises the integrity of the whole, and the whole volume needs to be dumped again. With filesystems, only those files that changed since the last backup need be saved again.
The main advantage of raw volume dumps is the sheer efficiency of dumping raw bits without further interpretation by the system. The disk accesses tend to be large and sequential, minimizing the overhead of system calls and eliminating seeks by the disk drive arms (which are orders of magnitude slower than data transfers).
In contrast, filesystems add additional overhead. The data from file accesses is, by default, buffered in the virtual memory system, and this incurs copies in the kernel. In addition, files are read from disk in directory order and may be scattered in various areas of the disk, causing seeks to pass from one file to the next. This process may reduce the data rate from the disk volume. To perform closer to the level of raw dumps, the filesystem inefficiencies can be minimized through careful configuration and tuning. Nevertheless, there are certain situations where raw dumps are superior, if only for their sheer simplicity.
Filesystems can also offer a number of features that benefit effective backup configuration and planning. Chief among these is the ability to turn on Direct I/O. Direct I/O is a method of accessing files in the filesystem as though they were raw devices. It mainly bypasses virtual memory buffering, which may result in a large saving in CPU time, memory usage, and overall wall-clock time. (Despite the benefits of Direct I/O, seeking to various positions on the disk to reach the beginning of each file cannot be avoided.) A recent study showed that Direct I/O saved an average of approximately 20% of CPU cycles, and kept the system from thrashing during extraordinarily heavy loads.
Direct I/O is available in both VxFS and UFS (starting with the Solaris 2.6 Operating Environment software). VxFS provides various mechanisms for engaging Direct I/O, including a per-I/O option. The most common method, however, is to use a mount-time option to enable this feature for the entire filesystem. UFS also allows Direct I/O to be turned on for the entire filesystem. One additional benefit of VxFS is that a filesystem can be remounted with different options without first unmounting the filesystem. This allows users to remain on-line and active even when Direct I/O is toggled, a benefit in enterprises where continuous operation is necessary.
Lastly, the VxFS filesystem provides a quick snapshot capability that can mount an additional filesystem as a read-only snapshot of the original. This is done while the original is still active and available. This feature is implemented via a copy-on-write mechanism that ensures any blocks from the original filesystem are copied out to a special area before the block is changed on disk. Activating the filesystem snapshot capability requires much less additional disk space than a snapshot taken through the logical volume manager, because only blocks that change during the snapshot need to be duplicated.
Is the Server Where the Data Resides the One Doing the Backups?
Disk bandwidth should be configured to meet the backup window requirements and keep the tapes streaming. (To keep from back-hitching, the DLT tape 7000 tape drive needs to receive data at a rate no less than 3.5 MB/second.) This may be difficult to ensure, because the server and disk subsystem are often already in place and tuned to perform a specific set of tasks. In this case, to determine if the desired backup window is feasible before planning for a specific set of tape devices, it often helps to measure the sequential rate of the disk subsystem. If the backup window is feasible, but backup performance still suffers due to slow disks, the planner needs to consider reconfiguring or upgrading the disk subsystem as part of the system upgrade path.
Lastly, the planner needs to consider the CPU resources necessary for local backup. Fortunately, these tend to be minimal, especially if Direct I/O is used to access the filesystems. For example, with Direct I/O, a single 250 MHz CPU should be sufficient to back up at 50 MB/second from local disk to tape. If the backups will be concurrent with regular operation and the system is already fully loaded, additional CPU resources may need to be added for backup.
There are some additional factors the planner must consider. If the system has spare processing capacity, the planner must determine how much headroom exists and whether it will be sufficient to meet demands. Secondly, if the backups will be performed at off-peak hours, the planner must determine if there are any other scheduled processes to be run concurrently with the backup, and how much CPU is available for both. The planner also needs to consider sizing and tuning memory, especially if Direct I/O is not used. Although memory is needed for essentially all system activities, the main consideration in that case is the shared memory buffers used to coordinate between the various backup processes.
Is the Data on Remote Clients?
Even with the latest networking technologies, network bandwidth tends to lag behind the bandwidth of storage subsystems. Gigabit Ethernet is theoretically 100 times faster than Ethernet, but at the same time, FiberChannel Arbitrated Loop (FC-AL) offers twice the available bandwidth of Gigabit Ethernet. This discrepancy in bandwidth is unlikely to change anytime soon, because the tolerances in network connectivity tend to be much tighter than for storage. Network bandwidth issues are further complicated by the relatively high cost of upgrading the network infrastructure. While new storage devices can just be plugged in, adding network capacity may mean re-wiring parts of the enterprise. Such infrastructure tends to be very expensive and needs to be planned years in advance. Therefore, even if the upgrade is committed, there is often a period of time where the backup solution needs to work around inadequate network bandwidth.
Because of these network bandwidth issues, a frequent challenge when planning backup solutions is to find ways to satisfy backup requirements within the confines of a given network bottleneck. To understand the overall situation and to obtain a satisfactory solution, the planner needs to find the answers to the following five key questions:
How many clients are there? Knowing the number of clients helps the planner understand the overall scale of the enterprise. It also helps the planner determine aspects of backup planning such as level of multiplexing. Knowing the number of clients is also important because it ties in with the clients' location in the network in relation to the backup server.
What types of clients are there? To understand the client processing capabilities, the planner needs to know the types (i.e., the architecture and operating system) of clients that need to be backed up. For example, if a client has powerful processing capabilities but little network bandwidth to the server, software compression may be a good choice in backing up that client. Both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages offer client-side modules for most platforms.
Do the clients have their own backup devices? If the clients have their own backup devices, the best configuration may be a hierarchical master-slave configuration. In this configuration, the master server initiates and tracks backups, but data goes to the local device. This configuration saves network bandwidth, and can often be significantly faster. The master-slave configuration is recommended for large clients connected to the backup server by a slow network. The backup server is often less powerful than the clients it controls, and the main backup devices are attached to the slave clients.
How are the clients distributed? Knowing where in the enterprise network various clients reside helps the planner determine the available network bandwidth between the clients and server. This is necessary information for predicting backup times and data rates available from the clients to disk. Because the network bandwidth is often inadequate, a hybrid solution is most appropriate, in which both network backup of some clients and master-slave configurations are used.
How autonomous are the client systems? Sometimes the client systems are located in remote offices connected to the backup server via WAN (wide-area network) links. These systems often do not have dedicated technical support, and hence need to be managed remotely. By centralizing management, Sun StorEdge Enterprise NetBackup software helps make that task easier. However, certain tasks are necessarily manual, and involve personnel at the remote site. These people will need to be trained to carry out specific tasks associated with backup (e.g., changing tapes in stand-alone drives).
What Does the Disk Subsystem Look Like?
How are the data on the disks laid out? The data layout on the disk affects throughput rate, because it determines whether access to the disk is mostly sequential or random. If the access pattern requires frequent seeks between portions of the disk, the overall throughput rate of data from the disk will dramatically decrease.
There are three reasons that the access pattern may require frequent seeks. The most common one is that the data on the disk was created over a long period of time. In this case, deleted files leave gaps on scattered parts of the disk, and those gaps are subsequently filled by newer files. A seek may then occur to get the next file, because the disk is backed up in directory order. (In this case, one way to obtain mostly sequential access to the existing files--albeit not an ideal process--is to back up all the files once, recreate the filesystem on the device, and then restore all the files from tape.)
Another common cause for this access pattern is that multiple processes are accessing different regions of the disk simultaneously. This results in seeks between the various regions. This can occur, for example, if two different filesystems on the same disk are being backed up simultaneously. In this case, it may be possible to serialize the access by scheduling the backups differently.
A third reason for this pattern is that outer regions of the disk (lower numbered cylinders) tend to be faster than inner regions. Data that needs to be accessed more quickly may be laid out on the outer cylinders.
How are the disks arranged into logical volumes? The logical volume configuration significantly affects performance. To add levels of performance or reliability to the disk subsystem, most enterprise server environments will involve some level of logical volume management, using software or hardware RAID.
RAID-0 (or stripes) volumes tend to increase overall performance, but significantly reduce overall volume reliability. Various combinations of RAID-1 (mirroring) and RAID-0 increase performance while also increasing reliability. RAID-5 also tends to increase both performance and reliability. However, RAID-5 has performance characteristics which slightly complicate backup planning. Approximately two to three times more time should be planned for restoring data to a RAID-5 volume than it took to back it up, because RAID-5 writes (especially small random writes) take significantly longer than reads. The expected reliability of the logical volumes plays a role in determining backup frequency. The RAID volume should probably be backed up more frequently if the following are all the case: the volume has poor reliability (e.g., RAID-0), it is updated often, and it contains valuable data.
How are the disks managed? Another important consideration is the mechanism by which the individual disks are managed or configured into logical volumes. Two possible mechanisms are host-based and hardware RAID. Host-based RAID imposes slightly more overhead on the server system than hardware RAID, but tends to be more flexible. Various volume managers offer different RAID configuration options (e.g., RAID 1+0 vs. RAID 0+1). Some volume managers also offer additional features (e.g., snapshot) that are attractive for backup solutions. A large number of server clients and most workstation/PC clients do not implement logical volume management at all, and are limited to the performance and reliability characteristics of the individual component disks (i.e., JBOD).
What are the disk capabilities? The capabilities of the individual disks also affect disk subsystem performance and reliability. Newer disks tend to be faster and more reliable than older disks. This is not only because of age, but also because of rapid advances in disk technologies. When doing sequential I/O, each disk tends to be capable of a certain data rate, and a certain random seek rate. When the disks are managed as RAID volumes, these capabilities place limitations on the overall logical volume performance. Additionally, different disks have different MTBF (mean-time-between-failures).
What Does the Tape Subsystem Look Like?
Where do the tape devices reside? The planner needs to determine whether the tape devices are stand-alone desktop or rack-mounted units that need to be loaded by hand, or if they are mounted in a robotic library. If they are the former, the planner needs to consider planning for the human interaction required to implement an effective backup solution.
The robotic library is a superior choice for enterprise-level backup solutions. There are many variations of tape libraries, but most commonly they offer multiple tape drives and internal storage capacities in the hundreds of gigabytes.
By knowing the required data capacities, the planner can plan for a sufficient number of libraries to house all the data and to have room to grow. It may be more reliable to purchase a number of smaller libraries than a single very large library, because most tape libraries have only a single robot mechanism.
How many tape drives are there? The planner needs to determine the number of tape drives needed to meet the throughput requirements, and to configure at least that many as part of the libraries. The planner must also account for the SCSI or FC-AL slots on the server needed to connect the tape drives and library robotics. If there is an existing tape subsystem, they must determine its capabilities and supplement them with new equipment, if necessary. They must also be aware of any forward or backward compatibility issues with the media, because tape formats change almost as frequently as the underlying hardware.
What are the drive capabilities? Each individual type of tape drive has its own characteristics and capabilities. These include native-mode throughput, tape capacity, effectiveness of compression, compatibility of tape formats, and recording inertia. While throughput and capacity are relatively simple, the others also need to be carefully considered.
The actual compression ratio achieved depends mostly on the type of data, but it also depends on the compression algorithm implemented by the drive hardware. For example, the DLT tape 7000 algorithm prefers to trade throughput for compaction, while the EXB-8900 Mammoth 8 mm drive prefers the opposite. Not all tape drives are capable of using older media, even if the form-factor is identical. Most can read tapes written with older formats but cannot write in the older format.
If the backup images are to be archived for a number of years, the upgrade path is also important. The drive technology will chiefly determine the recording inertia. For example, linear recording technologies like the DLT tape 7000 and STK Redwood drives tend to have a stationary read/write head and quickly moving tape. To perform well, these drives need to be fed data above a specific rate. Helical-scan technologies like 8 mm and 4 mm tapes have lower recording inertia and are thus less sensitive to data input rates, but have overall lower throughput capabilities. It is difficult to balance all these factors, but as long as some minimal requirements are met, a suboptimal choice usually has little real effect on the overall performance.
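Putting the throughput, compression, and streaming-rate considerations above together, the planner can roughly estimate how many drives of a given type are needed and whether each can be kept streaming. The sketch below uses illustrative DLT 7000-class figures (an assumed 5 MB/second native rate and the 3.5 MB/second streaming floor mentioned earlier); actual drive specifications should be substituted.

    import math

    # Rough estimate of the tape drives required to sustain an aggregate backup rate.
    def drives_needed(aggregate_mb_per_sec, drive_native_mb_per_sec=5.0,
                      compression_ratio=1.4):
        # With compression, each drive can absorb host data somewhat faster than
        # its native media rate (illustrative figures, not guarantees).
        per_drive = drive_native_mb_per_sec * compression_ratio
        return math.ceil(aggregate_mb_per_sec / per_drive)

    # Example: 40 MB/second of backup data needs about 6 drives at ~7 MB/second each.
    print(drives_needed(40))

    # Each drive must still be fed above its streaming floor (e.g., 3.5 MB/second
    # for the DLT 7000) or it will back-hitch; multiplexing slow clients helps.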
How Are the Tape Devices Distributed?
Are all tapes on the master server? If all tape devices reside on the master server and the bulk of the data is elsewhere, the network needs to support the transfer rates necessary to move data from the remote clients to the centralized backup server. This configuration often simplifies day-to-day management at the cost of a complex networking infrastructure. As noted previously, networks are traditional bottlenecks for backup applications, and need to be configured for optimal performance.
Are tape libraries attached to important servers? An effective backup architecture is to add tape devices to servers where large quantities of data reside, and task them with being backup slave servers, centrally managed from the master server. With this architecture, the only information that is communicated over the network between master and slaves is the file record information, about 200 bytes per file backed up. Both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages support this option.
How close are the tape drives to the data? The proximity of the tape drives to the data is usually an issue of network bandwidth, because shorter network distances tend to be covered by higher-speed network links. If the tape devices and data are separated by hundreds of kilometers, the link bandwidth is likely to be low. In contrast, if they are located in the same data center, it may be simple to configure a point-to-point link, dedicated for backups, between the two. This is mainly important when deciding where to locate the master server in a widely distributed enterprise, because the network architecture and data locations tend to be fixed. A general guideline is to locate the master server as close as possible to the bulk of the data, and ideally close to a central location in the network topology.
What are the temperature and humidity like? Tapes perform best in moderate temperatures and relatively low humidity. The operating temperature affects things like tape tension and strength, drive part tolerances, and temperature of internal electronic components in the drive. Humidity may affect the longevity of the magnetic coating on the tape. This is because high humidity causes the surface of the tape to become gummy. The ideal operating conditions tend to be listed as part of the media packaging. For example, the DLT CompacTape IV lists operating conditions as 10-40 degrees C, storage as 16-32 degrees C, and humidity between 20-80%. Long-term archive storage (20+ years) requires even more stringent conditions.
How often are the drive heads cleaned? Drive heads need to be cleaned periodically because they pick up deposits with continual use. This is usually accomplished by inserting a cleaning tape. Tapes operating in dirty conditions (e.g., near printers) need to be cleaned more frequently, as do drives that operate outside of environmental specifications. Brand new tapes tend to have some manufacturing debris on the surface, and drives that frequently use brand new tapes should also be cleaned frequently. Both backup software and tape library hardware are capable of automatically inserting cleaning tapes after a certain number of uses.
How old are the drives and tapes? As they get older, tape drives tend to wear out and encounter errors more frequently. Each tape technology has an associated MTBF (mean-time-between-failures), and media has a certain rated number of passes before it is expected to wear out. These statistics, available from the manufacturers, tend to be optimistic.
The Data's Path
Are Data and Tape Local to Backup Servers?
Is the filesystem buffer cache used? Backups are more efficient when avoiding the filesystem buffer cache. The buffer cache can be bypassed by either using Direct I/O to access individual files, or backing up the raw volume rather than the filesystem.
How much system memory exists? Backup relies on system memory in two capacities. Primarily, it is used for shared memory regions used to implement interprocess communication between various backup/restore processes. Memory is also used when buffering filesystem data in the virtual memory cache. If data is cached in virtual memory faster than old pages can be purged, the system may begin to thrash. More memory temporarily forestalls this condition. However, if the system is in a condition where data is cached faster than purged, it will likely thrash at some point during the course of a long backup.
The most elegant solution is to avoid the buffer cache in the first place, but if that is impossible, the planner needs to tune the memory reclaim rates to be more aggressive. In addition, to improve I/O to the swap device, they can also stripe swap across multiple spindles. This may eliminate thrashing, or at least reduce its impact.
What software is being used? The software used determines the overall efficiency with which data is moved from disk to tape. Both Sun StorEdge Enterprise NetBackup and Solstice Backup Power Edition software packages move data very efficiently, but Solstice Backup Network Edition software is a little less efficient. The Solaris Operating Environment utilities such as tar and ufsdump are not particularly efficient and should not be used to implement enterprise backup solutions.
How much shared memory is available? The amount of shared memory the system can allocate is controlled in the /etc/system file. This setting determines the memory available for interprocess communication (IPC) between the reader and writer processes in the system. For efficient backup and restore, a certain amount of shared memory should be configured per device and data stream.
What are the TCP tunings like? Tuning various parameters for the TCP kernel helps determine the buffer sizes used by the system, and the speed that closed connections in various TCP wait states are flushed from the system.
Are Data and Backup Server(s) Distributed on the Network?
What kind of network is it? Not all networks behave similarly, although all networks tend to be described in terms of their bandwidth. Different networking technologies have different properties. Ethernet variants tend to be inexpensive and common, but their range tends to be limited to local area networks. Within local area networks, there are various topologies that have different performance characteristics (e.g., switched to the desktop vs. hub vs. shared segment).
In addition, the nature of Ethernet causes overall bandwidth to degrade as more nodes are active on the network simultaneously. ATM (asynchronous transfer mode) and FDDI (fiber distributed data interface) networks have longer ranges and degrade more gracefully under heavier loads. However, they use fiberoptic connections, which make them less common and more expensive to install. Gigabit Ethernet and Sun Quad FastEthernet are growing in popularity due to their familiarity and ease of management, but are still not common in existing enterprises.
What is the available network bandwidth? A typical enterprise network consists of multiple segments and various network technologies. The available network bandwidth from one client to another may be vastly different. The planner must estimate the available bandwidth for each key path between backup server and client. This often entails constructing a detailed map of the enterprise network, which may not be available or up to date; obtaining this information can require several days of work.
How many simultaneous clients are sharing it? The more clients that are active at once on a network segment, the more likely the network is to become overloaded. However, when there are more clients, the level of multiplexing to the tape drives can be increased. This allows the drives to keep streaming even when a single client is too slow to feed data to the tape at a sufficient rate.
Enterprise Backup Requirements
Backing up the data in a certain period of time
What Is the Backup Window?
What is backed up?
How much data needs to be backed up (full and incremental)? The other part of the equation is the amount of data that needs to be backed up. For consistency and recovery purposes, the ideal backup saves the full set of data. The down side is that the full dataset is usually very large, consuming a lot of time and tape capacity. Most installations choose to perform full backups occasionally, and supplement those with more frequent incremental backups that record only the data that changed.
Sun StorEdge Enterprise NetBackup software offers a number of incremental backup options. Differential backups record files that changed since the last backup (either full or incremental). Cumulative backups record all files that changed since the last full backup. A drawback of cumulative backups is that they usually record more data than differential backups. An advantage is that restoring requires retrieving only from the last full and the last cumulative, rather than fetching the last full and potentially many differential images.
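To make the restore-time difference concrete, the sketch below lists which images a restore must read under each scheme; the one-full-plus-daily-incrementals week is a made-up example.

    # Which backup images must be read to restore current data, assuming one full
    # backup followed by daily incremental backups (made-up example schedule).
    def restore_chain(scheme, days_since_full):
        if scheme == "differential":
            # The full image plus every differential taken since that full.
            return ["full"] + [f"differential day {d}" for d in range(1, days_since_full + 1)]
        if scheme == "cumulative":
            # The full image plus only the most recent cumulative image.
            return ["full", f"cumulative day {days_since_full}"]
        raise ValueError(scheme)

    print(restore_chain("differential", 5))   # full + 5 differential images
    print(restore_chain("cumulative", 5))     # full + 1 cumulative image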
Solstice Backup software offers similar mechanisms, including multiple levels of cumulative backups similar to the levels used by ufsdump(1M). By knowing the potential backup targets of the software and data usage patterns at the site, the planner can estimate approximately how much data will be saved during each type of backup. A target data rate that the backup system should plan to achieve can be obtained by dividing that amount by the estimated time available. Various margins of error can be built into the calculations for added control.
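For example, the target-rate calculation might look like the following sketch; the 500 GB and 40 GB dataset sizes and the eight-hour window are made-up numbers, and a margin of error is folded in as suggested above.

    # Target data rate the backup system must sustain to fit the backup window.
    def target_rate_mb_per_sec(data_gb, window_hours, margin=0.8):
        # 'margin' discounts the window to build in a margin of error
        # (here only 80% of the nominal window is assumed usable).
        usable_seconds = window_hours * 3600 * margin
        return (data_gb * 1024) / usable_seconds

    # A weekly 500 GB full backup in an 8-hour window...
    print(f"full:        {target_rate_mb_per_sec(500, 8):.1f} MB/s")   # ~22.2 MB/s
    # ...and a nightly 40 GB incremental in the same window.
    print(f"incremental: {target_rate_mb_per_sec(40, 8):.1f} MB/s")    # ~1.8 MB/s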
What Is the Acceptable Impact of Performing the Backup?
Is data unavailability acceptable? The planner's central consideration is whether data can be kept from the users for some period of time. If it can, the planner needs to determine how that dedicated time might be best used to perform the backup. If data can be unavailable for some length of time, it is usually possible to back it up faster than keeping it on-line. This may be in the form of shutting down any databases and backing up the raw volume, or unmounting any filesystems and backing up the underlying devices.
Is degraded performance acceptable? If data needs to be continually available but the overall performance of the system may be somewhat degraded, one choice is to continue backups concurrent with user activity. There are a number of mechanisms for on-line backup, and each has a different degree of impact on performance. The planner needs to assess the trade-offs and choose the best possible compromise.
How long is degraded performance acceptable? If data unavailability or degraded performance is acceptable, the planner needs to determine the period of time that must not be exceeded. This period is usually shorter for unavailability than for degraded performance, but lower performance may lead to overall lower productivity, and thus should be minimized.
If databases are used, are appropriate modules available? Not all commercial databases have corresponding backup modules for the Sun StorEdge Enterprise NetBackup and Solstice Backup software packages. If hot backups of a database or some other high-level system are needed, the planner must verify that an appropriate module is available.
What Availability Concerns Should the Solution Address?
Is it critical to minimize impact of user or operator error? If the major concern being addressed is loss of individual files, the solution should be designed to retrieve the file quickly and with minimal effort on the part of the administrator. Minor issues include tape storage, duplicate media, and offsite import/export. An important issue is backup frequency, because the copy on tape should be as close as possible to the file's final state. The level of multiplexing can be high, because the overall throughput is not an issue when retrieving a small set of files, unless the files are very large.
To address such issues, planners may choose to use disk-based rather than tape solutions. Such solutions include, for example, keeping a third mirror of the volume off-line and readable in case something needs to be retrieved, or backing up important files to a disk directory rather than tape.
Is it critical to minimize impact from loss of equipment? If the goal is to minimize the impact of failed hardware (e.g., disk-head crash), backups can be structured to keep data from the same equipment arranged on the same set of media, and perhaps to duplicate the media. This would minimize tape fetch time from data that spans several tapes. To reduce the chance of losing data to failed hardware, RAID software or devices can be used.
Impact from hardware failure also relates to highly available and clustered configurations. Configuring backup for these environments is potentially difficult and requires some experience. In situations where the entire system needs to be highly available, the best solution may be to engage specialty contractors such as Comdisco.
Is it critical to minimize impact in case of disaster? Disaster recovery and preparations need to encompass all aspects of the operation. These aspects range from frequent training for data center personnel, to using customized scripts for the backup software. The more common steps, however, are to keep multiple copies of media, one local and another archived at a remote site. Another option is to have another site where the data is imported by the backup software and ready for a restore. Some companies choose to have a "hot site" available to go on-line within a few minutes of a disaster, where the configuration has the same capabilities as the original site.
People often expect either the 2:1 compression ratio frequently quoted in the tape hardware literature, or compression ratios similar to those they see with utilities like compress(1) or GNU zip. In the past, the 2:1 number was sometimes touted as "typical", but in truth it was typical only of the special test patterns manufacturers use to test their algorithms. When compressing diverse types of data in the field, the compression ratios were often lower. If capacity planning was done expecting 2:1 compression, the system was often inadequate to the task.
Another typical compression mistake is to compress the target data on the system using compression programs, and then use the observed compression ratios to estimate hardware compression. This mistake stems from the different natures of hardware and software compression. Compression utilities can use all of the system memory to perform the compression, and are under no time constraints. Hardware compression is limited to the hardware buffer size, and must compress in real time. The compression ratio observed with software utilities will therefore usually be much better than the drive hardware can deliver, and capacity plans based on those numbers can leave the system inadequate.
Compression ratios for various types of data (as observed in simple tests) are shown in Table 1. For hardware compression, the more "typical" compression ratio to expect is closer to 1.4:1, although some data types do better. If attempting to save data with little to no redundancy (e.g., compressed video like MPEG or MJPG), it is better to turn compression off in the drive. In addition, the compression mechanism has two effects: the first is to speed up the rate at which data is processed by the device, and the second is to compact the data written to tape so that the tape can hold more information.
The software also writes a certain amount of metadata to tape in order to keep track of what is being written where. This metadata tends to be minor in relation to the dataset size. Simple tests indicate that the metadata written to tape by Sun StorEdge Enterprise NetBackup and Solstice Backup software packages is typically below 1%. Other software (e.g., ufsdump) may write more metadata to the tape, depending on the format used.
The main reason for this performance discrepancy appears to be the nature of writes versus reads. For various reasons, writes to stable storage often take longer than reads. There is also more frequent demand for writes to be performed synchronously (in order to guarantee consistency). For example, creating files requires several synchronous writes to update the metadata keeping track of the file information. Those updates need to be performed in order to preserve file integrity.
Another component of the longer restore times is the browse delay introduced at the start of the request. When a restore request is initially issued, the software needs to browse the file record database and locate all records that need to be retrieved. This may take some time for large databases containing millions of records.
The situation is even more complicated for multiplexed restores. This is because the software usually waits to make sure all requests are received before initiating the actual restore. Alternatively, it may go back to retrieve files that were requested after the restore had already begun. This occurs in order to resynchronize the retrieval of file data intermingled on the same length of media. Otherwise, the restore operation needs to be serialized, constantly rewinding the tape to get each additional backup stream.
Ease of Use and Training Requirements
It is naive to expect to take the software out of the shrink-wrap, uncart the hardware, and put together a well-tuned backup solution. On top of careful planning, even moderately complex backup installations call for trained and experienced personnel to install, configure, and tune the various components. This usually takes several days of dedicated effort. For the most complex installations, it may take multiple weeks to have everything optimally running.
The most successful approach is to bring in experienced consultants (e.g., Sun, Veritas, or Legato professional services) to install and configure the system for current needs, and to teach on-site personnel the basics of maintaining and operating the configured system. The on-site personnel then need to develop in-depth knowledge to be able to modify the configuration to meet increasing demands; this can be achieved through further training or other means. Meeting on-going demands is certainly also possible through long-term contracts with the consulting services that initially configured the systems.
Measurements and Calculations
It is usually easiest for the planner to start from scratch and plan for additional new networks dedicated to backup. Unfortunately, adding these may involve pulling additional wiring between distant corners of the enterprise, which is far more expensive than the purchase of a few switches and adapters. To meet the new backup demands, most planners need to understand how to efficiently use the existing network infrastructure.
Most realistic environments will already contain significant investment in network infrastructure that can be leveraged for backup and recovery. With the high cost of installing additional wiring, it is usually preferable to strategically place backup servers to use these existing networks.
The planner's first step is to sketch a map of the existing network. Many enterprises will already have such a map, or know who to turn to for this information. The goal is to produce a map showing all relevant network links in relation to one another. These links are then labeled with their expected available bandwidth during the projected backup window. The full bandwidth of the link may not be available for backup, because the networks are usually shared with other users. Network administrators often keep usage statistics that may point to a time when the networks are nearly idle, an ideal time to perform backups. The planner also needs to note how much backup data is located on each segment, and on which machines it resides. Systems that hold a large concentration of data may turn into hotspots and need to be accounted for in the plan.
Once a map of the existing network infrastructure is available, the planner needs to locate the most central point in the network. This is the one that has the most access to plenty of bandwidth and minimizes the overall number of hops to the data. This central point is the ideal place to locate the master server from the standpoint of the network. (Administrative issues may be a different story.) Once the master server is placed, the planner needs to estimate the available bandwidth from the various key data sources to the master server.
If the above estimation process shows that the network would be a bottleneck, the planner needs to consider adding slave servers to various network segments. One situation the planner needs to consider is when there are a few systems that hold the bulk of the data on that segment, and one system in particular is either the largest or least busy. In this case, the planner must consider converting that machine into a slave server by adding a tape subsystem that is sufficient to service the local backup needs. If no such machine is available and all existing machines are fully utilized, the planner should consider adding an additional system on that segment to be the slave backup server. The slave servers will direct all backup data to themselves, limiting the network transfer to the master server to just the file record information.
Estimating Available Bandwidth
Another approach the planner can use is to measure the available bandwidth across key points of the network. There is no fail-proof way of doing this, because most bandwidth measurement tools are invasive, and may or may not mimic the type of load applied by the application in question. Perhaps the easiest way for the planner to measure the available bandwidth is to use the ftp utility to transfer a large test file between the two points of interest and note the rate it reports.
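A minimal Python sketch of the same idea follows; the host name, credentials, and test file path are hypothetical placeholders, and the measurement is only as representative as the test file is large.

    # Estimate available bandwidth by timing a bulk FTP transfer and discarding the data.
    import time
    from ftplib import FTP

    ftp = FTP("backup-server.example.com")       # hypothetical host
    ftp.login("planner", "secret")               # hypothetical credentials
    ftp.voidcmd("TYPE I")                        # binary mode so SIZE reports bytes

    size_bytes = ftp.size("/export/testfile")    # large test file staged beforehand
    start = time.time()
    ftp.retrbinary("RETR /export/testfile", lambda chunk: None)   # discard the data
    elapsed = time.time() - start
    ftp.quit()

    mbps = (size_bytes * 8) / (elapsed * 1_000_000)
    print(f"Approximate available bandwidth: {mbps:.1f} Mbps")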
Once the bandwidth across key routes is estimated, the planner needs to compute the time necessary to transfer the data from the source to the destination; a small worked example follows the note on units below.
One thing the planner needs to remember is that the units used to describe network and storage bandwidth tend to be different. Network bandwidth is usually listed as Mb/sec or Mbps, and refers to 1000 x 1000 bits per second. In contrast, storage bandwidth is usually listed as MB/sec or MB/s, and refers to 1024 x 1024 bytes per second. For example, storage bandwidth of 1 MB/s is equivalent to network bandwidth of 8.39 Mbps.
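A small sketch of that calculation, minding the unit difference just described; the 100 GB dataset and the 80 Mbps of available bandwidth are made-up figures.

    # Time to move a dataset across a network link, keeping the units straight:
    # network bandwidth in Mbps (1,000,000 bits per second),
    # storage sizes in GB (powers of 1024).
    def transfer_hours(data_gb, link_mbps):
        data_bits = data_gb * 1024**3 * 8
        seconds = data_bits / (link_mbps * 1_000_000)
        return seconds / 3600

    # Example: 100 GB of client data over a link with 80 Mbps actually available.
    print(f"{transfer_hours(100, 80):.1f} hours")    # ~3.0 hours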
When the disks are combined into logical volumes, RAID performance principles also come into play. Stripes and mirrors tend to aggregate the performance of their component disks, so it is often sufficient to simply check that an adequate number of spindles are configured. RAID-5 volumes are more complex, but tend to have good read performance. If large volumes of data need to be restored quickly, RAID-5 volumes need to be carefully configured so that the RAID segment size matches the restore I/O size for optimal performance. To make sure that the configuration is satisfactory, the planner needs to test the restore performance of the RAID-5 volumes; otherwise, they could be unpleasantly surprised when trying to restore in an emergency. If long-term backup performance appears to be problematically slowed by the disk subsystem, the planner can consider upgrading or reconfiguring the storage.
Estimating Available Bandwidth
For read performance, a rough estimate of the raw bandwidth available from a logical volume is N multiplied by the sequential transfer rate of a single spindle, where N is the number of data disks.
For write performance, the planner should use half that value as the estimate. This gives a rough estimate of the raw performance available directly from the disks, assuming no bus or channel bandwidth limitations have been reached. The planner can use Table 3 to estimate the spindle rates and the overall abilities of the storage arrays in question.
For any logical volume configuration, if filesystems are used with Direct I/O on top of the logical volumes, the planner needs to reduce the value by an additional 10% for reads and 15% for writes. If for some reason Direct I/O cannot be used, the planner can divide the calculated raw value by 2 for reads, and by 3 for writes.
Once the available disk bandwidth is calculated for all logical volumes, the planner should consider how the volumes are laid out on top of the multiple busses and I/O channels (e.g., SCSI, FC-AL, SBus). If the aggregate volume bandwidth exceeds the bus bandwidth, they can assume all logical volumes can share the bus equally, and divide the available bus bandwidth among the competing volumes.
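These rules of thumb can be strung together into a rough estimator. The sketch below follows one reading of the rules above, and the 8-disk volume and 10 MB/second per-spindle rate are placeholder figures standing in for the values the planner would take from Table 3.

    # Rough logical-volume bandwidth estimate from the rules of thumb above.
    def volume_bandwidth_mb_per_sec(n_data_disks, spindle_mb_per_sec,
                                    write=False, direct_io=True):
        raw = n_data_disks * spindle_mb_per_sec       # aggregate of the data spindles
        if write:
            raw /= 2                                   # writes: use half the read estimate
        if direct_io:
            raw *= 0.85 if write else 0.90             # filesystem with Direct I/O
        else:
            raw /= 3 if write else 2                   # filesystem without Direct I/O
        return raw

    # Example: an 8-disk volume at ~10 MB/second per spindle (placeholder figures).
    reads = volume_bandwidth_mb_per_sec(8, 10)                 # ~72 MB/s
    writes = volume_bandwidth_mb_per_sec(8, 10, write=True)    # ~34 MB/s
    print(f"reads ~{reads:.0f} MB/s, writes ~{writes:.0f} MB/s")

    # If several such volumes share one bus and their aggregate exceeds the bus
    # bandwidth, divide the available bus bandwidth equally among the volumes.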
Another method the planner can use for estimating available bandwidth is to measure it using simple tools. The easiest tool to come by is dd(1), which can easily generate a sequential stream of accesses to either a file or a raw device. To test potential backup performance, the planner can create a large file on the source disk subsystem. On the host, they can time a dd process reading from the large file and writing to /dev/null, using block sizes similar to those of the backup software (a 64 KB block size is a good guess). The planner can divide the file size by the time it took to read all the contents, obtaining an approximation of the disk bandwidth. If the raw device bandwidth is required, they can read a certain number of blocks from the raw device, and compute the rate based on the number of blocks transferred rather than the file size.
If restore performance is the goal, they can write to a file from /dev/zero to gauge filesystem performance. Writing to a raw device only works if there is no valid data on that device. (The planner must take caution when writing to a raw device using dd, because this will likely destroy any data on that device.) These estimates are likely to be higher than the actual performance during backup, so they can use perhaps 80% of the measured value in planning.
A more accurate method would be to use the actual programs used by the backup software and direct their output to /dev/null. This measures the exact data access load on the disk in isolation from other potential bottlenecks, such as networks and tape drives. The exact invocation varies from package to package and filesystem versus raw device. The software CLI documentation should provide the necessary details to conduct this test, although this method is most useful when troubleshooting or tuning an existing installation. This is because it requires the software and data to already be in place.
Lastly, it is generally not a good idea to measure backup performance using standard Solaris utilities like tar or ufsdump. These programs are not especially tuned for high performance, and may bottleneck somewhere other than the disk subsystem.
When trying to back up an existing enterprise with multiple slow clients, networks, or logical volumes, the planner should configure multiplexing in the backup schedules. This allows each tape device to be fully utilized. Each multiplexed stream uses a finite amount of resources (e.g., TCP ports, buffers, CPU) on the server, so the total number of backup streams handled by a server simultaneously should be kept below approximately 120.
Estimating Available Bandwidth
It is also important for the planner to consider the SCSI bus bandwidth. Generally, they should plan on a maximum of two or three tape devices per SCSI bus, and should not mix tape and disk devices on the same bus. Empirical tests of tape bandwidth can be easily accomplished using the dd(1) and mt(1) commands, although access to library robotics from the system requires additional software to drive the robot.
To simplify the configuration of CPU capacity, the planner can estimate how many CPU cycles are needed to move data at a certain rate. Simple experiments have shown that a useful, conservative estimate is 5 MHz of UltraSPARC CPU capacity per 1 MB/second of data that needs to be moved. This means that for every MB/second of data moved (whether over the network, from disk, or to tape), the system should have 5 MHz of processing power available for the transfer. For example, a system that needs to back up a number of clients over the network to local tape at a rate of 10 MB/second would need 100 MHz of available CPU power: 50 MHz to move data from the network to the server, and another 50 MHz to move data from the server to the tapes. This would keep a 300 MHz UltraSPARC processor at 33% utilization. As another example, a system that needs to back up a database residing on local disks to a local tape device at a rate of 35 MB/second would need 350 MHz of available CPU power. The actual software overhead is small, and is included in the 5 MHz per MB/second figure.
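This rule of thumb lends itself to a small calculation; a minimal sketch, assuming the data makes two hops through the server (in from network or disk, out to tape):

    # CPU sizing rule of thumb: ~5 MHz of UltraSPARC capacity per 1 MB/second moved,
    # counted once for each hop the data makes through the server.
    MHZ_PER_MB_PER_SEC = 5

    def cpu_mhz_needed(mb_per_sec, hops=2):
        # hops=2 covers the common case: data in (network or disk) plus data out (tape).
        return mb_per_sec * hops * MHZ_PER_MB_PER_SEC

    print(cpu_mhz_needed(10))   # 100 MHz, i.e. ~33% of a 300 MHz UltraSPARC processor
    print(cpu_mhz_needed(35))   # 350 MHz for the 35 MB/second database example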
Capacity planning is not a straightforward procedure; it requires knowing how to efficiently use the network infrastructure, and understanding network performance and bandwidth issues. The planner needs to configure the network for optimal backup and recovery performance. And because networks are traditional bottlenecks for backup applications, the capacity planner often needs to choose the preferred bottleneck.
While the process is complex, the planner can follow a series of guidelines and use a number of available tools and methods to obtain the information necessary for making good decisions. The planner first needs to assess the environment to be backed up. This includes obtaining the following information: (1) the data type, (2) the file structure, (3) the data origin, (4) the data destination, and (5) the data's path. The planner also needs to know whether data and backup servers are distributed on the network, and if so, how they are distributed.
Knowing the backup requirements of the enterprise is also essential. This includes determining the time period available for backups, the needs for restoring the data, and ways to limit the impact of the process on day-to-day operations.
Finally, the planner needs to maintain realistic expectations. This means accounting for data overhead caused by additional metadata, understanding the ease of use in the backup process and assessing training requirements, and understanding the recovery performance.
database management system (DBMS)
interprocess communication (IPC)
point-to-point protocol (PPP)