XX DISASTER RECOVERY PLAN

 Document Information

Version: v1

1-Introduction

This Disaster Recovery Plan (DRP) captures, in a single repository, all of the information that describes XX’s ability to withstand a disaster as well as the processes that must be followed to achieve disaster recovery.

2-Definition of a Disaster

A disaster can be caused by human action or natural events and results in XX's IT function not being able to perform all or some of its regular roles and responsibilities for a period of time. XX defines disasters as the following:

  • One or more vital systems are non-functional
  • The building is not available for an extended period of time but all systems are functional within it
  • The building is available but all systems are non-functional
  • The building and all systems are non-functional

The following events can result in a disaster, requiring this Disaster Recovery document to be activated:

  • Pandemic
  • Power Outage
  • Theft
  • Terrorist Attack
  • Ransomware/Virus

3-Purpose

To ensure XX's continuity of service and product to its clients, it is intended that the Business follow the steps in this DRP to restore services to business-as-usual as quickly as possible. This includes:

  1. Preventing the loss of the organisation's resources such as hardware, data and physical IT assets
  2. Minimising downtime related to IT
  3. Keeping the business running in the event of a disaster

This DRP document will also detail how this document is to be maintained and tested.

4-Scope

The IT-DRP provides an IT risk and mitigation reference in addition to technical and operational information. Any processes and actions described are only applicable during a designated major incident, which is to be determined in conjunction with XX's Business Continuity Plan.

This document is intended as a high-level approach and does not specifically itemise all of the steps required during response/recovery for any individual system.

This IT-DRP should consider all technologies used within XX including new and emerging ones. The top-level areas of consideration are:

Physical Infrastructure

  1. Internet connectivity
  2. Network – all technologies (LAN, Wi-Fi, RF, Bluetooth, IoT, Proprietary, etc)
  3. Servers & Storage
  4. Client devices & peripherals (Computer, USB devices, etc)
  5. Organisational and Support Devices (Dedicated systems, Printing, etc)
  6. Telephony and Communications
  7. Building Management & Digital Signage
  8. Software Systems – Cloud provisioning
    1. Infrastructure-as-a-Service (IaaS)
    2. Platform-as-a-Service (PaaS)
    3. Software-as-a-Service (SaaS)
    4. Data Storage v. Databases

5-Definition of Services & Systems

From an IT perspective, XX aims to operate as a 100% cloud-based organisation. This means that (wherever possible) applications, data & facilities are leveraged wholly through offsite mechanisms.

This approach enables an extremely light-weight and versatile infrastructure with no requirement for on-site servers or data centres, thus reducing maintenance and upgrade requirements. The implementation and operation of a zero-trust network further mitigates against certain types of risks.

Although XX does not currently operate servers or a data centre, this may become a necessity at some point in the future; for this reason, the associated risks, mitigations and procedures should be mentioned and expanded upon when appropriate.

Telecommunications within XX are also provided through cloud services using IP connectivity.

Use of IT within XX is broadly attributable to two primary categories:

  1. Operational Services: organisational and administrative functionality delivered by employees, contractors, and agents of XX.
  2. XX Client Platform: participatory functionality received by clients.

By operating as a 100% cloud-based organisation, XX has a very high likelihood of being able to continue functioning and providing these primary IT services through many physical disaster scenarios, including total loss of buildings. The ability to utilise services from any internet connection provides a huge benefit in terms of both business continuity and educational delivery.

With this model, the emphasis for the IT-DRP therefore shifts to the appropriate protection of cloud services and internet provision.

6-Definition of Mitigation Measures

6.1 Secondary Service provision

  1. Where possible, a secondary service provision should be utilised to enable failover (whether automatic or manual) from one service to another.
  2. For Cloud Applications, Software-as-a-Service delivery and Virtual Machine environments, this would be in the form of multiple vendor Cloud provision. The primary Cloud vendor for XX is Microsoft utilising the Azure platform and Office 365.

6.2 Resilient diverse connectivity

  1. Multiple device utilisation / Overcapacity / Redundancy / High Availability
  2. The use of numerous identical devices to deploy services enables equipment to be swapped out and exchanged. This is an important facility during issue diagnosis and hardware failure but also allows for rapid resolution in critical areas over less important ones.
  3. Overcapacity planning provides an opportunity to redeploy resources in the event of an issue and helps to mitigate against unexpected peaks in utilization.
  4. For mission critical or key hardware, redundant replacement devices provide the quickest resolution time scales.
  5. All of these processes should be utilised in conjunction with appropriate cost analysis.
  6. High Availability is provided by a combination of the above and is particularly relevant in cloud-oriented service deployment. By provisioning server farms, access gateways and load balancers, workloads can be distributed between facilities and, in the event of a failure, can be directed, often seamlessly, to the remaining working component parts. These high availability components can be separated geographically to help mitigate against widespread incidents.

6.3 Alternate Location (Inherent ability)

  1. The inherent ability to operate from an alternate location is a key concept in the mitigation against many physical location-based incidents. By operating with a cloud-first philosophy and running a zero-trust network model with extensive wi-fi implementation, many of the core administration and operational functions of XX benefit from this inherent ability.

6.4 Intrusion Detection, Conditional Access, Active Monitoring & Appropriate permissions

  1. An important capability of the zero-trust network model is that of connection monitoring. This is used in a number of ways, such as to provide information on utilisation, load and distribution, but also to assist with intrusion detection and to ensure that only approved devices that have gone through an onboarding process are able to work on XX networks.
  2. Conditional Access is constantly run on every request for information from the array of cloud services (whether email, document storage, records, etc.). The assessment of a device's security status, location, operating system, antivirus status, etc., in conjunction with the requesting user's status and key metrics such as atypical behaviour and impossible travel, is used to form a real-time conditional accept/reject decision. Data is only returned to the requester provided all criteria are satisfied (an illustrative sketch of this decision logic follows after this list).
  3. Appropriate permissions and restrictions must be set according to users' job roles and organisational requirements. No default open access should exist.
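The following is a minimal, illustrative sketch of this kind of real-time accept/reject decision. It is not the actual Conditional Access engine used by our cloud providers; the signal names, approved-country list and example values are hypothetical and shown purely to make the decision logic concrete.

    from dataclasses import dataclass

    @dataclass
    class AccessRequest:
        """Signals gathered for a single request to a cloud service (hypothetical model)."""
        device_compliant: bool     # device has been onboarded and meets the security baseline
        antivirus_healthy: bool    # endpoint protection reporting as up to date
        os_supported: bool         # operating system version is still supported and patched
        country: str               # coarse location derived from the request
        atypical_behaviour: bool   # sign-in pattern flagged as unusual
        impossible_travel: bool    # two sign-ins too far apart to be physically possible
        user_enabled: bool         # requesting account is active and in an approved role

    APPROVED_COUNTRIES = {"GB", "IE"}  # hypothetical policy value

    def conditional_access_decision(req: AccessRequest) -> bool:
        """Return True only if every criterion is satisfied; otherwise the request is rejected."""
        return all([
            req.device_compliant,
            req.antivirus_healthy,
            req.os_supported,
            req.country in APPROVED_COUNTRIES,
            not req.atypical_behaviour,
            not req.impossible_travel,
            req.user_enabled,
        ])

    # Example: a compliant device from an approved location is allowed.
    request = AccessRequest(True, True, True, "GB", False, False, True)
    print("ALLOW" if conditional_access_decision(request) else "DENY")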

6.5 Physical protective measures

  1. Cable locations – whether internal or external to buildings – should be physically protected with suitable conduits and covers and where possible located out of reach.
  2. Wireless access points (WAP), CCTV and other surface-mounted equipment should be ceiling mounted or at high level on walls, and where appropriate enclosed in tamper/vandal proof enclosures.
  3. Although not necessary in all locations, it should be assessed whether network ports in dado/trunking or wall boxes should have lockable capability to guard against accidental or malicious removal, or attempted access to the network infrastructure.
  4. Door Access Security. It is important to ensure that areas containing key infrastructure components have controlled access.
  5. All network cabinets should be lockable, and keys should be kept in at least 2 places to ensure availability in the event of emergency access being required.

6.6 Backup & Replication

  1. An appropriate backup plan must be implemented for all systems. This is the primary resource that will be used during the recovery process, and in order to meet RTO and RPO criteria it must be tested regularly (see the illustrative check after this list).
  2. There are numerous types of backup process, but it is vital to ensure that chosen methods do not create a new vulnerability or exposure of information.
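As one illustration of what "tested regularly" can mean in practice, the sketch below verifies that a test-restored copy of a backup matches the original via a checksum. It is a minimal example only; the file paths are hypothetical and a checksum comparison does not replace a full restore rehearsal.

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Compute a SHA-256 digest of a file in chunks so large backups do not exhaust memory."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_restore(original: Path, restored: Path) -> bool:
        """Return True if the restored file is byte-for-byte identical to the original."""
        return sha256_of(original) == sha256_of(restored)

    # Hypothetical paths, purely for illustration.
    if verify_restore(Path("live/records.db"), Path("restore-test/records.db")):
        print("Restore test passed: checksums match")
    else:
        print("Restore test FAILED: investigate backup integrity")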

6.7 Policy

  1. Policies such as the Acceptable Use Policy determine the appropriate regulation and control of passwords and restrict access and functionality on devices.

6.8 API Integration

  1. Many cloud-based applications are provided with little or no actual control of the underlying data backup and retention. Where these third-party providers are holding data on our behalf, suitable Service Level Agreements (SLA) are required to ensure that operational continuity is achieved. This is not, however, sufficient for a full DRP.
  2. By ensuring that all cloud applications and services engaged with by XX have a detailed and well-documented Application Programming Interface (API), it is possible to ensure that we not only have visibility of data stored outside of our immediate control, but that we also have a path to retention and backup of that data.
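The sketch below illustrates the kind of scheduled, API-driven export this implies. It assumes a hypothetical SaaS provider exposing a documented REST endpoint; the URL, token handling and output layout are illustrative only and would be replaced by the specific provider's documented API.

    import json
    import urllib.request
    from datetime import datetime, timezone
    from pathlib import Path

    API_URL = "https://api.example-saas.com/v1/records"  # hypothetical documented endpoint
    API_TOKEN = "***"                                     # held in a secrets manager in practice

    def export_records(destination_dir: Path) -> Path:
        """Pull all records from the provider's API and write a timestamped JSON snapshot."""
        request = urllib.request.Request(API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
        with urllib.request.urlopen(request) as response:
            records = json.load(response)
        destination_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        snapshot = destination_dir / f"example-saas-export-{stamp}.json"
        snapshot.write_text(json.dumps(records, indent=2))
        return snapshot

    # Run on a schedule (e.g. daily) so the RPO for this third-party data is bounded.
    # print(export_records(Path("backups/example-saas")))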

6.9 Encryption

  1. The encryption of data both in transit and at rest is a critical requirement for the security of XX data and is documented in the XX Security and Data Storage policy. During a recovery process, due care must be taken to ensure that data security and integrity are maintained, especially around encryption.
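For illustration only, the snippet below shows symmetric at-rest encryption using the widely available Python cryptography library (Fernet). It is a minimal sketch, not XX's actual encryption implementation; in practice the key would be generated and held in a key vault, never stored alongside the encrypted data.

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()    # in practice, generated and stored in a key vault
    cipher = Fernet(key)

    ciphertext = cipher.encrypt(b"client record: example data")  # what is written to disk
    plaintext = cipher.decrypt(ciphertext)                       # only possible with the key
    assert plaintext == b"client record: example data"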

6.10 Remote Wipe

  1. The ability to remotely wipe a device, block or control access to content is built into many mobile devices. This ability can be leveraged to ensure data security during an incident.

6.11 Advanced Threat Protection / Anti-Virus

  1. XX leverages Advanced Threat Protection to actively monitor and report on devices as part of Conditional Access.

7-Definition of Risks

Infrastructure | Description | Risk | Mitigation Measures
Internet Connectivity | Various fibre connections to buildings | Service provider failure; damage to fibre cabling; local hardware failure (routers, switches etc); natural disaster / loss of building integrity; malicious attack (denial of service, hacking etc) | Secondary service provision; resilient diverse connectivity; multiple device utilisation / overcapacity / redundancy; alternate location / inherent ability; intrusion detection; Conditional Access; appropriate permissions
Data Centre and Servers | Virtual Machine estate; PaaS provision | Service provider failure; VM failure or corruption | Backup & replication; secondary service provision; high availability
Wi-Fi and LAN | Estate IaaS provision | Hardware failure; interference; overloading; electrical failure | Multiple device utilisation / overcapacity / redundancy; secondary service provision; resilient diverse connectivity
Email & communication; primary data storage; operational applications; faculty applications | Various SaaS providers | Service provider failure; communication failure; corruption of data; loss of data; data exposure / compromise; commercial exploitation; sabotage / internal attack; erroneous or neglectful activity; inability to operate / application failure | Backup & replication; litigation hold policy; retention policy; secondary service provision; Conditional Access; API integration
Organisational devices; personal devices | Laptops; tablets; desktops; phones; wearable technology; BYOD; home PC | Hardware failure; data exposure / password compromise; loss / theft; inability to operate; external devices (USB stick); virus / malware / attack | Overcapacity / redundancy; local drive encryption; Conditional Access; password policy; remote wipe capability; device restriction policy; advanced threat protection; active monitoring; appropriate use policy
Support Devices | Multi-function printers; scanners; desktop printers | Hardware failure | Overcapacity / redundancy
Business Devices | Dedicated resources | Hardware failure | Resilient diverse connectivity
Security | Door access; CCTV | Hardware failure; vandalism / sabotage | Manual operation / override; physical protective measures

 

8-Recovery Time Objectives / Recovery Point Objectives

8.1 For each service or data source, there are two parameters: the time it takes to recover the service or data to the required operational state (the Recovery Time Objective, RTO) and the potential maximum amount of data that will be lost as a result of the incident (the Recovery Point Objective, RPO).

8.2 These two time parameters are a property of the backup and replication processes that are undertaken; therefore, in order to meet organisational requirements, there may be more than one type of process defined and implemented.

8.3 The RPO for a service is concerned with the allowable amount of data loss in the event of a disaster. If the RPO is 24 hours, then all data produced by the service must be backed up (including the time taken for the backup) at least every 24 hours to ensure that this objective is met.

8.4 RPOs are based on data only and are generally met by automated processes. There is usually a direct cost relationship.

8.5 The RTO for a service or data is concerned with how long a service or data can be unavailable before causing irreparable damage to the organisation. If the RTO is 24 hours, then the service must be restored within that time in order to be within requirement.

8.6 Since the RTO is generally associated with a whole operational capability rather than just the data, there is a higher cost relationship with a more demanding RTO. The process is nearly always manual and, because restore times can vary depending on the time of day and other business loads, it is important to ensure that an RTO is achievable. If an RTO is 2 hours and it takes 4 hours to restore a service at peak times, then it will never be achievable.
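To make the two parameters concrete, the sketch below checks a measured backup interval against an RPO and a rehearsed restore time against an RTO, using the 24-hour and 2-hour figures from the examples above; the numbers are illustrative only.

    from datetime import timedelta

    def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
        """Worst-case data loss equals the gap between backups, so it must not exceed the RPO."""
        return backup_interval <= rpo

    def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
        """A restore rehearsal must complete within the RTO, including at peak times."""
        return measured_restore_time <= rto

    print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))  # True: daily backups satisfy a 24-hour RPO
    print(meets_rto(timedelta(hours=4), timedelta(hours=2)))    # False: a 4-hour peak-time restore can never meet a 2-hour RTO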

8.7 The RTO and RPO parameters for each service are documented below.

8.7.1 Resiliency/Availability

  1. Service availability for M365 services is as per the Microsoft 365 Service Level Agreement. The M365 cloud service is designed to be resilient and highly available, but Microsoft provides no contractual recovery time objective (RTO) or recovery point objective (RPO) for data.
  2. In the event of an outage to the M365 Exchange Online service (only), we would utilise <Product> Continuity Mode, which will continue to provide access to email services for the duration of any outage to Exchange Online.
  3. The SLAs provided for M365 data recovery will only cover M365 data that is protectable. Protectable data details are complex and subject to change as the M365 SaaS services evolve, but currently protectable data comprises Exchange Online (all data), OneDrive*, Teams*, SharePoint Online*, M365 Groups*, and Project Online*. *Caveats, limitations and exclusions apply.
  4. Within the terms of this SLA “minor” data loss is determined as any volume of M365 data below 2GB (examples: single mailboxes, small volumes of data such as files, folders, conversations, etc.)
  5. Within the terms of this SLA “major” data loss is determined as any volume of M365 data above 2GB
  6. Within the terms of this SLA “catastrophic” data loss is determined as very rare/unlikely events which result in comprehensive or complete organisational data loss across M365 services

8.7.2 Recovery/DR

  1. In the event of data loss, data recovery will rely on M365 service availability as a pre-requisite before data recovery can begin, potentially impacting RTO SLAs
  2. Where “minor” data loss has occurred, the achievable RTO will be 4 hours (accounting for the time it takes to log the ticket, get assigned, and restore).
  3. Where “major” organisational data loss has occurred, achievable RTO could be as high as 72 hours (Considering incredibly rare mass data loss events).
  4. Where “catastrophic” organisational data loss has occurred, and Microsoft are unable to facilitate a local data recovery natively, no particular RTO SLA is provided, but such an event may for guidance purposes take > 1 week to fulfil a complete recovery. Improved RTO SLAs could be provided in this scenario for prioritised data if this data is identified in advance of such an outage
  5. Protectable M365 data within Teams, Exchange Online, Sharepoint Online, M365 Groups, and Project Online will have an RPO of 12 hours.

8.8   Overriding security access & permissions are set according to job role within XX and practices are documented in the Security policy.

8.9 During a disaster recovery incident, the authority to action recovery of systems and data lies with the Head of IT/DR Team Leader. In certain severe cases where the impact to XX is high, it may be necessary to elevate authority to the Senior Leadership Team following assessment and evaluation of the costs and timescales involved in the recovery.

Prioritisation, Action & Communication

8.10 The CTO / CIO / Equivalent Person is responsible for setting the prioritisation and co-ordination of recovery tasks and has overriding control on the actions that are carried out on behalf of XX by any third parties.

9-Validation of Recovery

  1. The responsibility for validating successful recovery of services and data lies with the Head of IT/DR Team Leader and as such an incident will not be considered to be closed until the Incident Report has been submitted and approved by the Senior Leadership Team.
  2. Recovery performed by service providers on behalf of XX will also be validated by the Head of IT/DR Team Leader and reported on in the same manner as internally handled recovery

10-Service Level Agreements / Insurance

SLAs are to be sought from all primary providers and are either referenced within the contact details section of this document or held with the contract to supply. Insurance policies held by both XX and service providers should be checked prior to the financial commitment of any large-scale recovery operation, to ensure that no requirements are overlooked which would invalidate such insurance policies.

11-Disaster Recovery Teams & Responsibilities

In the event of a disaster, a Disaster Management Team (DMT) will be required to assist the IT department in its effort to restore normal functionality to the employees of XX.

12-Disaster Management Team

The Disaster Management Team will oversee the entire disaster recovery process and will be the first team to take action in the event of a disaster. This team will evaluate the disaster and determine what steps need to be taken to get the organisation back to business as usual.

This will be the team responsible for all communication during a disaster. Specifically, they will communicate with XX’s employees, clients, vendors and suppliers, banks, and even the media if required.

The Disaster Management Team will be led by the CTO / DR Lead – James Wilson

13-Roles & Responsibilities

  • Set the DRP into motion after the Disaster Recovery Lead has declared a disaster
  • Determine the magnitude and class of the disaster
  • Determine what systems and processes have been affected by the disaster
  • Communicate the disaster to the other disaster recovery teams
  • Determine what first steps need to be taken by the disaster recovery teams
  • Keep the disaster recovery teams on track with pre-determined expectations and goals
  • Keep a record of money spent during the disaster recovery process
  • Ensure that all decisions made abide by the DRP and policies set by XX
  • Ensure that the secondary site is fully functional and secure
  • Create a detailed report of all the steps undertaken in the disaster recovery process
  • Notify the relevant parties once the disaster is over and normal business functionality has been restored
  • After XX is back to business as usual, this team will be required to summarize any and all costs and will provide a report to the Disaster Recovery Lead summarizing their activities during the disaster 

14-Contact Information

14.1 Disaster Management Team in XX.

Name Role/Title Work Phone Number Mobile Phone Number

14.2 Senior Management Team

The Senior Management Team will make any business decisions that are out of scope for the Disaster Recovery Lead. Decisions such as constructing a new data centre, relocating the primary site, etc. should be made by the Senior Management Team. The Disaster Recovery Lead will ultimately report to this team.

Role & Responsibilities

  • Ensure that the Disaster Recovery Team Lead is held accountable for his/her role
  • Assist the Disaster Recovery Team Lead in his/her role as required
  • Make decisions that will impact the company. This can include decisions concerning:
    • Rebuilding of the primary facilities
    • Rebuilding of data centers
    • Significant hardware and software investments and upgrades
    • Other financial and business decisions

Contact Information

Add or delete rows to reflect the size of the Management Team in your organization.

Name Role/Title Work Phone Number Home Phone Number Mobile Phone Number

 

15-Disaster Recovery Call Tree

In a disaster recovery or business continuity emergency, time is of the essence so XX will make use of a Call Tree to ensure that appropriate individuals are contacted in a timely manner.

  • The Disaster Recovery Team Lead calls all Level 1 Members (Blue cells)
  • Level 1 members call all Level 2 team members over whom they are responsible (Green cells)
  • Level 1 members call all Level 3 team members over whom they are directly responsible (Beige cells)
  • Level 2 Members call all Level 3 team members over whom they are responsible (Beige cells)
  • In the event a team member is unavailable, the initial caller assumes responsibility for subsequent calls (i.e. if a Level 2 team member is inaccessible, the Level 1 team member directly contacts Level 3 team members).

Add as many levels as you need for your organization.

Contact | Office | Mobile | Home
DR Lead | 111-222-3333 | |
DR Management Team Lead | | |
DR Management Team 1 | | |
DR Management Team 2 | | |
DR Management Team 3 | | |
Management Team Lead | | |
Management Team 1 | | |

 

A Disaster Recovery Call Tree Process Flow diagram will help clarify the call process in the event of an emergency.
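The sketch below models the cascade described above on a hypothetical contact tree: each caller rings their direct contacts and, if someone is unreachable, the initial caller assumes responsibility for that person's calls. The names are placeholders; it illustrates the process flow only and is not a notification system.

    # Hypothetical call tree: who is responsible for calling whom.
    CALL_TREE = {
        "DR Lead": ["Level 1 - A", "Level 1 - B"],
        "Level 1 - A": ["Level 2 - A1", "Level 2 - A2"],
        "Level 2 - A1": ["Level 3 - A1a"],
        "Level 1 - B": ["Level 2 - B1"],
        "Level 2 - B1": ["Level 3 - B1a", "Level 3 - B1b"],
    }

    UNREACHABLE = {"Level 2 - A1"}  # hypothetical: this member does not answer

    def cascade(caller: str) -> None:
        """Call each direct contact; if one is unreachable, call their contacts directly."""
        for contact in CALL_TREE.get(caller, []):
            if contact in UNREACHABLE:
                print(f"{caller}: {contact} unavailable, contacting their team directly")
                for skipped in CALL_TREE.get(contact, []):
                    print(f"{caller} -> {skipped} (on behalf of {contact})")
                    cascade(skipped)
            else:
                print(f"{caller} -> {contact}")
                cascade(contact)

    cascade("DR Lead")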

16-Recovery Facilities

XX has adopted a blended approach to working; as such, all employees have the necessary equipment, policies and processes to work from home. We recognise, however, that a short-term office solution may be required. In order to ensure that XX is able to withstand a significant outage caused by a disaster, it would therefore provision separate dedicated standby facilities on a short-term basis, utilising the “we Space” model as it did in Stockport.

16.1 Description of Recovery Facilities

The Disaster Command and Control Centre or standby facility will be used after the Disaster Recovery Lead has declared that a disaster has occurred, and only if specifically deemed necessary and appropriate, for a defined period of time and predominantly to house the outward call teams. Any such location would be separate from the primary facility; the specific location is to be defined as and when required. All IT function employees currently work from home.

The standby facility will be used by the Disaster Recovery teams; it will function as a central location where all decisions during the disaster will be made. It will also function as a communications hub for XX.

The standby facility must always have the following resources available:

  • Copies of this DRP document
  • Sufficient infrastructure to support enterprise business operations
  • Office space for DR teams and outward call teams to use in the event of a disaster
  • External data and voice connectivity
  • Bathroom facilities (Including toilets, showers, sinks and appropriate supplies)
  • Parking spaces for employee vehicles

17-Data and Backups

17.1 Data in Order of Criticality

The list below itemises all of the data in our organisation in order of criticality.

Rank Data Data Type Back-up Frequency Backup Location(s)
1 <<Data Name or Group>> <<Confidential, Public, Personally identifying information>> <<Frequency that data is backed up>> <<Where data is backed up to>>
2
3
4
5
6
7
8
9
10

18-Azure Specific

XX uses Azure PaaS disaster recovery to secure our databases against catastrophic loss in the event of a major disaster. PaaS provides a platform where our development teams can manage applications without the additional complexity of maintaining the underlying infrastructure.

XX utilises an Azure disaster recovery plan template to secure our PaaS databases.

18.1 Azure Disaster Recovery Scenarios

There are several possible factors related to a PaaS disaster that XX has considered applicable to its business operations. Our IT teams are aware of these so that, in the event of a disaster, data recovery can be carried out efficiently. Region-wide service interruption is not the only cause of application-wide failures; poor design and administrative errors can also cause outages.

  1. Application Failure: Application errors can occur due to catastrophic exceptions caused by bad logic or data integrity issues. If the administrator has full knowledge of Azure PaaS disaster recovery processes, resolving critical errors becomes easy.
  2. Data Corruption: Azure SQL Database automatically stores data three times in different fault domains within the same region. If geo-replication is used, the data is stored a further three times in a different region. If the primary copy becomes corrupted it can quickly be replaced with one of the other copies, although corruption can also be replicated to those copies.
  3. Network Outage: Data may become inaccessible when the network is unavailable. If we are unable to access our data in PaaS due to a network outage, we must have a PaaS disaster recovery design in place that allows our apps to run with reduced functionality.
  4. Region-Wide Service Disruption: Administrators should prepare for a service interruption affecting an entire region. In the event of a region-wide service disruption, the local copies of data are unavailable; the data can be recovered if geo-replication is in place. Once Microsoft declares the region lost, Azure remaps all DNS entries to the geo-replicated region.
  5. Azure-Wide Service Disruption: While planning the disaster recovery solution we have considered the entire range of possible disasters, including the possibility that multiple regions are affected at once. We must prepare for this in advance by taking backups of the PaaS databases so that our data remains accessible if a critical catastrophe occurs.

By using the Azure disaster recovery plan template we have ensured that all data can be recovered.

We have defined the minimum level of functionality required during a disaster and will implement the DR plan to minimise the risk. Azure PaaS disaster recovery is focused on recovering from a catastrophic loss of application functionality. For this reason, we plan to be able to run the application and access the data in another region.

18.2 Utilising Azure therefore allows XX to adopt:

  1. Multiple Data Centre Regions: Azure maintains multiple data centres in different regions all over the world, supporting several PaaS disaster recovery scenarios by providing geo-replication of Azure storage to secondary regions. Storing data in multiple regions helps to secure the application against a major disaster.
  2. Azure Site Recovery: This provides a way to replicate virtual machines between regions. It requires minimal management overhead, as it does not require additional resources in the secondary region, and it provides automated continuous replication and enables failover with a single click.
  3. Administrators can run Azure disaster recovery drills by testing failover without disturbing production workloads or ongoing replication.
  4. Azure Traffic Manager: Whenever a region-specific failure occurs, Azure Traffic Manager redirects traffic to services in another region. It is the most convenient way to handle such a disaster, as traffic is redirected automatically when the primary region fails. Traffic Manager is therefore an important element when designing a PaaS disaster recovery plan and should be kept in the DR strategy.
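The sketch below is not Azure Traffic Manager itself; it only illustrates the probe-and-redirect idea behind priority-based routing: check the health of a primary regional endpoint and fall back to a secondary region when it is unhealthy. The endpoint URLs are hypothetical.

    import urllib.error
    import urllib.request

    # Hypothetical regional endpoints for the same service.
    ENDPOINTS = [
        "https://app-primary.uksouth.example.com/health",
        "https://app-secondary.ukwest.example.com/health",
    ]

    def healthy(url: str, timeout: float = 5.0) -> bool:
        """Treat an HTTP 200 response from the health endpoint as healthy."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except (urllib.error.URLError, TimeoutError):
            return False

    def active_endpoint() -> str:
        """Return the first healthy endpoint, mimicking priority-based traffic routing."""
        for url in ENDPOINTS:
            if healthy(url):
                return url
        raise RuntimeError("No healthy region available - invoke the DR plan")

    # print("Routing traffic to:", active_endpoint())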

18.3 How Azure Recovery Services Work

In the event of a disaster, we can access the data from our portal. Azure works remotely and continually monitors the servers for data centre failure. A security key establishes the connection between the cloud environment and on-premises systems.

19-Communicating During a Disaster

In the event of a disaster XX will need to communicate with various parties to inform them of the effects on the business, surrounding areas and timelines. The Communications Team will be responsible for contacting all of XX‘s stakeholders.

19.1 Communicating with Employees

The Communications Team's second priority will be to ensure that the entire company has been notified of the disaster. The best and/or most practical means of contacting all of the employees will be used, with preference given to the following methods (in order):

  • E-mail (via corporate e-mail where that system still functions)
  • E-mail (via non-corporate or personal e-mail)
  • Telephone to employee home phone number
  • Telephone to employee mobile phone number

The employees will need to be informed of the following:

  • Which services are still available to them
  • Work expectations of them during the disaster

Employee Contacts

Add or delete rows to reflect the employees in your organization.

Name Role/Title Home Phone Number Mobile Phone Number Personal E-mail Address

 

19.2 Communicating with Clients

After all of the organization’s employees have been informed of the disaster, the Communications Team will be responsible for informing clients of the disaster and the impact that it will have on the following:

  • Anticipated impact on service offerings
  • Anticipated impact on delivery schedules
  • Anticipated impact on security of client information
  • Anticipated timelines

19.3 Communicating with Vendors

After all of the organisation’s employees have been informed of the disaster, the Communications Team will be responsible for informing vendors of the disaster and the impact that it will have on the following:

  • Adjustments to service requirements
  • Adjustments to delivery locations
  • Adjustments to contact information
  • Anticipated timelines

Crucial vendors will be made aware of the disaster situation first: they will be e-mailed and then called to ensure that the message has been delivered. All other vendors will be contacted only after all crucial vendors have been contacted.

Vendors encompass those organisations that provide everyday services to the enterprise, but also the hardware and software companies that supply the IT department. The Communications Team will act as a liaison between the DR Team Leads and vendor contacts should additional IT infrastructure be required.

Company Name Point of Contact Phone Number E-mail
IT Provider
       

Communication of relevant DR information throughout XX will also be provided by the DR Team Leader and, during severe instances, to the Senior Leadership Team for appropriate communication to all employees and contractors. It is essential that accurate information regarding expectations and realistic recovery timescales is conveyed through proper channels.

In some events, such as personal data compromise or loss, the DPO must be informed in order to discharge the legal obligations of informing relevant authorities such as the Information Commissioner's Office.

Press releases and other external communication are strictly under the control of the Senior Leadership Team. Dissemination of information by other channels will be viewed as inappropriate and may well instigate disciplinary action.

20-Activating the Disaster Recovery Plan

If a disaster occurs in XX, the first priority is to ensure that all employees are safe and accounted for. After this, steps must be taken to mitigate any further damage to the facility and to reduce the impact of the disaster to the organisation.

Regardless of the category that the disaster falls into, dealing with a disaster can be broken down into the following steps:

  • Disaster identification and declaration
  • DRP activation
  • Communicating the disaster
  • Assessment of current damage and prevention of further damage
  • Standby facility activation
  • Establish IT operations
  • Repair and rebuilding of primary facility

20.1 DRP Activation

Once the Disaster Recovery Lead has formally declared that a disaster has occurred, s/he will initiate the activation of the DRP by triggering the Disaster Recovery Call Tree. The following information will be provided in the calls that the Disaster Recovery Lead makes and should be passed on during subsequent calls:

  • That a disaster has occurred
  • The nature of the disaster (if known)
  • The initial estimation of the magnitude of the disaster (if known)
  • The initial estimation of the impact of the disaster (if known)
  • The initial estimation of the expected duration of the disaster (if known)
  • Actions that have been taken to this point
  • Actions that are to be taken prior to the meeting of Disaster Recovery Team Leads
  • Scheduled meeting place for the meeting of Disaster Recovery Team Leads
  • Scheduled meeting time for the meeting of Disaster Recovery Team Leads
  • Any other pertinent information

If the Disaster Recovery Lead is unavailable to trigger the Disaster Recovery Call Tree, that responsibility shall fall to the Disaster Management Team Lead.

20.2 Assessment of Current and Prevention of Further Damage

Before any employees from XX can enter the primary facility after a disaster, appropriate authorities must first ensure that the premises are safe to enter.

The first team that will be allowed to examine the primary facilities once it has been deemed safe to do so will be the Facilities Team. Once the Facilities Team has completed an examination of the building and submitted its report to the Disaster Recovery Lead, the Disaster Management, Networks, Servers, and Operations Teams will be allowed to examine the building. All teams will be required to create an initial report on the damage and provide this to the Disaster Recovery Lead within 72 hours of the initial disaster.

During each team’s review of their relevant areas, they must assess any areas where further damage can be prevented and take the necessary means to protect XX’s assets. Any necessary repairs or preventative measures must be taken to protect the facilities; these costs must first be approved by the Disaster Recovery Team Lead.

21-Reporting & Incident Analysis

Full documentation of the incident must be completed in an appropriate timescale. The following must be included in the report:

  1. Date & Time of initial incident occurrence
  2. Method and details of notification
  3. Steps leading to identification of the problem
  4. Assessment criteria used to formulate a level of response
  5. Third Party / Service Provider involvement
  6. Description of causes
  7. Record of events for each step/action/process undertaken, including outcomes
  8. Resources involved
  9. Effectiveness of solution and analysis of target objectives against actuals
  10. Possible list of mitigations
  11. Lessons to be learned

22-Standby Facility Activation

The Standby Facility will be formally activated when the Disaster Recovery Lead determines that the nature of the disaster is such that the primary facility is no longer sufficiently functional or operational to sustain normal business operations.

Once this determination has been made, the Facilities Team will be commissioned to bring the Standby Facility to functional status after which the Disaster Recovery Lead will convene a meeting of the various Disaster Recovery Team Leads at the Standby Facility to assess next steps. These next steps will include:

  1. Determination of impacted systems
  2. Criticality ranking of impacted systems
  3. Recovery measures required for high criticality systems
  4. Assignment of responsibilities for high criticality systems
  5. Schedule for recovery of high criticality systems
  6. Recovery measures required for medium criticality systems
  7. Assignment of responsibilities for medium criticality systems
  8. Schedule for recovery of medium criticality systems
  9. Recovery measures required for low criticality systems
  10. Assignment of responsibilities for recovery of low criticality systems
  11. Schedule for recovery of low criticality systems
  12. Determination of facilities tasks outstanding/required at Standby Facility
  13. Determination of operations tasks outstanding/required at Standby Facility
  14. Determination of communications tasks outstanding/required at Standby Facility
  15. Determination of facilities tasks outstanding/required at Primary Facility
  16. Determination of other tasks outstanding/required at Primary Facility
  17. Determination of further actions to be taken

During Standby Facility Operations, Networks, Servers, Applications, and Operations teams will need to ensure that their responsibilities, as described in the “Disaster Recovery Teams and Responsibilities” section of this document are carried out quickly and efficiently so as not to negatively impact the other teams.

23-Restoring IT Functionality

Should a disaster actually occur and XX need to exercise this plan, this section will be referred to frequently as it will contain all of the information that describes the manner in which XX’s information system will be recovered.

This section will contain all of the information needed for the organisation to get back to its regular functionality after a disaster has occurred. It is important to include all Standard Operating Procedures documents, run-books, network diagrams, software format information etc. in this section.

24-Current System Architecture

In this section, include a detailed system architecture diagram. Ensure that all of the organization’s systems and their locations are clearly indicated.

<<System Architecture Diagram>>

24.1 IT Systems
XX lists all of the IT systems in our organisation in order of their criticality, together with each system's components that will need to be brought back online in the event of a disaster. Add or delete rows as needed in the table below.

Rank IT System System Components (In order of importance)
1
2
3
4
5
6
7
8
9

24.2 Criticality Priority 1 System

This section ranks each system’s components in order of criticality, supplying the information that each system will require to bring it back online. First, vendor and model information, serial numbers and other component specific information will be gathered. Each component’s runbooks or Standard Operating Procedure (SOP) documents are attached as appendices at the end of the document

Each component must have a runbook or SOP document associated with it as below:

EXAMPLE:

System Name <<State the name of the IT System here>>
Component Name <<State the name of the specific IT Component here>>
Vendor Name <<State the name of the IT Component’s vendor here>>
Model Number <<State the name of the IT Component’s model number here>>
Serial Number <<State the name of the IT Component’s serial number here>>
Recovery Time Objective <<State the IT Component’s Recovery Time Objective here>>
Recovery Point Objective <<State the IT Component’s Recovery Point Objective here>>

 

Title: Standard Operating Procedures for <<Component Name>>
Document No.: <<Number of the SOP document>>

a) Purpose

This SOP outlines the steps required to restore operations of XX.

b) Scope

This SOP applies to the following components of XX

  • Web server
  • Web server software
  • Application server
  • Application server storage system
  • Application server software
  • Application server backup
  • Database server
  • Database server storage system
  • Database server software
  • Database server backup
  • Client hardware
  • Client software

c) Responsibilities

The following individuals are responsible for this SOP and for all aspects of the system to which this SOP pertains:

  • Edit this list as required
  • SOP Process:                      << SOP Owner>>
  • Network Connectivity:     <<Appropriate Network Administrator>>
  • Server Hardware:             <<Appropriate Systems Administrator>>
  • Server Software:               <<Appropriate Application Administrator>>
  • Client Connectivity:          <<Appropriate Network Administrator>>
  • Client Hardware:              <<Appropriate Helpdesk Administrator>>
  • Client Software:                <<Appropriate Helpdesk Administrator>>

For details of the actual tasks associated with these responsibilities, refer to section h) of this SOP.

d) Definitions

This section defines acronyms and words not in common use:

  • Edit this list as required
  • Document No.:  Number of the SOP document as defined by [insert numbering scheme]
  • Effective Date:   The date from which the SOP is to be implemented and followed
  • Review Date:      The date on which the SOP must be submitted for review and revision
  • Security Level:    Levels of security are categorized as Public, Restricted, or Departmental
  • SOP:                      Standard Operating Procedure

e) Changes Since Last Revision

  • Add to this list as required
  • << Nature of change, date of change, individual making the change, individual authorising the change>>

f) Documents/Resources Needed for this SOP

The following documents are required for this SOP:

  • Add to this list as required
  • Document

g) Related Documents

The following documents are related to this SOP and may be useful in the event of an emergency. The documents below are hyperlinked to their original locations, and copies are also attached in the appendix of this document:

  • Add to this list as required
  • Document

h) Procedure

The following are the steps associated with bringing <<Component Name>> back online in the event of a disaster or system failure.

Security Level: << Public, Restricted, or Departmental (the specific department is named).>> Effective Date: <<The date from which the SOP is to be implemented and followed>>
SOP Author/Owner: SOP Approver: Review Date: <<The date on which the SOP must be submitted for review and revision>>
Step Action Responsibility
1 <<Step 1 Action>> <<Person/group responsible>>
2
3
4
5
6
7

24.3 Criticality Priority 2 System

Repeat as above for as many systems as the enterprise makes use of.

 

25-Plan Testing & Maintenance

While efforts will be made initially to construct this DRP in as complete and accurate a manner as possible, it is essentially impossible to address all possible problems at any one time. Additionally, over time the disaster recovery needs of the enterprise will change. As a result of these two factors, this plan will need to be tested on a periodic basis to discover errors and omissions and will need to be maintained to address them.

26-Maintenance

The DRP will be updated annually or any time a major system update or upgrade is performed, whichever is more often. The Disaster Recovery Lead will be responsible for updating the entire document, and so is permitted to request information and updates from other employees and departments within the organization in order to complete this task.

Maintenance of the plan will include (but is not limited to) the following:

  1. Ensuring that call trees are up to date
  2. Ensuring that all team lists are up to date
  3. Reviewing the plan to ensure that all of the instructions are still relevant to the organization
  4. Making any major changes and revisions in the plan to reflect organizational shifts, changes and goals
  5. Ensuring that the plan meets any requirements specified in new laws
  6. Other organisational specific maintenance goals

During the Maintenance periods, any changes to the Disaster Recovery Teams must be accounted for. If any member of a Disaster Recovery Team no longer works with the company, it is the responsibility of the Disaster Recovery Lead to appoint a new team member. 

27-Testing

In order to be effective, regular testing of the disaster recovery plan should be undertaken. It is only during testing that parameters such as the RTO and RPO can be confirmed as appropriate, and that the data backups can be shown to be wholly capable of providing the required level of recovery. It is not sufficient to assume that just because data is being backed up it will be recoverable. XX is committed to ensuring that this DRP is functional; it should be tested annually in order to ensure that it is still effective.

XX will employ the following methods to test the DRP:

  1. Walkthroughs – Team members verbally go through the specific steps as documented in the plan to confirm effectiveness and identify gaps, bottlenecks or other weaknesses. This test provides the opportunity to review the plan with a larger subset of people, allowing the DRP project manager to draw upon a correspondingly increased pool of knowledge and experience. Staff should be familiar with procedures, equipment, and offsite facilities (if required).
  2. Simulations – A disaster is simulated so that normal operations will not be interrupted. Hardware, software, personnel, communications, procedures, supplies and forms, documentation, transportation, utilities, and alternate site processing should be thoroughly tested in a simulation test. However, validated checklists can provide a reasonable level of assurance for many of these scenarios. Analyse the output of the previous tests carefully before the proposed simulation to ensure the lessons learned during the previous phases of the cycle have been applied.
  3. Parallel Testing – A parallel test can be performed in conjunction with the checklist test or simulation test. Under this scenario, historical transactions, such as the prior business day's transactions, are processed against the preceding day's backup files at the contingency processing site or hot site. All reports produced at the alternate site for the current business date should agree with those produced at the primary site.
  4. Full-Interruption Testing – A full-interruption test activates the total DRP. The test is likely to be costly and could disrupt normal operations, and therefore should be approached with caution. The importance of due diligence with respect to previous DRP phases cannot be overstated.

Any gaps in the DRP that are discovered during the testing phase will be addressed by the Disaster Recovery Lead as well as any resources that he/she will require.

There may be certain elements that are difficult to fully test in this situation, but these will also only be identified during testing.

 

28-Call Tree Testing

Call Trees are a major part of the DRP and XX requires that they be tested annually in order to ensure that they are functional. Tests will be performed as follows:

  • Disaster Recovery Lead initiates call tree and gives the first round of employees called a code word.
  • The code word is passed from one caller to the next.
  • The next work day all Disaster Recovery Team members are asked for the code word.
  • Any issues with the call tree, contact information etc will then be addressed accordingly.

29-Policy and Procedure Review

All policies including the DR Plan will be reviewed when there are changes in employment law that are relevant, where there is a change in the business need or when feedback from HR, line managers or Trade Unions suggest that the policy is either out of date or unfit for purpose. 

30-Ownership and Revision

This Plan is owned by the Board of Directors of the Business, which has delegated this task to the Chief Information Security Officer or other designated person. This policy shall be revised once every two years by the CISO or other designated person, and every time that the Board of Directors of the Business decides to do so.

Version Control

Title Disaster Recovery Plan
Description Policy and Process
Created By Xapads Media Pvt. Ltd, 5th Floor, Windsor IT Park, Tower B, Plot No, A1.
Date Created 14/09/2023
Maintained By Xapads Media Pvt. Ltd,
Version Number Modified By Modifications Made Date Modified Status