TransPAC Problem Management/Reporting Procedures

Statement of Purpose

Problem management is the process of identifying and resolving network problems. The goal of problem management is to maintain the highest possible standard of reliability and availability for the TransPAC Network. The following procedures are the main focus of the TransPAC Network Operations Center's (NOC) problem management services.

Procedures

In the event of an unscheduled problem or outage in the TransPAC Network, the NOC follows an interwoven set of procedures to facilitate quick resolution: problem alert, paging, tracking, problem identification and isolation, notification, and troubleshooting. Many of these tasks are carried out simultaneously as the TransPAC NOC draws on its resources to help resolve the problem. If no action is taken or no resolution is reached within the accepted time intervals, the problem is escalated to ensure that all available resources are used in the effort to restore the network.

Problem Reporting

The TransPAC NOC has both proactive and reactive methods of identifying events affecting the performance of the network. NOC technicians are available twenty-four hours a day, seven days a week, at:

Phone number: 317-278-6630

Problems are immediately logged as an incident in the trouble ticket system with event history, contact information, resolution details, and follow-up procedures.

Email to the NOC is checked continually, day and night. Email submissions are either resolved with a direct response or developed into an incident for further follow-up in the trouble ticket system.

Web-based submission forms are available for specific network systems and are automatically converted into trouble tickets. They are registered in the TransPAC Network Operations job queue for immediate attention.

The trouble ticket system allows detailed information on each problem to be shared by NOC personnel.
All team members maintain a general working knowledge of all open tickets, even if their technical specialty is not directly involved. The TransPAC NOC uses a nationwide paging system to ensure that any member of the team can be reached regardless of location.

Problem Alert

The TransPAC NOC uses multiple tools and procedures in a front-line, proactive approach to detecting potential network failures. The NOC employs multiple network monitoring programs running across several platforms. The variety and combination of programs help ensure strict and redundant monitoring of network resources. The redundant monitoring tools allow the NOC to properly perform its network responsibilities. Multiple graphic summaries of network status and device-specific detailed statistical information provide a built-in redundancy that facilitates both immediate and appropriate action by NOC personnel. NOC monitoring procedures provide accurate problem reporting, assistance in effective troubleshooting, and the development of procedures to anticipate and prevent future events affecting network availability.

Once the TransPAC NOC is alerted to a problem on the network, it begins a highly structured set of procedures aimed at problem resolution.

Problem Assignment and Paging

The TransPAC NOC assigns problems to its engineering staff via a round-robin method: engineers are assigned problems in turn, with each engineer getting the same number of turns. This system is used during normal business hours, Monday through Friday, 8:00 am to 5:00 pm (EST). After hours and on weekends, problems are assigned to a designated on-call engineer. This responsibility rotates among the engineers on a week-by-week basis.

The TransPAC NOC technicians page an engineer when assigning a problem to them. The TransPAC NOC employs a strict paging policy that is enforced and followed 24 hours a day, seven days a week.
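
The assignment rules described above (round-robin during business hours, a weekly on-call engineer otherwise) can be illustrated with a minimal Python sketch. It is not the NOC's actual tooling: the class name, the engineer names, and the choice to derive the weekly on-call rotation from the ISO week number are all assumptions made for illustration, and timestamps are assumed to be in the NOC's local time (EST).

```python
from datetime import datetime
from itertools import cycle


def is_business_hours(when: datetime) -> bool:
    """Monday through Friday, 8:00 am to 5:00 pm, in the NOC's local time."""
    return when.weekday() < 5 and 8 <= when.hour < 17


class ProblemAssigner:
    """Hypothetical helper that picks the engineer for a new problem."""

    def __init__(self, engineers):
        self.engineers = list(engineers)
        self._turns = cycle(self.engineers)  # endless round-robin order

    def on_call(self, when: datetime) -> str:
        # Assumption: the week-by-week rotation follows the ISO week number.
        week = when.isocalendar()[1]
        return self.engineers[week % len(self.engineers)]

    def assign(self, when: datetime) -> str:
        # Business hours: the next engineer in turn.
        # After hours and weekends: the designated on-call engineer.
        if is_business_hours(when):
            return next(self._turns)
        return self.on_call(when)


if __name__ == "__main__":
    assigner = ProblemAssigner(["ana", "ben", "chen"])  # hypothetical names
    monday_morning = datetime(2024, 1, 8, 9, 0)
    print(assigner.assign(monday_morning))
    print(assigner.assign(monday_morning))
    print(assigner.assign(datetime(2024, 1, 13, 9, 0)))  # a Saturday
```

Keeping the round-robin state separate from the on-call lookup means after-hours problems do not consume anyone's daytime turn, which is one reasonable reading of "each engineer getting the same number of turns".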

At the first determination of a problem within the TransPAC Network, a NOC technician will page the designated on-call engineer. At the same time, NOC technicians will begin the tracking and notification processes and assist the engineer in the problem identification and isolation process. The paging procedure is:

Upon calling in, the engineer is informed of the problem or failure and is provided with all supporting information. At this point a strategy is decided upon and documented. Engineers are required to update the NOC technicians continually so that timely and accurate status notifications can be sent to affected parties.

If the problem is not resolved within one hour, the Engineering Manager must be notified. At this time, it is the responsibility of the Engineering Manager to contact the appropriate parties within the TransPAC Network administration and at Indiana University.

Tracking

At the onset of problem determination, a Trouble Ticket will be opened by a TransPAC NOC technician. This will include all relevant information relating to the problem. The intermediate steps of tracking will include comprehensive updates of related information as it becomes available. This will provide a detailed chronology of the problem, including coordination efforts, from start to finish.

Upon resolution, an incident is closed only after all related information is compiled. This includes detailed problem solving and resolution summaries from TransPAC engineers, related vendors, or personnel from other parts of the network. Following closure, the incident is available as a future resource for similar problems. Closed incidents are reviewed on a weekly basis for training purposes and quality assurance.

Problem Identification and Isolation

Once a network problem has been determined, the TransPAC NOC technicians will use their tools and network expertise to help identify and isolate the problem.
Through the paging process, the TransPAC NOC engineers will take over primary problem identification and isolation responsibilities. In conjunction with the engineers, the TransPAC NOC technicians will continue to help in whatever manner is necessary until the problem is identified.

Notification

To ensure proper communication during network problems, the TransPAC NOC will use several methods of information dissemination. Notification of the problem will be sent via email to an appropriate TransPAC listserv. Notification will be sent out in various phases. They are:

· Initial Status Report: This will be sent as soon as a problem has been reported and a problem ticket is opened. The notification may not initially identify the cause or source of the difficulty, but will report which network components are affected, the status of their functionality, and the scope of the outage in relation to the TransPAC network as a whole.

· Identification: This phase will state the cause and source of the problem (if not already given in the Initial Status Report) and what course of corrective action is being followed. An estimated time of resolution will be given, if at all possible.

· Updates: Periodic updates will be given once an hour until the problem has been resolved. Any new information, milestones, or setbacks will be included.

· Closure: Upon closure, a resolution synopsis will be prepared and distributed immediately. This notice will include details regarding final resolution. Any other important pieces of information will also be disclosed. Review of the completed Trouble Ticket will be available upon request.

Troubleshooting

It is the primary responsibility of the TransPAC NOC engineers to troubleshoot problems on the TransPAC network. However, this is often a collaborative effort with our vendor partners in support of the TransPAC Network. Joint problem solving and coordination procedures have been established with the related vendors.
Each vendor maintains its own Trouble Ticket system, with information shared between parties in a collaborative effort to resolve the problem. Once a Trouble Ticket is opened with a vendor, NOC technicians contact the appropriate engineers and support personnel throughout the TransPAC network and inform them of the events and procedures relating to the problem.

Escalation

Once a problem is recognized and support personnel are notified, a Trouble Ticket is created. At this time, the problem is assigned an appropriate criticality. This applies to any failure or degradation in service to any resource within the TransPAC Network. The incident is color coded to designate this criticality:

· Red (action needed within 0-59 minutes)

The TransPAC NOC will pay strict attention to the status designated to each open Trouble Ticket, and will act immediately when escalation is needed.

An incident is designated code red when the network, or a key network resource, is down and unavailable. This is a serious problem and requires immediate action. Please notify both the on-call engineer and the Engineering Manager. If the problem is not acted upon within one hour and a status determined, the Engineering and Operations Managers must be notified. At this time, it is the responsibility of the Engineering Manager to contact the appropriate parties within the TransPAC Network administration and Indiana University.

A yellow designation indicates that the network, or a resource within it, is suffering from unacceptable degradation but is not completely down. It is a high-priority matter and requires action and a status report within 48 hours. A yellow coded ticket is escalated to red if action has not been taken within this time frame.

A green coded ticket relates to a network problem or situation that does not have a major impact on the TransPAC Network as a whole. However, it is a matter that does demand action within two to three days.
If appropriate action is not taken, or a status report is not given, within this time, the ticket will be escalated to code yellow.

A ticket is designated blue when no further action is required in the problem resolution cycle. Most likely, it remains open to collect further information about the nature of the problem or its resolution, or as a reminder to observe a newly repaired TransPAC resource.

Tickets will also be de-escalated from one code to another as deemed appropriate through communication between TransPAC technicians, engineers, and support vendors, all within the problem resolution cycle.
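
The time-based escalation rules in this section can be summarized in a short Python sketch. This is an illustration under stated assumptions, not the NOC's trouble ticket system: the function and table names are hypothetical, the green deadline uses the upper bound of "two to three days", and a ticket is treated as overdue once the full interval has elapsed with no action or status report.

```python
from datetime import datetime, timedelta

# Time allowed for action or a status report before escalation.
# Red (one hour) and yellow (48 hours) come from the text; green uses
# the upper bound of "two to three days", which is an assumption.
DEADLINES = {
    "red": timedelta(hours=1),
    "yellow": timedelta(hours=48),
    "green": timedelta(days=3),
}
NEXT_CODE = {"green": "yellow", "yellow": "red"}


def check_escalation(code: str, last_action: datetime, now: datetime):
    """Return (new_code, notify_managers) for a ticket.

    notify_managers is True only for an overdue red ticket, when the
    Engineering and Operations Managers must be notified.
    """
    if code == "blue":
        # Blue: no further action required, so nothing to escalate.
        return code, False
    if now - last_action < DEADLINES[code]:
        return code, False  # still within the allowed interval
    if code == "red":
        return "red", True  # already the highest code: notify managers
    return NEXT_CODE[code], False  # green -> yellow, yellow -> red


if __name__ == "__main__":
    opened = datetime(2024, 1, 8, 9, 0)
    print(check_escalation("yellow", opened, opened + timedelta(hours=49)))
    print(check_escalation("red", opened, opened + timedelta(minutes=61)))
```

De-escalation is left out of the sketch because the text describes it as a judgment reached through communication between technicians, engineers, and vendors rather than as a timed rule.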