STAR TAP Problem Management/Reporting Procedures

    Statement of Purpose

    Problem management is the process of identifying and resolving network problems. The goal of problem management is to maintain the highest standard of reliability and availability possible to the STAR TAP access point. The following procedures are considered the main focus of the STAR TAP Network Operations Center's problem management services.

    Procedures

    In the event of an unscheduled problem or outage at the STAR TAP access point, the NOC will follow an interwoven set of procedures to facilitate quick resolution. They are problem alert, paging, tracking, problem identification and isolation, notification, and troubleshooting. Many of these tasks are carried out simultaneously as the STAR TAP NOC draws on its resources to help resolve the problem. If action is not taken or a resolution is not found within accepted time intervals, the problem will be escalated to ensure that all available resources are used in the effort to restore the network.

    Problem Reporting

    The STAR TAP NOC has both proactive and reactive methods of identifying events affecting the performance of the network. NOC technicians are available twenty-four hours a day, seven days a week, at:

    Phone number: 317-278-6630
    Email: noc@startap.net
    Web based submittal form: noc.startap.net/probman.html

    Telephone contact is the preferred and most immediate means of reporting any type of network problem, question, or emergency. Calls are immediately logged as an incident in the trouble ticket system with event history, contact information, resolution details, and follow up procedures.

    Email to the NOC is checked continually day and night. Email submissions are either resolved with a direct response or developed into an incident for further follow up in the trouble ticket system. Web based submission forms are available for specific network systems and are automatically converted into a trouble ticket. They are registered in the STAR TAP Network Operations job queue for immediate attention.

    The trouble ticket system allows detailed information on each problem to be shared by NOC personnel. All team members maintain a general working knowledge of all open tickets even if their special technical concentration is not specifically involved. The STAR TAP NOC uses a nationwide paging system to ensure that any member of the team may be reached regardless of their location.

    Problem Alert

    The STAR TAP NOC uses multiple tools and procedures in a front line, proactive approach towards the detection of potential network failures. The NOC employs multiple network monitoring programs running across several platforms. The variety and combination of programs help ensure strict and redundant monitoring of the network resources.

    The redundant monitoring tools allow the NOC to properly perform its network responsibilities. Multiple graphic summaries of network status and device specific detailed statistical information provide a built in redundancy that facilitates both immediate and appropriate action by NOC personnel. NOC monitoring procedures provide accurate problem reporting, assistance in effective troubleshooting, and the development of procedures to anticipate and prevent future events affecting network availability.

    Once the STAR TAP NOC is alerted to a problem on the network, it begins a highly structured set of procedures towards problem resolution.

    Problem Assignment and Paging

    The STAR TAP NOC assigns problems to its engineering staff via a Round Robin method. In effect, the engineers are assigned problems when it is their turn, with each engineer getting the same number of turns. This system is used during normal business hours, Monday through Friday, 8:00 am to 5:00 pm (EST). After hours and on weekends, the problems are assigned to a designated on call engineer. This responsibility rotates between the engineers on a week-by-week basis. The STAR TAP NOC technicians page an engineer when assigning a problem to them.
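
    The assignment rule above can be sketched in a few lines of Python. This is only an illustration, not the NOC's actual tooling: the class name, the engineer names, and the choice to treat timestamps as already being in the NOC's local time are all assumptions made for the example.

```python
from datetime import datetime
from itertools import cycle


class RoundRobinAssigner:
    """Hypothetical helper: hands out problems in turn during business hours."""

    def __init__(self, engineers):
        # cycle() repeats the list forever, so every engineer gets equal turns.
        self._rotation = cycle(list(engineers))

    @staticmethod
    def is_business_hours(when: datetime) -> bool:
        # Monday through Friday (weekday 0-4), 8:00 am up to 5:00 pm.
        # Assumes `when` is already expressed in the NOC's local time.
        return when.weekday() < 5 and 8 <= when.hour < 17

    def assign(self, when: datetime, on_call: str) -> str:
        # Business hours: next engineer in the rotation.
        # After hours and weekends: the designated on call engineer.
        if self.is_business_hours(when):
            return next(self._rotation)
        return on_call


if __name__ == "__main__":
    # Engineer names are placeholders, not real staff.
    assigner = RoundRobinAssigner(["engineer_a", "engineer_b", "engineer_c"])
    print(assigner.assign(datetime(2024, 1, 8, 10, 0), on_call="engineer_c"))
    print(assigner.assign(datetime(2024, 1, 13, 10, 0), on_call="engineer_c"))
```

    In this sketch an after hours assignment does not consume a turn in the rotation, which is one possible reading of "each engineer getting the same number of turns".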

    The STAR TAP NOC employs a strict paging policy that is enforced and followed 24 hours a day, seven days a week. At the first determination of a problem within the STAR TAP Network, a NOC technician will page the designated on call engineer. At the same time, NOC technicians will begin the tracking and notification processes, and assist the engineer in the problem identification and isolation process.

    The paging procedure is:

    1. Page primary on call engineer. If no response in 7 minutes, then...
    2. Page primary on call engineer again. Also page secondary on call engineer. The first engineer to call in takes primary ownership of the problem.
    3. If there is still no response in another 7 minutes, the problem is escalated to the Manager of Engineering and the Manager of Operations.
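
    As a rough illustration of the timeline in the three steps above, the following Python sketch lists the paging actions that should have been taken after a given number of minutes without a response. The function and constant names are hypothetical; only the 7 minute interval and the order of the steps come from the procedure.

```python
PAGE_INTERVAL_MIN = 7  # minutes to wait for a response before the next step


def paging_actions(minutes_without_response: int) -> list[str]:
    """Return every paging action due so far, in the order it should occur."""
    # Step 1 happens immediately.
    actions = ["page primary on call engineer"]
    # Step 2: no response after the first 7 minutes.
    if minutes_without_response >= PAGE_INTERVAL_MIN:
        actions.append("page primary on call engineer again")
        actions.append("page secondary on call engineer")
    # Step 3: still no response after another 7 minutes.
    if minutes_without_response >= 2 * PAGE_INTERVAL_MIN:
        actions.append(
            "escalate to Manager of Engineering and Manager of Operations"
        )
    return actions


if __name__ == "__main__":
    for minutes in (0, 7, 14):
        print(minutes, paging_actions(minutes))
```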

    Upon calling in, the engineer is informed of the problem or failure and is provided with all supporting information. At this point a strategy is decided upon and documented. It is required that engineers continually update the NOC technicians so timely and accurate status notifications can be sent to affected parties.

    If the problem is not resolved within one hour, the Engineering Manager must be notified. At this time, it is the responsibility of the Engineering Manager to contact appropriate parties within the STAR TAP Network administration, and with Indiana University.

    Tracking

    At the onset of problem determination, a Trouble Ticket will be opened by a STAR TAP NOC technician. This will include all relevant information relating to the problem. The intermediate steps of tracking will include comprehensive updates of related information as it becomes available. This will provide a detailed chronology of the problem, including coordination efforts, from start to finish. Upon resolution, an incident is only closed after all related information is compiled. This includes detailed problem solving and resolution summaries from STAR TAP engineers, related vendors, or personnel from within other parts of the network. Following closure, the incident is available as a future resource for similar problems. Closed incidents are reviewed on a weekly basis for training purposes and quality assurance.

    Problem Identification and Isolation

    Once a network problem has been determined, the STAR TAP NOC technicians will utilize their tools and network expertise to help identify and isolate the problem. Through the paging process, the STAR TAP NOC engineers will take over primary problem identification and isolation responsibilities. In conjunction with the engineers, the STAR TAP NOC technicians will continue to help in whatever manner necessary until the problem is identified.

    Notification

    To ensure proper communication during network problems, the STAR TAP NOC will utilize several methods of information dissemination. Notification of the problem will be sent via email to an appropriate STAR TAP listserv.

    Notification will be sent out in various phases. They are:

    Initial Status Report: This will be performed as soon as a problem has been reported, and a problem ticket is opened. Notification may not initially identify the cause or source of difficulty, but will report what network components are affected, the status of their functionality, and the scope of the outage in relation to the STAR TAP network as a whole.

    Identification: This phase will state the cause and source of the problem (if not already related in the Initial Status Report), and what course of corrective action is being followed. An estimated time of resolution will be given, if at all possible.

    Updates: Periodic updates will be given once an hour until the problem has been resolved. Any new information, milestones, or setbacks will be included.

    Closure: Upon closure, a resolution synopsis will be prepared and distributed immediately. This notice will include details regarding final resolution. Any other important pieces of information will also be disclosed. Review of the completed Trouble Ticket will be available upon request.
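
    The hourly cadence of the Updates phase can be illustrated with a short Python sketch. It assumes, purely for the example, that the first update falls one hour after the Initial Status Report and that no update is sent once the problem is resolved, since the Closure notice covers that point. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def update_times(opened: datetime, resolved: datetime) -> list[datetime]:
    """Return the times at which hourly updates fall between open and resolve."""
    times = []
    next_update = opened + timedelta(hours=1)  # assumed start of the cadence
    while next_update < resolved:
        times.append(next_update)
        next_update += timedelta(hours=1)
    return times


if __name__ == "__main__":
    opened = datetime(2024, 1, 8, 10, 0)
    resolved = datetime(2024, 1, 8, 13, 30)
    for t in update_times(opened, resolved):
        print(t.isoformat())
```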

    Troubleshooting

    It is the primary responsibility of the STAR TAP NOC engineers to troubleshoot problems on the STAR TAP network. However, this is often a collaborative effort with our vendor partners in support of the STAR TAP Network. Joint problem solving and coordination procedures have been established with the related vendors. Each maintains their own Trouble Ticket system, with information shared between parties in a collaborative effort to resolve the problem. Once a Trouble Ticket is opened with a vendor, NOC technicians contact the appropriate engineers and support personnel throughout the STAR TAP network and inform them of the events and procedures relating to the problem.

    Escalation

    Once a problem is recognized and support personnel are notified, a Trouble Ticket is created. At this time, the problem is assigned an appropriate criticality. This applies to any failure or degradation in service to any resource within the STAR TAP Network. The incident is color coded to designate this criticality:

    · Red (action needed within 0-59 minutes)
    · Yellow (action needed within 1-48 hours)
    · Green (action needed within 48-72 hours)
    · Blue (no action is needed)

    The STAR TAP NOC will pay strict attention to the status designated to each open Trouble Ticket, and will act immediately as escalation is needed.

    An incident is designated code red when the network, or a key network resource, is down and unavailable. This is a serious problem and requires immediate action. NOC technicians must notify both the on call engineer and the Engineering Manager. If the problem is not acted upon within one hour and a status determined, the Engineering and Operations Managers must be notified. At this time, it is the responsibility of the Engineering Manager to contact the appropriate parties within the STAR TAP Network administration and Indiana University.

    A yellow designation indicates that the network, or a resource within it, is suffering from some sort of unacceptable degradation but is not completely down. It is a matter given high priority, and requires action and a status report within 48 hours. A yellow coded ticket is escalated to red if action has not been taken within this time frame.

    A green coded ticket relates to a network problem or situation that does not have a major impact on the STAR TAP Network as a whole. However, it is a matter that does demand action within two to three days. If appropriate action is not enacted within this time, or a status report given, it will be escalated to code yellow.

    Blue tickets are given this designation when there is no further action required in the problem resolution cycle. Most likely, it is still open to collect further information regarding the nature of the problem or resolution, or as a means of reminder to observe a newly repaired STAR TAP resource, etc.

    Tickets will also be deescalated from one code to another as deemed appropriate via communication between STAR TAP technicians, engineers, and support vendors, all within the problem resolution cycle.
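
    The escalation rules described in this section can be summarized in a small Python sketch. It is an illustration under stated assumptions, not the behavior of the NOC's trouble ticket system: the function and table names are hypothetical, and the green limit is taken as 72 hours, the upper end of the "two to three days" window.

```python
# Hours without action after which a ticket moves up, and the code it moves to.
# The 72 hour green limit is an assumption (upper end of "48-72 hours").
ESCALATION_RULES = {
    "green": (72, "yellow"),
    "yellow": (48, "red"),
}

RED_MANAGER_NOTIFY_MIN = 60  # code red: notify managers if no action in one hour


def escalate(code: str, hours_without_action: float) -> str:
    """Return the color code a ticket should carry after the elapsed time."""
    rule = ESCALATION_RULES.get(code)
    if rule is None:
        # Red is already the highest code; blue needs no further action.
        return code
    limit_hours, next_code = rule
    return next_code if hours_without_action > limit_hours else code


def red_needs_manager_notification(minutes_without_action: float) -> bool:
    """True once a code red ticket has gone an hour without action."""
    return minutes_without_action >= RED_MANAGER_NOTIFY_MIN


if __name__ == "__main__":
    print(escalate("green", 80))
    print(escalate("yellow", 12))
    print(red_needs_manager_notification(61))
```

    Deescalation is left out of the sketch because the procedure makes it a judgment reached through communication between technicians, engineers, and vendors rather than a timed rule.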

     


