Everybody has troubles — especially IT professionals. But for obvious reasons, they don’t have the luxury to ignore them, run from them, put them off, or pass them on to somebody else. Doing so would undermine availability. Perhaps the availability of a server or a desktop, or even a workforce. No, in the world of information technology, we document our troubles. We track them. And we wrestle them to the ground to restore precious availability. We write down what we think is wrong, and we write down what we did to fix it. If we can’t find a permanent solution, we put a workaround in place. And we have to document everything, because if it’s not documented, it didn’t happen. It’s all part of trouble ticket excellence — and it’s a lot of work.
Everything is digital now, but in the early days of IT troubleshooting, engineers were given a card or paper slip that described the problem, its location, and other details. Until recently — and perhaps even now — repairmen of all sorts worked from paper tickets torn from service books. Small automotive repair shops, for instance, still use this method. You would be hard pressed to find any IT department still using the paper method for assigning and tracking trouble tickets. We are talking about IT professionals, after all.
A blog post by the customer support company Kayako tells the history of customer service, including the advent of trouble ticketing. Blogger Varun Shoor writes that Ron Muns founded the Help Desk Institute (HDI) in 1989. The aim was to give employees a central place where they could get technical support for their IT infrastructure. HDI continues to provide education, certification, and other membership benefits.
So who might use a trouble ticket? The answer is simple: everyone. When a system user calls customer support or the help desk to report a problem, the first-level (or Tier 1) support person will open a trouble ticket. The greatest skill required of Tier 1 personnel is the management of issues. They may ask the user many questions and may do basic technical troubleshooting. As they work, they are continually typing in the ticket. If they cannot solve the problem, they may forward the ticket to a Tier 2 engineer. Technical escalation may mean the ticket goes on to a Tier 3 engineer, or even to an equipment vendor, who is sometimes referred to as Tier 4.
Trouble tickets are used by departments called help desks, service operation centers (SOCs), or network operations centers (NOCs). Service personnel may be divided into groups based on level of support or the type of platform. But it’s not just technical people who use trouble tickets. Managers get involved when management escalation is required. And tickets are used to crunch numbers for reports. Company executives may want to see such key performance indicators (KPIs) as:
Just as technical maintenance can be viewed as either proactive or reactive, trouble ticket systems use two approaches to issues. A reactive ticket is generally one that is opened because an issue already exists. If the issue affects just one user, the ticket will be dealt with as a problem unique to that customer. But if the Tier 1 tech or customer service rep knows of a broader system outage, they may want to attach the ticket to an “event” ticket. The individual ticket then has a subordinate relationship to the event ticket, which may be described as master-slave, parent-child, or with some other terminology. In this case, when the master ticket is updated or closed, the same will happen to every subordinate ticket.
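To make the event-ticket idea concrete, here is a minimal Python sketch of a parent-child relationship. The `Ticket` class, its field names, and the ticket IDs are all hypothetical, invented for illustration; real ticket systems model this in their own ways.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical ticket record; not modeled on any real product."""
    ticket_id: str
    summary: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)
    # repr/compare are disabled on the links to avoid infinite recursion.
    parent: Optional["Ticket"] = field(default=None, repr=False, compare=False)
    children: List["Ticket"] = field(default_factory=list, repr=False, compare=False)

    def attach_to(self, event_ticket: "Ticket") -> None:
        """Make this ticket subordinate to an event (parent) ticket."""
        self.parent = event_ticket
        event_ticket.children.append(self)

    def add_note(self, text: str) -> None:
        """Record an update and copy it down to every subordinate ticket."""
        self.notes.append(text)
        for child in self.children:
            child.add_note(f"[from {self.ticket_id}] {text}")

    def close(self, resolution: str) -> None:
        """Close this ticket and everything attached beneath it."""
        self.add_note(f"Resolved: {resolution}")
        self._set_closed()

    def _set_closed(self) -> None:
        self.status = "closed"
        for child in self.children:
            child._set_closed()


event = Ticket("EVT-1", "Regional outage")
user_ticket = Ticket("TKT-2", "User cannot reach email")
user_ticket.attach_to(event)

event.add_note("Vendor engaged")
event.close("Link restored")
print(user_ticket.status)  # closed
```

Updates flow in one direction only, from the event ticket down. That seems to match the behavior described above, where work on the master ticket is reflected in the subordinate tickets.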
Automation is changing so much of the trouble management process. Proactive tickets can be opened by techs who recognize a problem, but the more common scenario these days is that a ticket is automatically opened and sent to the appropriate personnel when an issue occurs. Ticket generation could be triggered by a specific event, such as a connection going down, or by a parameter crossing a predetermined threshold. For instance, when a set number of errors occur, the system could open a ticket. The same goes when traffic on a line becomes congested or a link is no longer available. When the numbers become too large, automation will generate tickets.
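As a rough illustration of threshold-based ticket generation, the sketch below opens a ticket once a source has logged a set number of errors, or immediately when a link goes down. The `ThresholdMonitor` and `AutoTicket` names, the source labels, and the threshold of three are assumptions made up for this example, not features of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AutoTicket:
    """Hypothetical automatically generated ticket."""
    source: str
    reason: str


class ThresholdMonitor:
    """Opens one ticket per source when an event or threshold fires."""

    def __init__(self, error_threshold: int) -> None:
        self.error_threshold = error_threshold
        self.error_counts: Dict[str, int] = {}
        self.open_tickets: Dict[str, AutoTicket] = {}

    def record_error(self, source: str) -> Optional[AutoTicket]:
        """Count an error; open a ticket when the count reaches the threshold."""
        count = self.error_counts.get(source, 0) + 1
        self.error_counts[source] = count
        if count >= self.error_threshold:
            return self._open(source, f"{count} errors (threshold {self.error_threshold})")
        return None

    def record_link_down(self, source: str) -> Optional[AutoTicket]:
        """A specific event opens a ticket right away, with no counting."""
        return self._open(source, "link down")

    def _open(self, source: str, reason: str) -> Optional[AutoTicket]:
        # Avoid flooding the desk: at most one open ticket per source.
        if source in self.open_tickets:
            return None
        ticket = AutoTicket(source, reason)
        self.open_tickets[source] = ticket
        return ticket


monitor = ThresholdMonitor(error_threshold=3)
for _ in range(3):
    ticket = monitor.record_error("link-a")
print(ticket)
```

The one-ticket-per-source check is a design choice of this sketch. Without some form of deduplication, a flapping link could generate a new ticket for every error.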
Incident management is not the only role of ticket systems. They can also be used for administrative tasks, such as customer notification or tracking routine tasks. They can also be integrated with a company’s change control system. Change control, including methods of procedure, is a way to document moves and changes affecting hardware and software in a live system.
The potential for database integration is great. Many companies cleverly link trouble ticketing systems with their knowledge base. This way a technician or engineer can find clear procedures and solutions for dealing with the problems before them.
Trouble tickets are meant to address troubles. Those could be just about anything, but they often impact the availability of some system or another. The methods that IT professionals use to deal with these troubles have developed over time. It’s only natural that things break down, become obsolete, or need special attention. It can be overwhelming.
So how do you keep up? That’s the subject of an article in Techopedia called “IT Infrastructure: How to Keep Up”. The author deals with infrastructure management from the early days to the present. But whether we are talking about old-school replacement of vacuum tubes or futuristic analytics, we still need documentation.
What do we mean by “ticket excellence”? The term comes from a standard of problem documentation that was promulgated in the Sprint organization in the 1990s. Sprint had a document that gave clear instruction on how to write ticket notes, how to handle tickets, and how to do it all well. The specifics of the internal procedures are not important here. The idea is that any IT professional or organization that tracks problems in a ticket system should do so in an excellent way and according to the highest standards.
Here are some principles that perhaps should be included in any company’s “ticket excellence” standards:
We wrote recently in this space about service level agreements (SLAs). The primary focus of the article was the matter of downtime and uptime and what is promised to the customer. But SLAs could cover so much more. One area that SLAs cover is the amount of time required to fix particular issues.
To manage tickets properly, you need to prioritize them. Ticket systems do this by assigning a priority code to each ticket. There are different ways to do that. A system may include a word for each priority:
Or a ticket system might assign a priority number that will dictate how urgently tickets are worked:
These priorities should be clearly defined in the SLA. A P1 may be a critical issue that deals with a major availability event or is related to a VIP customer. A P5 may be an administrative task that could be done when the desk is quiet. Each of these may have time limits. A P1 may need to be resolved in 4 hours, a P2 in 8 hours, and so on. Any ticket that is not resolved within the SLA time requirement is considered “outside of SLA”. That’s when people can turn up the heat.
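The SLA clock can be sketched in a few lines of Python. The 4-hour and 8-hour limits for P1 and P2 come from the example above; the limits for P3 through P5 are placeholder assumptions, since a real SLA would define its own values. The function names are hypothetical as well.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hours allowed to resolve a ticket at each priority.
SLA_HOURS = {
    "P1": 4,    # from the example in the text
    "P2": 8,    # from the example in the text
    "P3": 24,   # assumed for illustration
    "P4": 72,   # assumed for illustration
    "P5": 120,  # assumed for illustration
}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the time by which the ticket must be resolved."""
    return opened_at + timedelta(hours=SLA_HOURS[priority])


def is_outside_sla(
    priority: str,
    opened_at: datetime,
    now: datetime,
    resolved_at: Optional[datetime] = None,
) -> bool:
    """True if the ticket was resolved late, or is still open past its deadline."""
    end = resolved_at if resolved_at is not None else now
    return end > sla_deadline(priority, opened_at)


opened = datetime(2024, 1, 1, 9, 0)
print(sla_deadline("P1", opened))                                    # 2024-01-01 13:00:00
print(is_outside_sla("P1", opened, now=datetime(2024, 1, 1, 13, 1)))  # True
```

This sketch counts wall-clock hours. Some SLAs count only business hours or pause the clock while waiting on the customer, which would need a more elaborate calculation.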
The way people turn up the heat is by using SLA-defined escalation. This means that an issue goes further up the chain, either technically or in management. As for technical escalation, if a Tier 1 tech can’t handle a problem, they send the ticket to a Tier 2 department or tech, with an appropriate explanatory handoff. If the Tier 2 engineer can’t fix it, the ticket goes to Tier 3. Management escalation means that supervisors, directors, and even vice-presidents or CEOs get involved in a ticket. You may think that technical issues are about machines, but they are also about people.
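Technical escalation can be sketched as a ticket moving up an ordered list of tiers, with a handoff note required at each step. The `EscalatableTicket` class and its rules are hypothetical, a simplified picture that leaves out management escalation entirely.

```python
from dataclasses import dataclass, field
from typing import List

# Technical tiers in escalation order; the vendor is sometimes called Tier 4.
TIERS = ["Tier 1", "Tier 2", "Tier 3", "Vendor (Tier 4)"]


@dataclass
class EscalatableTicket:
    """Hypothetical ticket that tracks its current technical tier."""
    summary: str
    tier_index: int = 0
    handoffs: List[str] = field(default_factory=list)

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def escalate(self, handoff_note: str) -> str:
        """Move the ticket up one tier; a handoff note is mandatory."""
        if not handoff_note.strip():
            raise ValueError("escalation requires an explanatory handoff note")
        if self.tier_index == len(TIERS) - 1:
            raise RuntimeError("no higher technical tier is available")
        from_tier = self.tier
        self.tier_index += 1
        self.handoffs.append(f"{from_tier} -> {self.tier}: {handoff_note}")
        return self.tier


ticket = EscalatableTicket("VPN drops every 10 minutes")
ticket.escalate("Reinstalled client and rebooted; issue persists")
print(ticket.tier)  # Tier 2
```

Rejecting an empty note is one way to enforce the explanatory handoff in software, so the next engineer never receives a ticket without context.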
That doesn’t prevent people from turning as much over to the machines as possible. We don’t use paper tickets, and IT companies around the world continue to look for ways to make the detection and resolution of IT problems as automatic as possible. With redundant systems and self-healing networks, many problems don’t even show up on a surveillance technician’s radar.
In a white paper by the company Wipro called Infrastructure Automation and Analytics, the author discusses the history of dealing with IT infrastructure:
Just as infrastructure management and maintenance required a lot of attention in the early days of IT, we can see the parallels in trouble documentation. Issues that might have been complicated and confusing long ago are handled much better now that we have so much experience with them. And many problems that required a lot of written analysis might be left to computers nowadays. The work of IT problem management continues to advance.
In an IT organization, ensuring high levels of availability for users and devices across the organization starts with trouble ticket excellence. If you can’t figure something out, it often helps to write it all out and analyze it. Trouble ticket excellence is how this analysis becomes a collaborative effort. It can be frustrating, and exhilarating, and sometimes it’s just plain fun! Anyone who wants to go into IT should know how to communicate. It’s extremely important. Documenting an issue with ticket excellence can make or break an IT business.