How a reliability and automation engineer is turning alarm data into operational intelligence across complex energy facilities.

A SHALE Exclusive By Ellen F. Warren

Sandeep R. Kondaveeti, PhD, P.Eng., is a senior reliability and automation engineer with deep experience in the oil and gas industry, where he has worked across upstream and downstream operations to improve the safety, stability, and performance of large-scale process facilities. Trained as a chemical engineer, Kondaveeti earned his doctorate from the University of Alberta, focusing on advanced analysis and redesign of industrial alarm systems—work that would go on to influence how alarm management is practiced in operating energy assets.

Over more than fifteen years, Kondaveeti has held technical and leadership roles spanning industrial automation software environments, oil sands operations, and enterprise reliability functions. His career has been shaped by a recurring challenge in oil and gas control rooms: alarm systems that meet standards on paper, yet overwhelm operators during abnormal situations. Rather than treating alarms as static configuration artifacts, he has focused on understanding how alarm systems behave in real operating conditions, how operators respond, and how that information can be used to drive better engineering and operational decisions.

Kondaveeti’s work has been published in peer-reviewed journals and presented at international conferences. His original technical contributions introduced quantitative methods for identifying nuisance alarms, graphical techniques for visualizing alarm behavior at scale, and automated reporting systems that embed alarm performance monitoring into routine operations. In industrial deployments, particularly within oil and gas facilities, these approaches have been used to reduce alarm load, improve operator effectiveness, and support safer and more reliable operations over time. Collectively, his work bridges research and practice, translating alarm data into actionable insight for engineers, operators, and leadership responsible for complex energy infrastructure.

ELLEN WARREN (EW): Sandeep, alarm management is a relatively specialized area within process control. What initially drew you to this problem during your doctoral research, and what convinced you it was important enough to pursue beyond academia and into operating oil and gas facilities?

SANDEEP KONDAVEETI (SK): What first drew me to alarm management during my doctoral research was my passion for data analytics and its ability to extract meaning from complex, noisy systems. I was fascinated by how modern process plants generate enormous volumes of data, yet operators are often forced to make decisions based on poorly structured or overwhelming alarm information. Alarm systems bring together control theory, human factors, and data science, a natural fit for someone focused on turning raw data into actionable insight.

What convinced me this work had to go beyond academia was its direct industry relevance. While process control theory has advanced significantly over the past decades, what is implemented in industrial environments is only a small fraction of what is theoretically possible. This gap exists because real plants must deal with unreliable sensors, aging actuators, nonlinear dynamics, and highly coupled processes. Alarm systems reflect these realities more clearly than almost any other layer of control. I realized that improving alarm performance was more than a theoretical optimization problem; it was a practical way to improve safety, reliability, and operator effectiveness in operating oil and gas facilities. That combination of analytical depth and tangible operational impact is what motivated me to take this work from the lab into live plants.

EW: You have spent much of your career working with alarm systems in oil and gas facilities. When you walk into a control room today, what are the most common alarm-related problems you still see—and why do they persist despite decades of standards and guidance?

SK: What I see in most control rooms today is that the core alarm problems are surprisingly consistent. The first is the absence of a well-defined alarm philosophy that truly aligns with standards such as ISA-18.2. Without that foundation, alarm limits, priorities, and classifications are set inconsistently, often based on habit rather than risk. The second issue is that many sites do not follow a true lifecycle approach to alarm management. Alarms are added during projects or troubleshooting efforts, but rarely reviewed systematically as the process, control strategy, or operating conditions evolve.

I also frequently see poor alignment between alarms and operator graphics. Inefficient or cluttered display themes make it difficult for operators to quickly interpret alarms in context, which defeats the purpose of having them in the first place. Another problem is that suppressed and shelved alarms are often poorly governed. Suppression becomes a workaround rather than a managed safety practice, creating blind spots during abnormal situations.

These problems persist despite decades of standards and guidance because alarm management is still too often treated as a late-stage configuration task rather than a design discipline. The solution is not more rules, but earlier emphasis on alarm rationalization during project initiation and stronger education around the intent of the standards, not just their checklists. When alarm management is designed deliberately from the start, rather than patched in later, most of these chronic issues become preventable.

EW: In the oil and gas industry, alarm management is often framed as a compliance requirement tied to ISA-18.2 or internal audits. From your experience, what’s the practical difference between an alarm system that is compliant and one that actually supports operators during an upset?

SK: The practical difference lies in how operators comprehend and act during an upset. A compliant alarm system may meet documentation and audit requirements, but an effective alarm system is one that helps control room operators quickly understand what is happening based on the alarms they are presented with, identify the root cause efficiently, and recover the process before further escalation occurs. In other words, compliance proves the system exists; performance is proven by how well it supports human decision-making under pressure. The real test of an alarm system is not whether it satisfies a standard, but whether it shortens the time from detection through diagnosis to corrective action when the process is moving toward an unsafe state.

EW: Your work repeatedly highlights the interaction between alarms, operator actions, and advanced process control (APC). In real oil and gas operations, how do poorly designed alarm systems interfere with APC performance during abnormal events?

SK: From an independent protection layer perspective, the alarm system sits above basic and advanced control as the first layer that explicitly requires human intervention. In real operations, by the time an alarm is presented through a properly designed alarm system, advanced process control has often already been degraded, shed, or switched out of service automatically to protect the process. At that point, the operator becomes the primary stabilizing element.

Poorly designed alarm systems interfere with this handoff. When alarms are excessive, poorly prioritized, or lack clear meaning, operators struggle to distinguish between symptoms and root cause. Their actions—or delayed actions—can inadvertently disrupt a still-functioning APC layer, forcing it out of service prematurely or preventing it from recovering once conditions normalize. Instead of supporting APC during abnormal events, a bad alarm system competes with it for attention and control authority. The result is a feedback loop where degraded alarms lead to degraded operator response, which in turn accelerates the loss of APC performance and drives the process further from its optimal operating envelope.

EW: You introduced graphical tools such as high-density alarm plots and alarm similarity maps to analyze large alarm datasets. Why were these kinds of visual tools especially important for complex oil and gas facilities with thousands of configured alarms?

SK: In these types of facilities, alarm data quickly becomes too large and complex to interpret using tables or event logs alone. As the saying goes, a picture is worth a thousand words, and in this context it is often worth thousands of alarm records. Visual tools such as high-density alarm plots and alarm similarity maps transform massive alarm datasets into patterns that engineers can understand at a glance, revealing clustering, repetition, and escalation behavior that would otherwise remain hidden.

These visualizations make it much easier to identify dominant “bad actor” alarms and groups of alarms that behave similarly during events. Instead of manually sifting through tens of thousands of records, control engineers can see where the real problems lie and focus their effort on rationalization and design improvements. I have found that in complex facilities, these graphical tools are essential for turning alarm data into actionable insight.
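
As an editorial illustration of how a similarity map might be computed (a minimal sketch, not Kondaveeti's published implementation), the Python below bins each alarm's annunciation times into fixed windows and scores every pair of alarms with the Jaccard index of the windows they share. The 10-minute bin width, the choice of Jaccard similarity, and the tag names and timestamps are all assumptions made for the example.

```python
from itertools import combinations

def active_bins(timestamps, bin_seconds=600):
    """Return the set of time-bin indices in which an alarm annunciated."""
    return {int(t // bin_seconds) for t in timestamps}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_map(alarm_log, bin_seconds=600):
    """alarm_log maps tag -> list of annunciation times in seconds.
    Returns {(tag_a, tag_b): similarity} for every pair of tags."""
    bins = {tag: active_bins(ts, bin_seconds) for tag, ts in alarm_log.items()}
    return {
        (a, b): jaccard(bins[a], bins[b])
        for a, b in combinations(sorted(bins), 2)
    }

# Hypothetical tags and timestamps, for illustration only.
log = {
    "FI-101.LO": [30, 650, 1300],
    "PI-102.HI": [45, 700, 1250],
    "TI-203.HI": [5000],
}
sim = similarity_map(log)
```

In this toy log the flow and pressure alarms always fall in the same windows, so they score as highly similar and would be candidates for joint review, while the temperature alarm stands apart. Rendering the pairwise scores as a color-coded matrix gives the map-style view described above.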

EW: Chattering and related alarms are widely recognized problems in energy facilities, yet they are often addressed using rules of thumb. What changed when you began quantifying alarm chatter using run-length distributions instead of relying on generic thresholds?

SK: Using run-length distributions fundamentally changed the problem from a subjective judgment to an objective measurement. Earlier approaches relied on rules of thumb—such as declaring an alarm “chattering” if it repeated a certain number of times in a minute—which were heuristic and often arbitrary. By analyzing historical alarm data and characterizing each alarm through its run-length distribution, every configured alarm could be quantitatively ranked based on how frequently and how quickly it reoccurred. This removed the guessing game and replaced generic thresholds with data-driven metrics, allowing engineers to identify and prioritize true nuisance alarms systematically rather than by intuition alone.
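
To make the idea concrete, here is a minimal Python sketch of a run-length-based chatter score. It assumes a run length is the gap, in whole seconds, between consecutive annunciations of the same alarm, and it weights each run length r by 1/r so that rapid repeats dominate. That weighting follows the general form of chatter indices described in the alarm management literature; the rounding, the one-second floor, and the tag names are assumptions for illustration rather than a reproduction of the published method.

```python
from collections import Counter

def run_length_distribution(timestamps):
    """Normalized distribution of gaps (whole seconds, minimum 1)
    between consecutive annunciations of one alarm."""
    ts = sorted(timestamps)
    runs = [max(1, round(b - a)) for a, b in zip(ts, ts[1:])]
    total = len(runs)
    counts = Counter(runs)
    return {r: n / total for r, n in counts.items()} if total else {}

def chatter_index(timestamps):
    """Sum of P(r) / r over run lengths r. Equals 1 when the alarm repeats
    every second and approaches 0 as repeats move further apart."""
    return sum(p / r for r, p in run_length_distribution(timestamps).items())

# Hypothetical alarm histories (seconds), for illustration only.
history = {
    "LI-301.HI": [0, 1, 2, 3, 4],    # repeats every second
    "FI-101.LO": [0, 600, 1200],     # repeats every ten minutes
}
ranked = sorted(history, key=lambda tag: chatter_index(history[tag]), reverse=True)
```

Sorting every configured alarm by this score yields the kind of ranked list described above: the level alarm that re-annunciates every second rises to the top, with no need for a "repeats per minute" cutoff.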

EW: A major step in your work was moving from one-time alarm studies to automated, periodic alarm performance reporting deployed in operating oil and gas facilities. Why was automation essential for making alarm management sustainable in an industrial environment?

SK: Automation was essential because alarm management is not a one-time engineering task; it is a continuous lifecycle activity. While standards such as ISA-18.2 call for periodic assessment, many facilities avoid doing it simply because manual analysis of alarm data is time-consuming and resource-intensive. By automating alarm performance reporting, the effort required to monitor, diagnose, and prioritize problems is dramatically reduced. This shifts alarm management from an occasional project to a routine operational practice, making it practical for facilities to follow the lifecycle approach instead of postponing or abandoning it due to workload constraints.
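
A periodic report of this kind can start from a handful of aggregate metrics. The Python sketch below is an editorial illustration, not the reporting system deployed in any facility: it computes the total alarm count, the average alarms per 10-minute window, the share of windows in flood, and the most frequent tags. The window length, the flood threshold, and the event format are assumed, site-configurable parameters.

```python
from collections import Counter

def periodic_report(events, period_seconds, window_seconds=600,
                    flood_threshold=10, top_n=3):
    """events: list of (time_seconds, tag) with times in [0, period_seconds).
    flood_threshold is an assumed limit on alarms per window."""
    n_windows = -(-period_seconds // window_seconds)  # ceiling division
    per_window = Counter(int(t // window_seconds) for t, _ in events)
    flood_windows = sum(1 for n in per_window.values() if n > flood_threshold)
    return {
        "total_alarms": len(events),
        "avg_per_window": len(events) / n_windows,
        "pct_windows_in_flood": 100.0 * flood_windows / n_windows,
        "top_tags": Counter(tag for _, tag in events).most_common(top_n),
    }

# Hypothetical one-hour log: a burst of 12 alarms plus 3 scattered ones.
events = [(i * 5, "PI-102.HI") for i in range(12)]
events += [(1500, "FI-101.LO"), (2400, "FI-101.LO"), (3300, "TI-203.HI")]
report = periodic_report(events, period_seconds=3600)
```

Scheduling a function like this against the alarm historian each day or week is what turns the assessment from an occasional study into a routine output that needs no manual analysis.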

EW: In your automated reporting framework, you didn’t just track alarm counts; you included suppressed alarms, standing alarms, safety bypasses, and the duration of unresolved issues and exceptions. How does this kind of reporting change how oil and gas organizations prioritize risk and maintenance work?

SK: Including these factors changes alarm reporting from a performance dashboard into a risk dashboard. Each of these elements represents a different type of hidden exposure that can easily be overlooked by control room operators. Safety bypasses are particularly critical because they disable an independent protection layer that sits above the alarm system itself. Poorly managed suppressions can blindside operators by masking abnormal conditions, while long-standing alarms often signal underlying equipment or process problems that have been normalized over time.

By consolidating all of these indicators into an automated report, organizations can see technical debt and operational risk in one place instead of scattered across systems and shift logs. This shifts prioritization away from reacting to the loudest alarm problems and toward addressing the most consequential ones, allowing maintenance and operations teams to focus on issues that truly increase safety and reliability risk.

EW: You have documented cases where alarm analysis led to concrete engineering changes, such as retuning control loops, removing redundant alarms, improving condition-based alarming, or identifying unreliable instrumentation. What distinguishes alarm data that leads to real action from data that simply gets reviewed and forgotten?

SK: Alarm data leads to real action when it is connected to operator behavior and process recovery, not just counted and summarized. Reviewing alarm rates or top tags alone often results in reports that look informative but do not point clearly to what should change. What makes the data actionable is analyzing alarms together with operator actions and subsequent process response—whether an alarm triggered a meaningful intervention, how long it took for the condition to return to normal, and whether the same pattern repeats.

When that linkage is made, the data naturally drives engineering decisions: retuning a loop that never stabilizes after an alarm, removing redundant alarms that never prompt action, improving condition-based alarming where operators consistently respond late, or flagging instruments whose noise dominates operator workload. Actionable alarm data tells a cause-and-effect story; forgotten alarm data is just a list of symptoms.
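
One simple way to express the alarm-to-action linkage is to pair each annunciation with the first operator action that follows it. The Python sketch below is an editorial illustration under stated assumptions: alarm and action times come from separate logs already filtered to the same tag or loop, and an action counts as a response only if it occurs within an assumed 5-minute window. The function name and the sample times are hypothetical.

```python
from bisect import bisect_left

def response_stats(alarm_times, action_times, window_seconds=300):
    """For one alarm tag: fraction of annunciations followed by an operator
    action within window_seconds, and the mean delay to that action."""
    actions = sorted(action_times)
    delays = []
    for t in alarm_times:
        i = bisect_left(actions, t)  # first action at or after the alarm
        if i < len(actions) and actions[i] - t <= window_seconds:
            delays.append(actions[i] - t)
    n = len(alarm_times)
    return {
        "response_rate": len(delays) / n if n else 0.0,
        "mean_delay": sum(delays) / len(delays) if delays else None,
    }

# Hypothetical times in seconds: two alarms acted on, one ignored.
stats = response_stats(alarm_times=[100, 1000, 5000],
                       action_times=[160, 1240, 9000])
```

An alarm with a response rate near zero is a candidate for removal or re-rationalization, while a high response rate combined with long delays points toward the late-response cases mentioned above.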

EW: Oil and gas operators often manage fleets of similar assets such as pads, trains, compressor stations, or units. How can alarm–action patterns be used to identify best practices across assets without ignoring site-specific operating realities?

SK: Alarm–action patterns allow similar assets to be compared based on how operators respond to the same abnormal situations. Differences in response time and recovery behavior reveal which sites handle upsets more effectively, turning those patterns into candidates for best practice rather than relying on anecdotes. At the same time, the comparison must account for local constraints such as equipment condition and operating limits. This approach supports standardization where it makes sense, without ignoring site-specific realities.

EW: Across your research, industrial implementations, and long-term responsibility for operating facilities, what do you believe has had the greatest impact on how oil and gas organizations understand, manage, and sustain alarm system performance—and how are you now working to address the technical or organizational gaps that still remain?

SK: The greatest impact has come from shifting alarm management from a one-time engineering exercise into a measurable, continuous performance discipline. By turning alarm data into objective metrics and automated reports, organizations began to see alarm systems as assets that require ongoing monitoring, much like control loops or rotating equipment. This changed how alarm performance is understood and sustained—moving it out of project mode and into normal operational accountability.

The gaps that remain are less technical and more organizational: alarm management still competes with production priorities, and ownership is often unclear once a project is complete. My current focus is on closing that gap by embedding alarm performance into routine operations—linking alarm behavior to operator response, maintenance workload, and risk exposure—so that alarm system health is managed proactively rather than reactively. The goal is to make strong alarm management a default part of day-to-day operations.

Sandeep is a senior reliability and automation engineer with more than 15 years of experience across oil and gas, power generation, and process automation, and a PhD focused on industrial alarm systems. 

Amanda Jenkins
Amanda Jenkins is Vice President & Washington Bureau Chief at Energy Network Media Group, where she leads digital publishing operations and website management across the company’s media platforms. She oversees content workflows, platform optimization, SEO performance, and multimedia execution, ensuring content is produced efficiently and presented with accuracy and credibility. With a background in journalism and digital communications, Amanda brings a practical, systems-driven approach to managing media operations across digital and broadcast channels. While her role is focused on operational leadership, she remains closely connected to the editorial process and continues to contribute written and video-based explainers, reflecting her ongoing passion for writing, education, and clear reporting.
