I was recently asked to investigate two OT / SCADA networks: one plant network supporting distributed SCADA / PLC, and a second 8-node control network linking PLC to RIO. The system was suffering from unpredictable network behaviour that had impacted operation - never a great selling point for a new SCADA deployment! Following the initial conversation and a paper review of the related network schematics, it became evident that finding the root cause of the issue would be more of a challenge than usual. With only one exception, the entire backbone OT network was unmanaged.
OT systems are rarely designed to perform poorly, yet OT networks are rarely designed for optimal availability and performance. This conundrum is probably the result of a well-intentioned desire to achieve a short-term commercial goal over a longer-term high-availability goal. I believe it is achieving the latter that is more likely to impress.
IT4A recommend, for good reason that I will explain, that network backbones and other key areas of inter-connection / access are served by managed switches. Whether you go for managed switches that deliver blinding performance, the highest security or simply an insight into how the network is functioning comes down to perceived threat, individual appetite for risk and, of course, budget. IT4A have some resources that may help with decision making.
Threat & Vulnerability
Network threats are not all Cyber related. A threat relates to an asset and, more specifically, the vulnerabilities that exist within that asset. A vulnerability could be the selection of a link speed that cannot support peaks of demand, or it could be compromise through continued use of a legacy operating system without mitigation. Either way, if the respective threat were to play out, it could bring the respective OT system to its knees.
In this case, the asset was a SCADA controlled production line, the threat was unreliable network access, and the vulnerability was the use of featureless, unmanaged switches in the design.
In general terms risk is the likelihood of a vulnerability being exercised. We know from experience that unreliable network access tends to be a result of one or more of the following:
- Auto-negotiation mismatch
- Sub-standard cabling
- Link capacity exceeded – bottlenecks
- Hardware failures
- Cyber attack
- Uncontrolled broadcasts resulting from loops and/or ineffective segmentation.
The greater the potential impact of a risk, the higher the likelihood that steps will/should be taken to mitigate it – makes sense? The more critical the infrastructure, the more likely the asset owner will want to reduce the risk to low and avoid the impact as far as reasonably possible. In the Nuclear and other CNI sectors this is referred to as ALARP (As Low As Reasonably Practicable). In networking, risks are mitigated through the effective deployment of product features; most of these features exist only in managed switches.
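One common way to make this reasoning concrete is to score each cause by likelihood and impact and rank the results. The sketch below is illustrative only: the 1 to 5 scales, the `Risk` class and every score are assumptions invented for the example, not figures from this investigation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    cause: str
    likelihood: int  # 1 (rare) .. 5 (almost certain); illustrative scale
    impact: int      # 1 (negligible) .. 5 (severe); illustrative scale

    @property
    def score(self) -> int:
        # Simple likelihood x impact product, a common risk-matrix convention.
        return self.likelihood * self.impact


def rank(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical scores for the causes listed above.
risks = [
    Risk("Auto-negotiation mismatch", 4, 3),
    Risk("Sub-standard cabling", 3, 3),
    Risk("Link capacity exceeded", 2, 4),
    Risk("Hardware failure", 2, 5),
    Risk("Cyber attack", 2, 5),
    Risk("Uncontrolled broadcasts", 3, 4),
]

for r in rank(risks):
    print(f"{r.score:>2}  {r.cause}")
```

The ranking is only as good as the scores fed into it; its value is in forcing the conversation about which risks deserve mitigation first.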
Managed Switch Investigation
In this case only one managed switch had been deployed in the Plant LAN. From the management console we could quickly determine whether client auto-negotiation had completed successfully and that other potential sources of link errors were not present. We did discover a duplex mismatch that related to what had been a problematic PC; the issue was resolved through manual setting, and the error counters that had been climbing stopped – a quick win! The 'must have' feature on the managed switch - port mirroring - was also available. Port mirroring provides the ability to copy data from one or more ports to a spare 'mirror port', from where an analyser can capture it and allow us to investigate what’s going on.
Next we set up a network monitor to determine how much broadcast traffic was present and how/if it was being controlled. Capturing broadcast traffic is easy as it appears on every port in the sub-network. Excessive broadcast traffic can impact the performance of lower spec devices such as PLCs and RTUs, especially older devices, as they must process each frame to determine whether it is relevant to them. This unavoidable diversion can hit the CPU’s ability to perform its day job of controlling the process.
Our monitor identified unwanted broadcast and multicast traffic landing on the PLC interface; whether this was the cause of performance issues was unknown. In terms of good practice network design, a single broadcast domain shared by independent applications such as video and control traffic should be avoided. Managed network features such as Virtual LANs can separate out broadcast areas (domains) for each application. Network segmentation optimises performance and increases security by limiting access.
A further use of the port-mirror feature allowed us to capture data from each managed interface for further off-line analysis. More feature-rich managed products would also have provided information on link usage stats, CPU loading and more.
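The "errors counting up" check can be sketched as a comparison of two counter snapshots taken a short interval apart. This is a minimal illustration, not the tooling used on site: the port names, the counter names (`fcs_errors`, `late_collisions`) and the values are hypothetical, and how the counters are read (console, SNMP, export) depends on the switch. FCS errors and late collisions are chosen because they are the classic symptoms on either side of a duplex mismatch.

```python
def rising_error_ports(before, after, counters=("fcs_errors", "late_collisions")):
    """Return {port: {counter: increase}} for counters that grew between snapshots."""
    flagged = {}
    for port, now in after.items():
        prev = before.get(port, {})
        # Only keep counters whose value has increased since the first snapshot.
        rising = {
            name: now.get(name, 0) - prev.get(name, 0)
            for name in counters
            if now.get(name, 0) > prev.get(name, 0)
        }
        if rising:
            flagged[port] = rising
    return flagged


# Hypothetical snapshots taken a few minutes apart.
snapshot_1 = {
    "port1": {"fcs_errors": 10, "late_collisions": 0},
    "port2": {"fcs_errors": 0, "late_collisions": 0},
}
snapshot_2 = {
    "port1": {"fcs_errors": 250, "late_collisions": 0},
    "port2": {"fcs_errors": 0, "late_collisions": 0},
}

print(rising_error_ports(snapshot_1, snapshot_2))
```

A port whose counters are non-zero but static is probably showing history; a port whose counters are still climbing deserves attention first.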
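As a rough illustration of how a monitor separates broadcast from other traffic, the sketch below classifies raw Ethernet frames by destination MAC address. The `classify` and `traffic_mix` helpers and the sample frames are made up for the example; a real capture would come from an analyser on the mirror port.

```python
from collections import Counter

BROADCAST = b"\xff" * 6  # ff:ff:ff:ff:ff:ff


def classify(frame: bytes) -> str:
    """Classify an Ethernet frame by its destination MAC (first 6 bytes)."""
    if len(frame) < 14:  # shorter than an Ethernet header
        return "malformed"
    dst = frame[:6]
    if dst == BROADCAST:
        return "broadcast"
    if dst[0] & 0x01:  # group bit set in the first octet means multicast
        return "multicast"
    return "unicast"


def traffic_mix(frames):
    """Return the fraction of frames in each class."""
    counts = Counter(classify(f) for f in frames)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Synthetic frames: destination MAC followed by padding up to header length.
padding = b"\x00" * 8
frames = [
    BROADCAST + padding,
    b"\x01\x00\x5e\x00\x00\x01" + padding,  # IPv4 multicast MAC prefix
    b"\x00\x11\x22\x33\x44\x55" + padding,  # ordinary unicast address
    BROADCAST + padding,
]

print(traffic_mix(frames))
```

What counts as "excessive" is device dependent, so the output is best read alongside the tolerance of the oldest PLC or RTU on the segment.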
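A segmentation plan can be sanity-checked before it is configured. The sketch below assumes a hypothetical plan that maps each switch port to a VLAN ID and the application it serves, and reports any VLAN shared by more than one application. The port names, VLAN IDs and application labels are invented for illustration.

```python
def mixed_vlans(port_plan):
    """Return {vlan: [applications]} for VLANs carrying more than one application."""
    apps_by_vlan = {}
    for vlan, app in port_plan.values():
        apps_by_vlan.setdefault(vlan, set()).add(app)
    return {
        vlan: sorted(apps)
        for vlan, apps in apps_by_vlan.items()
        if len(apps) > 1
    }


# Hypothetical plan: port -> (VLAN ID, application).
port_plan = {
    "port1": (10, "control"),
    "port2": (10, "control"),
    "port3": (10, "video"),    # video sharing the control VLAN
    "port4": (20, "video"),
}

print(mixed_vlans(port_plan))
```

An empty result means each application has its own broadcast domain, at least on paper.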
Unmanaged Switch Investigation
The lack of any managed interfaces at the control / field network layer meant a different and highly constrained approach was required if we were to answer at least some of the questions that would lead to determination of root cause. Rather than identify what was wrong with the network and fix it, the approach needed was to prove what did work correctly, narrowing down the areas where problems might lurk. This is more intrusive and far more time consuming; in the time available we needed some quick wins achievable within hours, not days. We decided the best use of time was first to benchmark the performance of the network. This would either give us confidence or point towards a bottleneck that we could focus in on. In practice, this meant creating a performance baseline across the unmanaged network using 2 IT4A test PCs able to make controlled transfers that were captured and later analysed.
The result of our series of tests, from source to network extremity, established there were no performance bottlenecks that could impact the throughput or response time across the control network.
By the end of the day, one problem within the managed network had been identified and resolved quickly, and confidence in all its cabling gained. As for the unmanaged network, our performance baseline found no fault with either switch operation or inter-switch performance. We did identify issues relating to network segmentation and poor security practices and flagged them. Whether or not there are cable issues in the field or mismatches in auto-negotiation we simply don’t know.
I feel we achieved as much as we could; we identified and solved a hard issue, determined performance was normal and identified opportunities for improvement. There was however a frustration (mainly mine) that a more conclusive assessment of network health could not be attained; this was due to the original unmanaged design that was out of my control. Short, medium and long-term recommendations were given to enhance the current network design, and tips were provided in case performance degrades on the unmanaged LAN.
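The analysis of a controlled transfer reduces to simple arithmetic: bytes moved, time taken, and a comparison against what the link should deliver. The sketch below is one way to express that under stated assumptions: the 100 Mbit/s link speed, the 80% threshold, the segment names and the measurements are all hypothetical, not values from this site.

```python
def throughput_mbps(bytes_sent: int, seconds: float) -> float:
    """Convert a timed transfer into megabits per second."""
    if seconds <= 0:
        raise ValueError("transfer time must be positive")
    return bytes_sent * 8 / seconds / 1_000_000


def below_baseline(results, link_mbps=100.0, min_fraction=0.8):
    """Return {segment: Mbit/s} for transfers slower than the expected fraction of link speed."""
    slow = {}
    for segment, (bytes_sent, seconds) in results.items():
        rate = throughput_mbps(bytes_sent, seconds)
        if rate < link_mbps * min_fraction:
            slow[segment] = round(rate, 1)
    return slow


# Hypothetical measurements: segment -> (bytes transferred, seconds taken).
results = {
    "switch-1": (100_000_000, 8.5),
    "switch-2": (100_000_000, 8.7),
    "switch-3": (100_000_000, 20.0),
}

print(below_baseline(results))
```

Repeating the same transfer from the source outwards, one switch hop at a time, shows where in the path any drop first appears.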
Of the many resources that exist to assess risk of a control philosophy, I am left wondering how many consider network switch selection in the assessment?
We had time to consider one network threat to reliable operation here - there are many others. Designing OT networks, especially critical network infrastructures, without considering threat, vulnerability or risk is to be avoided.
No surprises here: if you are thinking about updating an existing or implementing a new OT/SCADA network, please use managed switches in the backbone; whether you invest in resilience or security comes down to your appetite for risk. All OT networks need to provide reliable connectivity to the overarching process. By all means save a few ££ on product selection by using lowest cost unmanaged hardware in the panels to expand port counts, but don't compromise your system by using it in the backbone.
Skills & Competence
Companies seem to feel their engineers, on top of their day job, have all the network & cyber skills and know-how to support a valueless ‘white goods’ type sale; is this a realistic expectation?
If we are to avoid the potentially catastrophic consequences of a threat playing out, I feel we should be working and collaborating as part of our Customer’s team. At IT4A we have the skills and resources to design in and supply great products into projects where the desired outcome is sustainable.
Secure by Design
I believe a sustainable solution will include aspects of:
- Scope definition through collaboration
- Threat & risk mitigation by informed design
- Secure product selection
- Hardened configuration & test
- Integration, commissioning & baseline
- Effective reliable documentation
- Good administration
- Awareness through monitoring events and alerts
- Remedy through skills development / knowledge transfer
- Access to experienced support & maintenance
- Training to perform a role on the network not the product.
It is not unusual for network specifications to be limited to specifying technology, i.e. Ethernet, and possibly speed.
Until behaviours change the networks that support our critical infrastructure will remain unnecessarily vulnerable. Look first at what exists and how to protect what you have today – many lessons will be learned. Then take these lessons into your forward-looking strategy. If you want advice on product selection or supply or to know more about the protection of Operational Technology / SCADA systems please make contact.