Dynamic Root Cause Analysis

Finding the root cause of a line-level stop in a chain of machines requires deeper knowledge of the statuses of individual machines, but it brings very valuable insights to light.

We studied many different lines, machines and production processes in order to create a powerful root-cause analysis engine in PackOS. Now, we will need a little of your knowledge to configure it properly. The article below describes the various downtime types of machines, and how they affect the algorithms for root cause search. For this mechanism to work properly, you will need to make sure that the states of machines (like Holdup, Idle or Failure) in PackOS are classified correctly and reflect the real-life situation.

Every machine defines two properties for each stop:

  1. What is the stop on the machine itself

  2. What is the potential root cause

While the first property (the stop of the machine itself) is simply the result of rule evaluation (https://ilabo.atlassian.net/wiki/spaces/LW/pages/1795424257), the ‘potential root cause’ is the result of a search through the stoppages of external machines.

No root cause

First, note that the following states can neither have a root cause nor be the root cause:

  • Work

  • Changeover

  • Off

  • Inactive

  • NoLiveData

If any such state is encountered, it is skipped and the search continues.
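The skip rule above can be sketched as a simple filter. This is a minimal illustration, not PackOS code: the names NON_CAUSAL_STATES and skip_non_causal are hypothetical, and states are represented as plain strings.

```python
# Hypothetical sketch: states that can neither have nor be a root cause.
NON_CAUSAL_STATES = frozenset(
    {"Work", "Changeover", "Off", "Inactive", "NoLiveData"}
)


def skip_non_causal(states):
    """Return the states the search keeps, in their original order."""
    return [state for state in states if state not in NON_CAUSAL_STATES]


observed = ["Work", "Off", "Failure", "NoLiveData", "Holdup"]
remaining = skip_non_causal(observed)  # ["Failure", "Holdup"]
```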

Internal root cause

Failure state is usually directly identified by the sensor on the machine itself. It means that something prevents the machine from normal operation (e.g. a fault of some part). Such downtime becomes a potential root cause for this, or other machines.

The Lack of Components (also called Material Shortage) state is usually directly identified by the sensor on the machine itself, and it does not trigger a search for an external root cause. This stop means that the machine cannot produce because some component (e.g. caps) is missing. Such downtime becomes a potential root cause for other machines.

The Stop by operator state is the result of a manual operator intervention in machine operation. It is usually done on purpose (e.g. to solve an ongoing failure). That’s why, when PackOS spots the pattern of a Failure followed immediately by a Stop by operator, the former becomes the root cause of the latter.

Note the different behaviour on the line level:
It is important to know whether the line is waiting in a Failure or has been Stopped by an operator while the failure is being investigated. The line therefore shows both statuses on the line level, instead of blindly following the root cause as in every other case.
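The Failure / Stop by operator pattern can be illustrated with a short sketch. The Stop class, the link_operator_stops function and the max_gap parameter are all hypothetical; in particular, treating "immediately" as "a gap of at most max_gap seconds" is an assumption, not documented PackOS behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stop:
    state: str
    start: float  # seconds from an arbitrary reference point
    end: float
    root_cause: Optional["Stop"] = None


def link_operator_stops(stops, max_gap=0.0):
    """Mark a Failure as the root cause of a Stop by operator that follows it.

    max_gap (seconds) is an assumed tolerance for "immediately".
    """
    ordered = sorted(stops, key=lambda s: s.start)
    for previous, current in zip(ordered, ordered[1:]):
        gap = current.start - previous.end
        if (
            previous.state == "Failure"
            and current.state == "Stop by operator"
            and 0 <= gap <= max_gap
        ):
            current.root_cause = previous
    return ordered


history = [
    Stop("Failure", 0, 10),
    Stop("Stop by operator", 10, 30),  # follows the failure directly
    Stop("Stop by operator", 50, 60),  # unrelated manual stop
]
linked = link_operator_stops(history)
```

In this example only the second stop receives a root cause; the third one starts long after the failure ended and stays unlinked.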

External root cause

Holdup & Idle on the machine signify that the root cause is outside of the machine.

  • Holdup - appears if the machine cannot operate because its output has been blocked (usually by queued products). It indicates that the root cause is downstream from the machine.

  • Idle - appears if there is no input into the machine (e.g. no bottles at the entry of the filler). It indicates that the root cause is upstream from the machine.

Note the difference between ‘Idle’ and ‘Lack of Components’.
The first looks for an external problem (on a different machine in the flow), while the second assumes that the missing material comes from a direct infeed to the machine (one that is not explicitly monitored as a separate machine), and so does not look for an external root cause.
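The relationship between a machine state and the direction of the search can be summarised in a small lookup. This is only an illustrative sketch: SEARCH_DIRECTION and search_direction are hypothetical names, and None is used here to mean "no external search".

```python
from typing import Optional

# Hypothetical mapping: only Holdup and Idle trigger an external search.
SEARCH_DIRECTION = {
    "Holdup": "downstream",  # output blocked, cause lies after the machine
    "Idle": "upstream",      # no input, cause lies before the machine
}


def search_direction(state: str) -> Optional[str]:
    """Return the search direction, or None if no external search is done."""
    return SEARCH_DIRECTION.get(state)


holdup_direction = search_direction("Holdup")
shortage_direction = search_direction("Lack of Components")  # None
```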

There are two critical pieces of data for a successful root cause search:

1. Graph of connected machines

The graph defines the active set of machines and the connections between them. Only the machines in the active flow are searched for the root cause, and the connections between them determine the direction of the search.
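As a rough illustration of how the connections determine which machines are searched, the sketch below walks a flow graph in one direction. The machine names, the (source, target) pair representation and the machines_to_search function are assumptions made for this example and do not reflect the internal PackOS data model.

```python
from collections import deque


def machines_to_search(connections, start, direction):
    """List machines reachable from `start`, nearest first.

    connections: (source, target) pairs of the active flow.
    direction: "downstream" follows the flow, "upstream" goes against it.
    """
    neighbours = {}
    for source, target in connections:
        if direction == "downstream":
            neighbours.setdefault(source, []).append(target)
        else:
            neighbours.setdefault(target, []).append(source)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        machine = queue.popleft()
        for nxt in neighbours.get(machine, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


active_flow = [
    ("Depalletizer", "Filler"),
    ("Filler", "Capper"),
    ("Capper", "Labeller"),
    ("Labeller", "Packer"),
]
downstream = machines_to_search(active_flow, "Capper", "downstream")
upstream = machines_to_search(active_flow, "Capper", "upstream")
```

A machine that is not part of the active flow never appears in either list, so it is never considered as a root cause.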

The flow can be controlled in three ways:

  • By a SKU line configuration, which adjusts the flow after an order starts.

  • By a pre-processing function (https://ilabo.atlassian.net/wiki/spaces/LW/pages/1774420110), which can trigger a SetFlow command.

  • By a manual adjustment in settings.

2. In-sync history of machine downtimes

Let’s analyse the root cause search for a Holdup. Idle works exactly the same, but in the other direction.

To find the root cause of a Holdup, we analyse all stoppages of the downstream machines and look for stoppages that can be classified as a root cause (as described above). All other states are “transparent”. There is no “bouncing” the other way: when looking for the cause of a Holdup, we never look at the machines before the Base Availability machine, even if one of the downstream machines is in the Idle state.

If there was no breakdown at this point of time, the Holdup state would not find any root cause.
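The idea of "transparent" states can be sketched as follows. The set ROOT_CAUSE_STATES and the function holdup_candidates are hypothetical; including Stop by operator among the root-cause states is an assumption based on its place in the internal root cause group.

```python
# Assumed set of states that can act as a root cause for other machines.
ROOT_CAUSE_STATES = frozenset(
    {"Failure", "Lack of Components", "Stop by operator"}
)


def holdup_candidates(downstream_states):
    """Keep only downstream stoppages that qualify as a root cause.

    downstream_states: (machine, state) pairs for machines after the one
    in Holdup. Every other state, including Holdup and Idle, is transparent:
    it is passed over and never redirects the search upstream.
    """
    return [
        (machine, state)
        for machine, state in downstream_states
        if state in ROOT_CAUSE_STATES
    ]


with_breakdown = holdup_candidates(
    [("Capper", "Holdup"), ("Labeller", "Idle"), ("Packer", "Failure")]
)
without_breakdown = holdup_candidates(
    [("Capper", "Holdup"), ("Labeller", "Idle"), ("Packer", "Work")]
)
```

In the second call no downstream machine has a qualifying stoppage, so the list is empty and the Holdup ends up without a root cause.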

To account for full buffers and shorter reaction times, we consider the entire period that starts at the maximum reaction time (x) before the stoppage and ends at the moment the stoppage began, or even a while after it started (y). The extra time (y) lets us spot situations in which a driver did not “report” a breakdown as fast as it should have.

You can define ‘X’ and ‘Y’ for each machine in the machine settings.

The whole period (x+y) is analysed, and the first event in this period takes precedence and becomes the root cause.
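A minimal sketch of the window logic, under stated assumptions: times are seconds on a shared clock, a candidate counts if its stoppage begins inside the window, and the function name first_event_in_window is hypothetical.

```python
def first_event_in_window(stop_start, x, y, candidates):
    """Pick the earliest candidate stoppage inside [stop_start - x, stop_start + y].

    candidates: (machine, start_time) pairs for stoppages that qualify
    as a root cause. Returns None when nothing falls inside the window.
    """
    window_start = stop_start - x
    window_end = stop_start + y
    in_window = [
        candidate
        for candidate in candidates
        if window_start <= candidate[1] <= window_end
    ]
    return min(in_window, key=lambda candidate: candidate[1], default=None)


# A Holdup begins at t=100 with x=30 and y=5, so the window is 70..105.
root_cause = first_event_in_window(
    100,
    30,
    5,
    [("Packer", 95), ("Labeller", 80), ("Palletizer", 60), ("Wrapper", 104)],
)
```

Here the Palletizer stoppage at t=60 lies outside the window, so the Labeller stoppage at t=80 is the first event and becomes the root cause.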

In a string of many machines, it is vital to observe the relative delay of stoppages from the beginning of the observed reaction time (marked in pink +1s/+5s). Using that relative time, it is possible to indicate the stoppage that began first as the root cause because a stoppage that occurred relatively later could be a consequence of the problem rather than its cause.

The time that has passed since the start of the set “reaction time” is more important than the order of machines.
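One possible reading of this rule is sketched below: each candidate machine has its own window, and candidates are compared by how long after the start of their own window the stoppage began. The function pick_by_relative_delay and the tuple layout are hypothetical, and the exact comparison used by PackOS may differ.

```python
def pick_by_relative_delay(stop_start, candidates):
    """Choose the candidate with the smallest delay relative to its own window.

    candidates: (machine, start_time, x, y) tuples, where x and y are the
    per-machine settings. Returns (machine, relative_delay) or None.
    """
    best = None
    for machine, start, x, y in candidates:
        window_start = stop_start - x
        if not (window_start <= start <= stop_start + y):
            continue  # stoppage began outside this machine's window
        delay = start - window_start
        if best is None or delay < best[1]:
            best = (machine, delay)
    return best


# The observed stoppage begins at t=100.
# Labeller: x=5, window starts at 95, stoppage at 96 gives +1s.
# Packer:   x=10, window starts at 90, stoppage at 95 gives +5s.
chosen = pick_by_relative_delay(
    100,
    [("Labeller", 96, 5, 2), ("Packer", 95, 10, 2)],
)
```

Under this reading the Labeller is chosen with a relative delay of +1s, even though the Packer stopped one second earlier in absolute time and regardless of the machines' positions on the line.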

Specifying buffer delays is especially important when working with long lines. In that case root causes can take some time to propagate through machines.

You can see the searched period in the Work Spectrum view, marked with dark bars on the neighbouring machines.