| ... | @@ -27,7 +27,7 @@ Fault tolerance —both hardware and software— is achieved through some kind o |
... | @@ -27,7 +27,7 @@ Fault tolerance —both hardware and software— is achieved through some kind o |
|
|
|
|
|
|
|
|
|
The goal of the Reliability building block is to improve the reliability of the universAAL platform. It is therefore a vertical layer that cuts across all layers of universAAL, especially the Middleware. It does so by addressing two major reliability challenges while also improving system efficiency. The first action point is a framework that diagnoses system behaviour by detecting the faults that may occur during operation and deciding how to overcome them. Taking into consideration the existing components of the Middleware, the following components will be reused in the Diagnosis Framework: Context Events, the Context Bus and the Situation Reasoner [https://github.com/universAAL/context/wiki| (see Context Group wikipages for more details)]. The Diagnosis Framework should not add to the operational load of the platform or interrupt other services. Because the Middleware communicates through messages, the fault detection mechanisms also use message classification algorithms to categorize messages and differentiate all message types interacting in the platform. The Diagnosis Framework uses a knowledge base of rules that describe the behaviour of the system and define possible solutions. This knowledge base has to be fed continuously with new knowledge and cases so that it can decide in more and more use cases. A Fault Injection Framework has been implemented to create high-effort testing scenarios for a number of nodes in an uSpace. After such a check ends, a file of feedback results can be added to the knowledge base used by the Diagnosis Framework. In its final version, the Fault Injection Framework will be a bundle fully independent of the Middleware. This will also give universAAL administrators the ability to test the functionality of any uuSpace remotely.
The third bundle in the Reliability building block is the Time Triggered patch. It gives users of the universAAL platform the option of using time-triggered communication in their uuSpaces, where the underlying infrastructure already takes many reliability aspects into account (e.g. global time synchronization, reliable communication of critical events in the system).
|
|
|
|
|
|
|
|
==Artefact #1 : Failure Diagnosis Module in universAAL==
|
|
== Failure Diagnosis Module in universAAL ==
|
|
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
|
|
|
| ... | @@ -75,7 +75,7 @@ As diagnosis involves the backtracking from Failure to Fault, the knowledge abou |
... | @@ -75,7 +75,7 @@ As diagnosis involves the backtracking from Failure to Fault, the knowledge abou |
|
|
|
|
|
|
|
|
|
From the diagnosis point of view, the universAAL platform can be modelled as a whole that is divided into Fault Containment Regions (FCRs).
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/rel1.jpg|500px|center|Fault Containment Regions in MW]]
|
|
[[rel1.jpg|500px|center|Fault Containment Regions in MW]]
|
|
|
|
|
|
|
|
|
|
The following is a comprehensive list of Fault Containment Regions with their respective failure modes. Each FCR is listed with its input, output and rationale, so that its inclusion is justified.
|
|
|
|
|
The failure modes for each of the components are classified as follows.
|
| ... | @@ -400,14 +400,14 @@ In common day terminologies, detection and diagnosis are hardly separated. Commo |
... | @@ -400,14 +400,14 @@ In common day terminologies, detection and diagnosis are hardly separated. Commo |
|
|
|
|
The integrated diagnosis framework builds on the Context Bus in universAAL: any context event can indicate a symptom of a fault. It also uses the reasoning power of SPARQL and the Publish/Subscribe model in universAAL. The integrated diagnosis framework is depicted in the following figure.
|
|
|
|
|
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/DiagnosisFramework.png|600px|center]]
|
|
[[DiagnosisFramework.png|600px|center]]
|
|
|
|
|
|
|
|
|
|
From the context bus, the context events related to faults are taken as symptoms of a failure. These symptoms are analyzed using a priori knowledge of the FCR and the related static knowledge of the associated failure mode. They are then queried by the Reliability Reasoner with the help of the KB (Knowledge Base) and the [https://github.com/universAAL/ontology/wiki/Dependability Dependability Ontology]. The symptoms can be analyzed either with a rule-based approach or with a simple SPARQL query. The rules for the failure analysis reside inside the Reliability Reasoner. The reasoner then publishes a context event with the diagnosis information to the context bus. This diagnosis information includes the actions to be taken for the specific failure mode of that specific FCR.
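The flow from symptom to published diagnosis can be illustrated with a minimal Python sketch. It is not the Reliability Reasoner itself: the `Symptom` class, the `KNOWLEDGE_BASE` dictionary, the `diagnose` function and all FCR names, failure modes and actions in it are hypothetical placeholders, and a plain dictionary lookup stands in for the rule-based or SPARQL analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    fcr: str           # Fault Containment Region the context event refers to
    failure_mode: str  # failure mode observed for that FCR


# Hypothetical knowledge base: (FCR, failure mode) -> recommended actions.
# All entries are illustrative placeholders, not taken from universAAL.
KNOWLEDGE_BASE = {
    ("ContextBus", "omission"): ["restart publisher", "notify administrator"],
    ("Sensor", "value"): ["ignore reading", "switch to redundant sensor"],
}


def diagnose(symptom):
    """Build the diagnosis information that would be published for a symptom."""
    actions = KNOWLEDGE_BASE.get((symptom.fcr, symptom.failure_mode))
    return {
        "fcr": symptom.fcr,
        "failure_mode": symptom.failure_mode,
        # An unknown case is a candidate for a new knowledge-base entry.
        "diagnosed": actions is not None,
        "actions": list(actions) if actions is not None else [],
    }
```

In the framework described above, the returned dictionary would correspond to the context event that the reasoner publishes back to the context bus.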
|
|
|
|
|
|
|
|
==Artefact #2 : Error Detection Unit ==
|
|
== Error Detection Unit ==
|
|
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
In a highly distributed system such as an AAL environment, where a large number of hardware and software components contribute to serving a certain scenario, the probability of fault occurrence is significant. Some of the provided services are critical and need to be served with relatively high reliability and availability, i.e. the cooperating components should provide at least a degraded level of the service even in the presence of faults. To tolerate faults in such systems, three interrelated phases should be followed:
|
|
In highly distributed systems, where a large number of hardware and software components contribute to serving a certain scenario, the probability of fault occurrence is significant. Some of the provided services are critical and need to be served with relatively high reliability and availability, i.e. the cooperating components should provide at least a degraded level of the service even in the presence of faults. To tolerate faults in such systems, three interrelated phases should be followed:
|
|
|
|
|
#Fault detection.
|
|
|
|
|
#Fault diagnosis.
|
|
|
|
|
#Fault masking and recovery.
|
| ... | @@ -420,16 +420,6 @@ Because of its importance in fault tolerance operation, an Error detection frame |
... | @@ -420,16 +420,6 @@ Because of its importance in fault tolerance operation, an Error detection frame |
|
|
|-
|
|
|-
|
|
|
| GIT Address
|
|
| GIT Address
|
|
|
| http://github.com/universAAL/context/tree/master/ctxt.error.detection.unit
|
|
| http://github.com/universAAL/context/tree/master/ctxt.error.detection.unit
|
|
|
|-
|
|
|
|
|
| Javadoc
|
|
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|
| Design Diagrams
|
|
|
|
|
| [https://raw.githubusercontent.com/wiki/universAAL/middleware/Physical_distribution_of_EDU.png Physical Distribution of EDU], [https://raw.githubusercontent.com/wiki/universAAL/middleware/Conceptual_model_of_EDU.png Conceptual Model of EDU], [https://raw.githubusercontent.com/wiki/universAAL/middleware/_Data_structure_in_EDU.png Data Structure in EDU], [https://raw.githubusercontent.com/wiki/universAAL/middleware/_Event_list_calendar.png Event List Calendar]
|
|
|
|
|
|-
|
|
|
|
|
| Reference Documentation
|
|
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|
|}
|
|
|}
|
|
|
|
|
|
|
|
|
|
=== Features ===
|
| ... | @@ -476,12 +466,12 @@ The principle of Error detection by using message classification is introduced f |
... | @@ -476,12 +466,12 @@ The principle of Error detection by using message classification is introduced f |
|
|
|
|
==== Conceptual model of Error detection unit ====
|
|
|
|
|
The next figure depicts a simple network that consists of several universAAL-aware communication nodes. The EDU (Error Detection Unit) is realized in each universAAL node as a separate software component located between the middleware and the application layer. The EDU is not application specific, but it uses some functions of the underlying operating system to ensure its predictable behavior. However, the EDU should be configured by the application developer to meet the specification of the application.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Physical_distribution_of_EDU.png|600px|center]]
|
|
[[Physical_distribution_of_EDU.png]]
|
|
|
|
|
|
|
|
|
|
The EDU has been designed to handle only events received by uAAL-Components. Whenever a uAAL-Component receives a new event from the context bus, it can deliver this event to the EDU, which checks it against several fault types that the uAAL-Component itself must predefine at design time. Locating the EDU on the receiving node lets it monitor the status of the sender by analyzing its messages. Two design options were available: placing the EDU on the sending side or on the receiving side. In some situations it is difficult for the sending node to judge itself. Suppose, for instance, that the sending node has lost synchronization with the system because of a drift in its oscillator. In this case, it would be unreasonable to trust the node's own decision on whether the message timing is correct.
|
|
|
|
|
As mentioned in previous sections, the EDU relies on the message classification concept to detect anomalies in the received messages. The next figure shows the flow of a received message inside the EDU and how the message classification concept is realized inside it.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Conceptual_model_of_EDU.png|500px|center]]
|
|
[[Conceptual_model_of_EDU.png]]
|
|
|
|
|
|
|
|
|
|
First, the incoming message must pass the syntactic check, which tests whether the received message is valid, that is, whether it has already been configured by the user. If not, the message is dropped and does not proceed to the other checks; at the same time, an indication is sent to the diagnostic unit to report the invalid message. If the message is valid, a time check verifies its timing. Depending on the timing behavior of the message (e.g. periodic or sporadic), different time check algorithms may be required. Timely messages must finally pass the semantic check to make sure that the received message is error free. Different software methods are available for checking message semantics. Some of them are not application specific and can be applied generally, such as the limit check and the 1st derivative check, while others, such as the plausibility check and the process-model-based check, require more information about the application.
|
|
|
|
|
If the message is dropped at any of these check points, an indication is sent to the diagnostic unit so that it can take a suitable decision. To decide accurately, however, the diagnostic unit needs accurate information about the caught anomaly from the error detection unit. This information should contain the error type, location and time.
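The order of the three checks and the content of the indication can be sketched in Python as follows. This is only an illustration under stated assumptions: `ErrorReport`, `check_message`, the message fields (`id`, `sender`, `arrival`, `value`) and the example `CONFIG` entry are hypothetical names chosen for this sketch, not the EDU API.

```python
import time
from dataclasses import dataclass


@dataclass
class ErrorReport:
    error_type: str   # "syntactic", "timing" or "semantic"
    location: str     # sender whose message was dropped
    timestamp: float  # time at which the anomaly was detected


# Hypothetical configuration: message ID -> time and semantic predicates.
CONFIG = {
    101: {
        "is_timely": lambda m: abs(m["arrival"] - 10.0) <= 0.5,
        "is_plausible": lambda m: 0 <= m["value"] <= 100,
    },
}


def check_message(msg, config, now=None):
    """Run the checks in order; return None on success, else an ErrorReport."""
    now = time.time() if now is None else now
    spec = config.get(msg["id"])
    if spec is None:                   # syntactic check: message not configured
        return ErrorReport("syntactic", msg["sender"], now)
    if not spec["is_timely"](msg):     # time check
        return ErrorReport("timing", msg["sender"], now)
    if not spec["is_plausible"](msg):  # semantic check
        return ErrorReport("semantic", msg["sender"], now)
    return None
```

A message stops at the first failed check, so the report always names the earliest stage at which the anomaly was caught.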
|
| ... | @@ -499,7 +489,8 @@ Because of that and to make the code more flexible, the main core of the EDU has |
... | @@ -499,7 +489,8 @@ Because of that and to make the code more flexible, the main core of the EDU has |
|
|
|
|
To classify messages inside the EDU, prior knowledge of the message specifications in both the time and the value domain is required. These specifications should be delivered by the application developer at design time, before the EDU is used. An XML configuration file has been created to make it easier for the developer to specify the messages. A parser function reads the information from the XML file and provides it to the main data structure of the EDU.
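As an illustration of such a configuration step, the following Python sketch parses a small XML document into a table keyed by message ID. The element and attribute names (`messages`, `message`, `periodic`, `sporadic`, `limit`, `maxInterarrival` and so on) are assumptions made for this example; the actual schema of the EDU configuration file is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file content; the real EDU schema may differ.
EXAMPLE_CONFIG = """
<messages>
  <message id="101">
    <periodic period="1000" phase="0"/>
    <limit low="0" high="100"/>
  </message>
  <message id="202">
    <sporadic maxInterarrival="5000"/>
  </message>
</messages>
"""


def parse_config(xml_text):
    """Map each message ID to a list of (check name, parameters) pairs."""
    table = {}
    for message in ET.fromstring(xml_text).findall("message"):
        # Each child element describes one check and its integer parameters.
        checks = [
            (check.tag, {key: int(value) for key, value in check.attrib.items()})
            for check in message
        ]
        table[int(message.get("id"))] = checks
    return table
```

The resulting dictionary has the same shape as the main data structure described in this section: one entry per message ID, each holding the ordered list of checks to apply.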
|
|
|
|
|
The data structure inside the EDU consists mainly of a hash table with the message ID as key and a list of check process structures as the value (see next figure). Each message has to pass the check points that were associated with it during the design phase. Consider a message with the message ID "101", as in the next figure. Message 101 is configured as a periodic message carrying an integer value that is tested against a certain threshold by the limit check and the 1st derivative check, so three check processes are applied to this message. Looking up the message ID in the hash table returns a pointer to the head of the check process list; in this case, the pointer refers to the periodic field. The periodic information of message 101, such as the period value and the phase value, is found in its structure instance (Periodic struct.). A periodic check function is then called to compare the stored time information with the time information extracted from the received message. If message 101 meets its time specification, it is considered timely; otherwise an untimely message indication may be given to the diagnostic unit.
|
|
|
|
|
When the periodic check function returns, the pointer advances to the second check process, and so on until the end of the check process list.
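The walk through the check process list for message 101 can be sketched in Python, with a dictionary playing the role of the hash table. The class names, parameter values and the `tolerance` parameter are assumptions for illustration only. The 1st derivative check is simplified here to the change in value between consecutive messages, and message IDs are assumed to have passed the syntactic check already.

```python
class PeriodicCheck:
    def __init__(self, period, phase, tolerance):
        self.period, self.phase, self.tolerance = period, phase, tolerance

    def run(self, msg):
        # Distance of the arrival time from the nearest instant phase + k * period.
        offset = (msg["arrival"] - self.phase) % self.period
        return min(offset, self.period - offset) <= self.tolerance


class LimitCheck:
    def __init__(self, low, high):
        self.low, self.high = low, high

    def run(self, msg):
        return self.low <= msg["value"] <= self.high


class FirstDerivativeCheck:
    def __init__(self, max_step):
        self.max_step = max_step
        self.previous = None  # value of the previous message, if any

    def run(self, msg):
        previous, self.previous = self.previous, msg["value"]
        return previous is None or abs(msg["value"] - previous) <= self.max_step


# Hash table: message ID -> ordered list of check processes (times in ms).
CHECK_TABLE = {
    101: [PeriodicCheck(period=1000, phase=0, tolerance=50),
          LimitCheck(low=0, high=100),
          FirstDerivativeCheck(max_step=10)],
}


def run_checks(msg):
    """Return the name of the first failed check, or None if all checks pass."""
    for check in CHECK_TABLE[msg["id"]]:
        if not check.run(msg):
            return type(check).__name__
    return None
```

Iterating over the list corresponds to advancing the pointer from one check process to the next; the loop ends either at the first failure or at the end of the list.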
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/_Data_structure_in_EDU.png| 500px| center]]
|
|
|
|
|
|
[[_Data_structure_in_EDU.png]]
|
|
|
|
|
|
|
|
|
|
==== Fault detection mechanism in time domain ====
|
|
|
|
|
To cover the fault hypothesis in the time domain, two types of messages are distinguished according to their timing behavior:
|
| ... | @@ -520,7 +511,7 @@ Time schedule sporadic = previous arrival time + max interarriaval time |
... | @@ -520,7 +511,7 @@ Time schedule sporadic = previous arrival time + max interarriaval time |
|
|
|
|
As can be seen, the next scheduled point in time for a sporadic message depends directly on the previous point in time at which it was received. For this reason, a static list data structure has been created inside the sporadic check function to keep the previous timestamps of the different sporadic messages from being lost. In the same manner, if previous data are required within the current test function, a static list may be generated in which each element contains the message ID field and the required previous data.
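A minimal Python sketch of such a sporadic check is given here. It assumes the rule that the scheduled time equals the previous arrival time plus the maximum interarrival time. The module-level dictionary `_previous_arrival` plays the role of the static list, and the function name and parameters are hypothetical.

```python
# Message ID -> arrival time of the previous message (the "static list").
_previous_arrival = {}


def sporadic_check(msg_id, arrival, max_interarrival):
    """Return True if the message arrived no later than its scheduled time."""
    previous = _previous_arrival.get(msg_id)
    _previous_arrival[msg_id] = arrival  # remember for the next message
    if previous is None:
        return True  # first message: no previous arrival to compare against
    scheduled = previous + max_interarrival
    return arrival <= scheduled
```

Because the stored timestamp is updated on every call, each message is judged only against its immediate predecessor.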
|
|
|
|
|
The calendar re-arranges itself dynamically after each message arrival in such a way that the earliest schedule time occupies the head position of the list.
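One way to keep the earliest scheduled time at the head is a priority queue. The Python sketch below uses a binary heap from the standard `heapq` module instead of a re-sorted list, which is a substitution made for this example; the `EventCalendar` class and its method names are hypothetical.

```python
import heapq


class EventCalendar:
    def __init__(self):
        self._heap = []     # (scheduled_time, msg_id) pairs, earliest first
        self._current = {}  # msg_id -> latest scheduled time for that message

    def reschedule(self, msg_id, scheduled_time):
        """Record the next scheduled time of a message after its arrival."""
        self._current[msg_id] = scheduled_time
        heapq.heappush(self._heap, (scheduled_time, msg_id))

    def head(self):
        """Return the earliest (scheduled_time, msg_id) pair, or None if empty."""
        # Discard entries that were superseded by a later reschedule.
        while self._heap and self._current[self._heap[0][1]] != self._heap[0][0]:
            heapq.heappop(self._heap)
        return self._heap[0] if self._heap else None
```

Outdated entries are removed lazily when the head is read, so rescheduling a message does not require searching the calendar for its old entry.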
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/_Event_list_calendar.png| 400px| center]]
|
|
[[_Event_list_calendar.png]]
|
|
|
|
|
|
|
|
|
|
==== Semantic fault detection mechanism ====
|
|
|
|
|
While ensuring deterministic behavior of both the middleware and the communication infrastructure helps considerably in classifying faults in the time domain, this is not the case when a sensor or actuator deviates from its normal operation. Catching an error from the message semantics is more complicated. However, a wide variety of methods have already been introduced to detect anomalies in a process. These methods may be classified as done by Isermann in <ref>Isermann, Rolf. Fault Diagnosis System. Heidelberg : Springer, 2006.</ref>.
|
| ... | |
... | |
| ... | | ... | |